CN104899230A - Public opinion hotspot automatic monitoring system - Google Patents


Info

Publication number
CN104899230A
CN104899230A (application CN201410084317.8A)
Authority
CN
China
Prior art keywords: word, theme, text, news, report
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410084317.8A
Other languages
Chinese (zh)
Inventor
李臻
纪敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Boson Data Technology Co Ltd
Original Assignee
Shanghai Boson Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Boson Data Technology Co Ltd filed Critical Shanghai Boson Data Technology Co Ltd
Priority to CN201410084317.8A
Publication of CN104899230A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a public opinion hotspot automatic monitoring system comprising a Chinese automatic word segmentation module and a feature extraction module. The Chinese automatic word segmentation module comprises an automatic word segmentation basic algorithm unit, an unknown-word recognition unit, and a segmentation ambiguity elimination unit; the feature extraction module comprises a feature representation unit and a vector space model unit. Automatic hotspot monitoring is the key link of the system: by discovering public opinion hotspots automatically, government users can promptly and quickly understand and grasp the current hot topics on the Internet, which greatly aids the comprehensive monitoring of Internet public opinion.

Description

Public opinion hotspot automatic monitoring system
Technical field
The present invention relates to the acquisition and use of Internet public opinion information, and in particular to a public opinion hotspot automatic monitoring system.
Background art
At present, our ability to acquire and use Internet public opinion information still falls well short of practical requirements, which is mainly manifested in the following:
1. Insufficient capability to acquire Internet information. Faced with Internet data that is complex, diverse in type, and enormous in volume, existing approaches cannot comprehensively, quickly, and efficiently find and obtain the information actually needed. The data obtained by traditional methods is small in total volume, narrow in coverage, and single-sourced, which severely limits effective work;
2. Insufficient capability to mine and process Internet information. For the Internet data that is obtained, current practice cannot carry out the data mining required by real work: tracing the context and causes of events, finding the internal relations between actors, discovering social hotspots in time, and predicting how events will develop;
3. Lack of a suitable Internet public opinion monitoring and analysis system. No application system has yet been established that meets working needs, can process massive Internet data, can discover network public opinion hotspots in time, and can support preparatory work for response and handling.
In order to maintain social and political stability and strengthen Internet governance, organizing efforts to carry out Internet public opinion monitoring and analysis has become a practical problem that government departments urgently need to solve. Solving it requires an intelligent public opinion monitoring and analysis system that performs automatic, real-time monitoring and analysis of massive Internet public opinion, thereby effectively replacing the traditional manual approach that government departments find difficult to sustain. Such a system needs to integrate Internet technology with intelligent information processing, automatically capture and analyze massive information from domestic and overseas Internet sources, meet the information requirements of network public opinion hotspot monitoring and analysis, and dynamically provide an analytical basis for the government to grasp public opinion comprehensively.
Carrying out research on an Internet public opinion monitoring and analysis system for government departments, and building a Web mining application system that serves real work, is therefore of great importance and urgency.
In summary, given the deficiencies of the prior art, a public opinion hotspot automatic monitoring system is particularly needed.
Summary of the invention
The object of the present invention is to provide a public opinion hotspot automatic monitoring system that remedies the deficiencies of the prior art.
The technical solution adopted by the present invention to solve this technical problem is as follows:
A public opinion hotspot automatic monitoring system, comprising a Chinese automatic word segmentation module and a feature extraction module;
the Chinese automatic word segmentation module comprises an automatic word segmentation basic algorithm unit, an unknown-word recognition unit, and a segmentation ambiguity elimination unit;
the feature extraction module comprises a feature representation unit and a vector space model unit;
The steps of the automatic monitoring method of the system are as follows:
1. One report is read from the data sources: the system continuously monitors multiple Internet news data sources, automatically captures news reports from the network, and parses out the time, title, body text, and other information of each report; if no time can be found in the report, the crawl time is used instead;
Because there is considerable duplication among the multiple data sources, each newly captured news report is de-duplicated according to its text content: if the degree of duplication between the new report and a previously processed news report exceeds the duplication threshold θd, it is treated as a duplicate report; the duplication threshold θd set in this embodiment is 0.9;
Because the scope of news reports is very broad, the reports are classified by a method combining source-based rule classification with content-based automatic classification: rule classification uses the news source, author, and similar attributes, while content-based automatic classification uses the vector space model (VSM) and the support vector machine (SVM) algorithm to classify news reports automatically according to their content and title; the processing of steps 2 to 7 is then carried out within the assigned category c;
2. Using a centroid comparison strategy, the report is compared with the existing news topics monitored in category c, taking temporal features and content features into account simultaneously; the similarity between the report and each topic is computed, the maximum similarity Smax and the most similar topic Es are recorded, and the topic closest to the current report is thereby determined. A topic is represented by the several feature words with the highest combined weight across all the news it contains; the similarity between a news report and a topic is computed, under the vector space model, as the cosine of the angle between the two vectors, with the title of the report given a higher weight;
3. According to the maximum similarity Smax and the most similar topic Es obtained in step 2, the current report is handled as follows (a sketch of this decision logic follows the step list below):
A. if Smax is less than the novelty threshold θn (0.25 in this embodiment): create a new topic in the category of the report;
B. if Smax is greater than θn and less than the clustering threshold θc (0.30 in this embodiment): do nothing and return to step 1;
C. if Smax is greater than θc and less than the contribution threshold θt (0.35 in this embodiment): assign the report to the current topic;
D. if Smax is greater than θt: assign the report to topic Es and adjust Es;
The values of Smax, θn, θc, and θt are all greater than 0 and less than or equal to 1;
4. After a user-defined fixed number of new reports have been processed in a category, the news topics in that category are compared pairwise; if the similarity of two topics exceeds the merge threshold θu, they are merged. The inter-topic similarity can be computed with the method used for two clusters in traditional clustering algorithms, considering the pairwise similarities between all news reports in the two topics:
Sim(E_1, E_2) = \frac{\sum_{d_i \in E_1} \sum_{d_j \in E_2} \mathrm{sim}(d_i, d_j)}{|E_1| \cdot |E_2|}
where E1 and E2 are the two monitored news topics, di and dj are news reports in E1 and E2 respectively, sim(di, dj) is the similarity between two news reports, and |E1| and |E2| are the numbers of news reports contained in the two topics;
5. After the user-defined fixed number of new reports have been processed in a category, the news reports within each topic are pruned: the similarity between each report and its topic is recomputed, reports whose similarity falls below the clustering threshold θc or that no longer satisfy the constraints are eliminated, and the topic's internal representation and weights are then recomputed;
6. If the number of topics in the current category exceeds the topic window size, all news topics in the category are ranked: combining the temporal characteristics and report counts of the news topics, a score is computed for each topic and the topics are sorted by it. Several different rankings are considered when computing scores, covering the most recent 12 hours, 1 day, 3 days, 7 days, and 30 days; a topic is eliminated only when it falls outside the topic window in every ranking. The multiple rankings thus give the user information references at different granularities, and eliminating the topics outside the topic window improves the processing efficiency of the system;
7. Monitoring results are output as required by the user: a description is computed for every current topic in the category; at the same time, combining the temporal characteristics of each topic with the number of reports it contains, the highest-scoring news topics are selected from each category as that category's hottest news topics, and their descriptions and the lists of reports they contain are output. A topic description is generated as follows:
A. read the feature words with the highest internal weights of the topic;
B. among the reports whose similarity to the topic exceeds the topic threshold θe, select the title of the most recent report; the topic threshold may also be set proportionally;
C. combine A and B and output the description of the topic.
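The decision logic of step 3 can be summarized in a minimal sketch; the names Action and decide are illustrative assumptions rather than part of the claimed system, and the default thresholds are the embodiment values θn = 0.25, θc = 0.30, θt = 0.35.

// Minimal sketch of the step-3 decision (illustrative names, embodiment thresholds).
enum class Action { CreateNewTopic, Ignore, JoinCurrentTopic, JoinBestTopicAndAdjust };

// thetaN < thetaC < thetaT, all in (0, 1]; sMax is the maximum topic similarity.
Action decide(double sMax,
              double thetaN = 0.25,   // novelty threshold
              double thetaC = 0.30,   // clustering threshold
              double thetaT = 0.35) { // contribution threshold
    if (sMax < thetaN) return Action::CreateNewTopic;       // case A: new topic
    if (sMax < thetaC) return Action::Ignore;               // case B: do nothing
    if (sMax < thetaT) return Action::JoinCurrentTopic;     // case C: current topic
    return Action::JoinBestTopicAndAdjust;                  // case D: topic Es
}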
Further, the automatic word segmentation basic algorithm unit includes the maximum matching method, the full segmentation algorithm, and the probability multiplication algorithm;
The maximum matching method is a mechanical word segmentation algorithm: it does not consider the association between words, and simply looks up, by length, the character strings occurring in the sentence in a dictionary; this class of methods is mature, with the maximum matching method (Maximum Matching, MM) as the most representative;
The idea of the maximum matching method is: take a character string of maximum length (6 in this system) from the input stream and look it up in the dictionary; if it matches, output it and take the next string; otherwise backtrack and continue searching with a shorter string, down to length 1, at which point the input position is advanced by one character; the process repeats until the input stream is exhausted;
The full segmentation algorithm is an algorithm without segmentation blind spots: it enumerates every segmentation of the sentence that is consistent with the dictionary; adopting this algorithm then raises the problem of choosing the optimal segmentation;
The probability multiplication algorithm is a statistics-based method that uses the co-occurrence of characters and of words as the basis of segmentation; its advantage is that it is not restricted to a particular application, nor confined to a pre-built segmentation dictionary; the method requires large-scale training text to estimate the model parameters;
The choice of training text also has a significant effect on the segmentation result;
Let S = s1 s2 … sm be the Chinese character string to be segmented, and suppose S has n possible segmentations, the i-th being Wi = w1 w2 … wk, i = 1…n;
Let P(W|S) be the probability that the character string S is segmented into W; the statistics-based segmentation method then selects, among the n segmentations of S, the one with maximum probability,
i.e. P(W|S) = max(P(W1|S), P(W2|S), …, P(Wn|S)); P(W|S) is called the evaluation function;
By Bayes' formula, P(W|S) = P(W)·P(S|W)/P(S); for the different segmentations of a given S, P(S) is a constant, and P(S|W), the probability of the sentence S given the word string W, equals 1; therefore P(W|S) ∝ P(W).
Further, the unknown-word recognition unit is evaluated by two performance indices:
1. Recall: the ratio of the number of unknown words of a given type that are recognized to the total number of unknown words of that type in the text;
2. Accuracy (precision): the ratio, among the recognized unknown words, of the number that actually belong to that type to the total number of recognized unknown words.
Further, the segmentation ambiguity elimination unit of Chinese automatic word segmentation: Chinese word segmentation is a process of understanding that combines lexical, syntactic, semantic, and other information; automatic word segmentation and the use of this information are interconnected and mutually constraining, and purely mechanical segmentation inevitably introduces segmentation ambiguity;
Segmentation ambiguity means that some span of a Chinese sentence, if matched purely against the vocabulary, admits more than one segmentation; a character string containing such ambiguity is called an ambiguous field. Segmentation ambiguity is a difficult point in Chinese word segmentation research, and it has three basic types:
1. Overlapping (crossing) ambiguity: the character string ABC can be segmented either as AB/C or as A/BC, i.e. both AB and BC are words;
2. Combinational ambiguity: the character string AB can be segmented either as AB or as A/B, i.e. AB is a word and A and B are also words;
3. Mixed ambiguity: produced by nesting the previous two forms within themselves or by crossing combinations of the two;
There are two main approaches to resolving segmentation ambiguity: rule-based methods and statistics-based methods.
Further, the feature representation unit normally maintains three dictionaries, a main dictionary, a synonym dictionary, and a containing-word dictionary, and performs word frequency statistics at the same time. When performing word frequency statistics and feature extraction, the subject entry in the main dictionary is used as the representative entry, and its term frequency is computed as:
Tf = TMf + \sum_{i=1}^{m_1} TTf_i + e \sum_{i=1}^{m_2} TIf_i
that is, the frequency of a representative entry in a document is obtained by the weighted accumulation of three parts: the subject entry, its synonym entries, and its containing-word entries;
where:
Tf: the term frequency of the subject entry f;
TMf: the term frequency weight of the subject entry f;
TTf_i: the term frequency weight of synonym entry f_i, with m_1 synonyms in total;
TIf_i: the term frequency weight of containing-word entry f_i, with m_2 containing words in total;
e: a weighting coefficient.
In the VSM, a text document is regarded as composed of a set of terms (T1, T2, …, Tn), each assigned a weight Wi, so every document is mapped to a vector in the vector space spanned by these term vectors, and text matching is converted into vector matching in that space.
Further, the basic idea of the vector space model unit is to represent a text as a vector (w1, w2, …, wn), where wi is the weight of the i-th feature item. As for what to choose as feature items, characters, words, or phrases can generally be used; experimental results generally show that words are better feature items than characters or phrases. Therefore, to represent a text as a vector in the vector space, the text is first segmented into words, and these words serve as the dimensions of the vector. The earliest representation used pure 0/1 values: if a word occurs in the text, the corresponding dimension of the text vector is 1, otherwise 0. Since this cannot reflect how strongly a word acts in the text, the 0/1 values were gradually replaced by the more precise term frequency. Term frequency can be absolute or relative: absolute term frequency represents the text directly by the number of times each word occurs in it, while relative term frequency is a normalized frequency, mainly computed with a TF-IDF formula. Many TF-IDF formulas exist; this system adopts a common one:
W(t,d) = \frac{tf(t,d) \times \log(N/n_i)}{\sqrt{\sum_{t \in d} \left[ tf(t,d) \times \log(N/n_i) \right]^2}}
where W(t,d) is the weight of word t in text d, tf(t,d) is the frequency of word t in text d, N is the total number of training texts, n_i is the number of training texts in which t occurs, and the denominator is the normalization factor;
Other TF-IDF formulas also exist, for example:
W(t,d) = \frac{(1+\log_2 tf(t,d)) \times \log_2(N/n_i)}{\sqrt{\sum_{t \in d} \left[ (1+\log_2 tf(t,d)) \times \log_2(N/n_i) \right]^2}}
The parameters in this formula have the same meaning as above;
The TF-IDF vector reflects the term space of the training document collection: each vector component corresponds to a term, and the size of the component characterizes that term's ability to distinguish document content. The more widely a term occurs across the document collection, the weaker its ability to distinguish documents; conversely, the more frequently it occurs within a specific document, the stronger its ability to distinguish that document's content. This representation treats a document as a bag of words: all words are extracted from the document, while word order and text structure are discarded.
The advantage of the invention is that automatic hotspot monitoring is the key link of the system. Automatic discovery of public opinion hotspots allows government users to understand and grasp the current hot topics on the Internet quickly and in a timely manner, which greatly aids comprehensive monitoring of network public opinion. The information monitoring analysis produces two kinds of output: one is "followed information" generated from the topics of interest configured by the user; the other is "hot information" obtained by monitoring, for each category, indicators configured by the user such as keywords, changes in page view counts, and changes in forum reply counts, and then combining the hotspots of each category. Each item of followed or hot information can provide the original title, source, time, hit count, and a brief summary, and the public opinion monitoring analysis system generates hotspot alerts in the format actually required by the user.
Brief description of the drawings
The present invention is described in detail below with reference to the drawings and specific embodiments:
Fig. 1 is a flow chart of the public opinion hotspot automatic monitoring method proposed by the present invention.
Detailed description of the embodiments
To make the technical means, creative features, objectives, and effects of the present invention easy to understand, the invention is further explained below with reference to the drawings and specific embodiments.
The public opinion hotspot automatic monitoring system proposed by the present invention includes a Chinese automatic word segmentation module and a feature extraction module;
The prerequisite of text mining is Chinese automatic word segmentation. Written Chinese takes the Chinese character as its smallest unit, but in natural language understanding the word is the smallest meaningful processing unit. Chinese automatic word segmentation is the task of transforming a character string without word-boundary marks into a word string that conforms to linguistic reality, i.e. establishing word boundaries in written Chinese. Thus Chinese natural language understanding, including Chinese-foreign machine translation, must first pass through the unavoidable stage of automatic word segmentation. Chinese automatic word segmentation is not only a necessary link of machine translation, but also foundational work for all kinds of Chinese information processing, including speech processing, word frequency statistics, subject-term indexing, automatic summarization, information retrieval, and Chinese parsing.
Automatic segmentation of modern Chinese text is the basis of Chinese information processing. Chinese text has no explicit word boundary markers like the spaces in English; the task of Chinese automatic word segmentation is to have the machine automatically insert the boundaries between words in Chinese text.
The main problems of Chinese word segmentation research include:
1. The segmentation standard: determining what counts as a word and what can serve as a segmentation unit.
2. The segmentation algorithm: how to perform the segmentation so that the resulting word boundaries have realistic meaning.
3. Disambiguation: which methods to use to eliminate segmentation ambiguity.
4. Unknown-word recognition: how to recognize words not in the dictionary, such as place names, personal names, and transliterated names.
The Chinese automatic word segmentation module includes an automatic word segmentation basic algorithm unit, an unknown-word recognition unit, and a segmentation ambiguity elimination unit;
The feature extraction module includes a feature representation unit and a vector space model unit;
The steps of the automatic monitoring method of the system are as follows:
1. One report is read from the data sources: the system continuously monitors multiple Internet news data sources, automatically captures news reports from the network, and parses out the time, title, body text, and other information of each report; if no time can be found in the report, the crawl time is used instead;
Because there is considerable duplication among the multiple data sources, each newly captured news report is de-duplicated according to its text content: if the degree of duplication between the new report and a previously processed news report exceeds the duplication threshold θd, it is treated as a duplicate report; the duplication threshold θd set in this embodiment is 0.9;
Because the scope of news reports is very broad, the reports are classified by a method combining source-based rule classification with content-based automatic classification: rule classification uses the news source, author, and similar attributes, while content-based automatic classification uses the vector space model (VSM) and the support vector machine (SVM) algorithm to classify news reports automatically according to their content and title; the processing of steps 2 to 7 is then carried out within the assigned category c;
2. Using a centroid comparison strategy, the report is compared with the existing news topics monitored in category c, taking temporal features and content features into account simultaneously; the similarity between the report and each topic is computed, the maximum similarity Smax and the most similar topic Es are recorded, and the topic closest to the current report is thereby determined. A topic is represented by the several feature words with the highest combined weight across all the news it contains; the similarity between a news report and a topic is computed, under the vector space model, as the cosine of the angle between the two vectors, with the title of the report given a higher weight;
3. According to the maximum similarity Smax and the most similar topic Es obtained in step 2, the current report is handled as follows:
A. if Smax is less than the novelty threshold θn (0.25 in this embodiment): create a new topic in the category of the report;
B. if Smax is greater than θn and less than the clustering threshold θc (0.30 in this embodiment): do nothing and return to step 1;
C. if Smax is greater than θc and less than the contribution threshold θt (0.35 in this embodiment): assign the report to the current topic;
D. if Smax is greater than θt: assign the report to topic Es and adjust Es;
The values of Smax, θn, θc, and θt are all greater than 0 and less than or equal to 1;
4. After a user-defined fixed number of new reports have been processed in a category, the news topics in that category are compared pairwise; if the similarity of two topics exceeds the merge threshold θu, they are merged. The inter-topic similarity can be computed with the method used for two clusters in traditional clustering algorithms, considering the pairwise similarities between all news reports in the two topics (a sketch of this pairwise comparison follows the step list below):
Sim(E_1, E_2) = \frac{\sum_{d_i \in E_1} \sum_{d_j \in E_2} \mathrm{sim}(d_i, d_j)}{|E_1| \cdot |E_2|}
where E1 and E2 are the two monitored news topics, di and dj are news reports in E1 and E2 respectively, sim(di, dj) is the similarity between two news reports, and |E1| and |E2| are the numbers of news reports contained in the two topics;
5. After the user-defined fixed number of new reports have been processed in a category, the news reports within each topic are pruned: the similarity between each report and its topic is recomputed, reports whose similarity falls below the clustering threshold θc or that no longer satisfy the constraints are eliminated, and the topic's internal representation and weights are then recomputed;
6. If the number of topics in the current category exceeds the topic window size, all news topics in the category are ranked: combining the temporal characteristics and report counts of the news topics, a score is computed for each topic and the topics are sorted by it. Several different rankings are considered when computing scores, covering the most recent 12 hours, 1 day, 3 days, 7 days, and 30 days; a topic is eliminated only when it falls outside the topic window in every ranking. The multiple rankings thus give the user information references at different granularities, and eliminating the topics outside the topic window improves the processing efficiency of the system;
7. Monitoring results are output as required by the user: a description is computed for every current topic in the category; at the same time, combining the temporal characteristics of each topic with the number of reports it contains, the highest-scoring news topics are selected from each category as that category's hottest news topics, and their descriptions and the lists of reports they contain are output. A topic description is generated as follows:
A. read the feature words with the highest internal weights of the topic;
B. among the reports whose similarity to the topic exceeds the topic threshold θe, select the title of the most recent report; the topic threshold may also be set proportionally;
C. combine A and B and output the description of the topic.
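The pairwise topic comparison of step 4 can be illustrated with a minimal sketch, assuming that each report is already represented by its feature vector and that sim(di, dj) is the cosine similarity of step 2; the names Vec, cosineSim, and topicSimilarity are illustrative assumptions.

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative type: a report is represented by its feature vector.
using Vec = std::vector<double>;

// Cosine similarity of two report vectors (assumed to have the same dimensionality).
double cosineSim(const Vec& a, const Vec& b) {
    double dot = 0, na = 0, nb = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return (na == 0 || nb == 0) ? 0.0 : dot / (std::sqrt(na) * std::sqrt(nb));
}

// Sim(E1, E2): average pairwise similarity between all reports of the two topics.
double topicSimilarity(const std::vector<Vec>& e1, const std::vector<Vec>& e2) {
    if (e1.empty() || e2.empty()) return 0.0;
    double sum = 0;
    for (const Vec& di : e1)
        for (const Vec& dj : e2)
            sum += cosineSim(di, dj);
    return sum / (static_cast<double>(e1.size()) * e2.size());
}
// Topic pairs whose similarity exceeds the merge threshold θu would then be merged.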
Further, methods of Chinese automatic word segmentation can be classified by different criteria: by whether a segmentation dictionary is used, into dictionary-based and dictionary-free segmentation; by the knowledge resources used in the segmentation process, into rule-based methods and statistics-based methods. Dictionary-based segmentation is the mainstream of Chinese automatic word segmentation, and its basic algorithm is the maximum matching method. Statistics-based research is increasing and is being combined with rule-based methods; the system described here adopts a method that combines statistics with rules.
Rule-based methods generally require a manually built segmentation dictionary. During segmentation the text is matched against this dictionary piece by piece, and the word strings that match dictionary entries constitute the segmentation result. The main variants include forward maximum matching, reverse maximum matching, bidirectional scanning, character-by-character traversal matching, segmentation mark methods, and forward and reverse best matching.
Statistics-based methods use the co-occurrence of characters and of words as the basis of segmentation and need no pre-built segmentation dictionary. They require a large training corpus to estimate the model parameters, and both training and actual segmentation generally demand a larger amount of computation.
The automatic word segmentation basic algorithm unit includes the maximum matching method, the full segmentation algorithm, and the probability multiplication algorithm;
The maximum matching method is a mechanical word segmentation algorithm: it does not consider the association between words, and simply looks up, by length, the character strings occurring in the sentence in a dictionary; this class of methods is mature, with the maximum matching method (Maximum Matching, MM) as the most representative;
The idea of the maximum matching method is: take a character string of maximum length (6 in this system) from the input stream and look it up in the dictionary; if it matches, output it and take the next string; otherwise backtrack and continue searching with a shorter string, down to length 1, at which point the input position is advanced by one character; the process repeats until the input stream is exhausted;
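A minimal sketch of the forward maximum matching idea described above, assuming a simple set-based dictionary over code points; the function name maxMatchSegment and the UTF-32 representation are illustrative assumptions rather than the system's actual implementation.

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Forward maximum matching (illustrative sketch): 'dict' holds dictionary words,
// maxLen is the maximum candidate length (6 in this system).
std::vector<std::u32string> maxMatchSegment(const std::u32string& text,
                                            const std::unordered_set<std::u32string>& dict,
                                            std::size_t maxLen = 6) {
    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t len = std::min(maxLen, text.size() - pos);
        // Try the longest candidate first; backtrack to shorter candidates on failure.
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) {
            --len;
        }
        // At len == 1 the single character is emitted and the input advances one position.
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}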
The full segmentation algorithm is an algorithm without segmentation blind spots: it enumerates every segmentation of the sentence that is consistent with the dictionary; adopting this algorithm then raises the problem of choosing the optimal segmentation;
The probability multiplication algorithm is a statistics-based method that uses the co-occurrence of characters and of words as the basis of segmentation; its advantage is that it is not restricted to a particular application, nor confined to a pre-built segmentation dictionary; the method requires large-scale training text to estimate the model parameters, and the choice of training text also has a significant effect on the segmentation result;
Let S = s1 s2 … sm be the Chinese character string to be segmented, and suppose S has n possible segmentations, the i-th being Wi = w1 w2 … wk, i = 1…n;
Let P(W|S) be the probability that the character string S is segmented into W; the statistics-based segmentation method then selects, among the n segmentations of S, the one with maximum probability,
i.e. P(W|S) = max(P(W1|S), P(W2|S), …, P(Wn|S)); P(W|S) is called the evaluation function;
By Bayes' formula, P(W|S) = P(W)·P(S|W)/P(S); for the different segmentations of a given S, P(S) is a constant, and P(S|W), the probability of the sentence S given the word string W, equals 1; therefore P(W|S) ∝ P(W).
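The full segmentation and probability multiplication steps above can be combined in a minimal sketch: enumerate the dictionary-consistent segmentations of S and keep the one whose word-probability product is largest, since P(W|S) ∝ P(W). The unigram independence assumption P(W) ≈ Π p(wi), the floor probability for unseen single characters, and all names below are illustrative assumptions.

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// word -> unigram probability estimated from the training text (illustrative).
using Dict = std::unordered_map<std::u32string, double>;

static void search(const std::u32string& s, std::size_t pos, const Dict& probs,
                   std::size_t maxLen, double prod,
                   std::vector<std::u32string>& cur,
                   std::vector<std::u32string>& best, double& bestProd) {
    if (pos == s.size()) {                     // a complete segmentation W of S
        if (prod > bestProd) { bestProd = prod; best = cur; }
        return;
    }
    for (std::size_t len = 1; len <= maxLen && pos + len <= s.size(); ++len) {
        std::u32string w = s.substr(pos, len);
        auto it = probs.find(w);
        double p = (it != probs.end()) ? it->second
                                       : (len == 1 ? 1e-8 : 0.0);  // floor for unseen single chars
        if (p <= 0.0) continue;
        cur.push_back(w);
        search(s, pos + len, probs, maxLen, prod * p, cur, best, bestProd);
        cur.pop_back();
    }
}

// Pick the segmentation W maximizing P(W) ≈ product of unigram probabilities.
std::vector<std::u32string> bestSegmentation(const std::u32string& s, const Dict& probs) {
    std::vector<std::u32string> cur, best;
    double bestProd = -1.0;
    search(s, 0, probs, /*maxLen=*/6, 1.0, cur, best, bestProd);
    return best;
}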
Further, the unknown-word recognition unit is evaluated by two performance indices:
1. Recall: the ratio of the number of unknown words of a given type that are recognized to the total number of unknown words of that type in the text;
2. Accuracy (precision): the ratio, among the recognized unknown words, of the number that actually belong to that type to the total number of recognized unknown words.
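Stated as formulas (a restatement of the two definitions above):

\mathrm{Recall} = \frac{\text{recognized unknown words of the given type}}{\text{all unknown words of that type in the text}}, \qquad
\mathrm{Accuracy} = \frac{\text{recognized unknown words that actually belong to the type}}{\text{all recognized unknown words}}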
Further, the segmentation ambiguity elimination unit of Chinese automatic word segmentation: Chinese word segmentation is a process of understanding that combines lexical, syntactic, semantic, and other information; automatic word segmentation and the use of this information are interconnected and mutually constraining, and purely mechanical segmentation inevitably introduces segmentation ambiguity;
Segmentation ambiguity means that some span of a Chinese sentence, if matched purely against the vocabulary, admits more than one segmentation; a character string containing such ambiguity is called an ambiguous field. Segmentation ambiguity is a difficult point in Chinese word segmentation research, and it has three basic types:
1. Overlapping (crossing) ambiguity: the character string ABC can be segmented either as AB/C or as A/BC, i.e. both AB and BC are words;
2. Combinational ambiguity: the character string AB can be segmented either as AB or as A/B, i.e. AB is a word and A and B are also words;
3. Mixed ambiguity: produced by nesting the previous two forms within themselves or by crossing combinations of the two;
There are two main approaches to resolving segmentation ambiguity: rule-based methods and statistics-based methods.
Further, feature representation means representing a document by certain feature items (such as terms or descriptors); during text classification or clustering only these feature items need to be processed, so unstructured text can be handled. This is the processing step that converts unstructured data into structured data. Feature representation is the process of generalizing the commonalities and regularities of document classes; it is the core of a classification or clustering system, and the quality of the feature extraction algorithm directly affects the result of document classification or clustering.
There are several feature representation models; commonly used ones include the Boolean logic model, the probabilistic model, and the vector space model. This work adopts the widely used vector space model (Vector Space Model, VSM).
Its advantage is that text content is converted into a vector form that is easy to handle mathematically, making various similarity operations and rankings possible; it has therefore been widely applied, with good results, in text retrieval, text filtering, text summarization, and so on.
However, the basic assumption of the vector space model, that the terms are mutually independent (the orthogonality assumption), is difficult to satisfy in practice: the terms occurring in a text are often correlated, i.e. the dimensions are "oblique". The reason is the diversity of natural language; for example, "computing machine", "computer", and "electronic computer" all express the same concept, and extracting them as separate features would make the features less salient and the feature set excessively large. To address this diversity of language, the dictionaries are arranged as follows.
The feature representation unit normally maintains three dictionaries, a main dictionary, a synonym dictionary, and a containing-word dictionary, and performs word frequency statistics at the same time. When performing word frequency statistics and feature extraction, the subject entry in the main dictionary is used as the representative entry, and its term frequency is computed as follows (a computational sketch follows the definitions below):
Tf = TMf + \sum_{i=1}^{m_1} TTf_i + e \sum_{i=1}^{m_2} TIf_i
that is, the frequency of a representative entry in a document is obtained by the weighted accumulation of three parts: the subject entry, its synonym entries, and its containing-word entries;
where:
Tf: the term frequency of the subject entry f;
TMf: the term frequency weight of the subject entry f;
TTf_i: the term frequency weight of synonym entry f_i, with m_1 synonyms in total;
TIf_i: the term frequency weight of containing-word entry f_i, with m_2 containing words in total;
e: a weighting coefficient.
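A minimal computational sketch of the weighted term frequency above, assuming the per-document counts for a subject entry, its synonyms, and its containing words have already been collected; the struct EntryCounts, the function weightedTf, and the parameter e are illustrative assumptions.

#include <numeric>
#include <vector>

// Counts gathered for one subject entry in one document (illustrative).
struct EntryCounts {
    double subjectTf = 0;              // TMf: frequency weight of the subject entry
    std::vector<double> synonymTf;     // TTf_i, i = 1..m1
    std::vector<double> containingTf;  // TIf_i, i = 1..m2
};

// Tf = TMf + sum(TTf_i) + e * sum(TIf_i)
double weightedTf(const EntryCounts& c, double e) {
    double syn = std::accumulate(c.synonymTf.begin(), c.synonymTf.end(), 0.0);
    double inc = std::accumulate(c.containingTf.begin(), c.containingTf.end(), 0.0);
    return c.subjectTf + syn + e * inc;
}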
In the VSM, a text document is regarded as composed of a set of terms (T1, T2, …, Tn), each assigned a weight Wi; every document is thus mapped to a vector in the vector space spanned by these term vectors, and text matching becomes a vector matching problem in that space.
Characters, words, and phrases are the basic elements of a document; their frequency of occurrence in a document shows certain regularities, which makes them suitable as feature items. Different terms play different roles in a document:
1. Common words and rare words: common words (function words such as "and") occur very frequently in all documents, while rare words occur only a few times in the whole training collection; the word frequency statistics of both classes are hard to pin down, so they are unsuitable as feature items and should be filtered out.
2. Some words occur with essentially the same frequency in all documents; their discriminative power is poor, so they should also be filtered out rather than used as feature items.
3. Phrases: compared with single words, phrases have stronger expressive power and describe document content better, so phrases should be used as feature items wherever possible to improve the expressiveness of the feature items.
Further, the basic idea of the vector space model unit is to represent a text as a vector (w1, w2, …, wn), where wi is the weight of the i-th feature item. As for what to choose as feature items, characters, words, or phrases can generally be used; experimental results generally show that words are better feature items than characters or phrases. Therefore, to represent a text as a vector in the vector space, the text is first segmented into words, and these words serve as the dimensions of the vector. The earliest representation used pure 0/1 values: if a word occurs in the text, the corresponding dimension of the text vector is 1, otherwise 0. Since this cannot reflect how strongly a word acts in the text, the 0/1 values were gradually replaced by the more precise term frequency. Term frequency can be absolute or relative: absolute term frequency represents the text directly by the number of times each word occurs in it, while relative term frequency is a normalized frequency, mainly computed with a TF-IDF formula. Many TF-IDF formulas exist; this system adopts a common one:
W(t,d) = \frac{tf(t,d) \times \log(N/n_i)}{\sqrt{\sum_{t \in d} \left[ tf(t,d) \times \log(N/n_i) \right]^2}}
where W(t,d) is the weight of word t in text d, tf(t,d) is the frequency of word t in text d, N is the total number of training texts, n_i is the number of training texts in which t occurs, and the denominator is the normalization factor;
Other TF-IDF formulas also exist, for example:
W(t,d) = \frac{(1+\log_2 tf(t,d)) \times \log_2(N/n_i)}{\sqrt{\sum_{t \in d} \left[ (1+\log_2 tf(t,d)) \times \log_2(N/n_i) \right]^2}}
The parameters in this formula have the same meaning as above;
The TF-IDF vector reflects the term space of the training document collection: each vector component corresponds to a term, and the size of the component characterizes that term's ability to distinguish document content. The more widely a term occurs across the document collection, the weaker its ability to distinguish documents; conversely, the more frequently it occurs within a specific document, the stronger its ability to distinguish that document's content. This representation treats a document as a bag of words: all words are extracted from the document, while word order and text structure are discarded.
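A minimal sketch of the normalized TF-IDF weighting above, assuming the corpus statistics (the total number of training texts N and the document frequencies n_i) are precomputed; the names tfidfWeights and docFreq are illustrative assumptions.

#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Compute normalized TF-IDF weights for one document.
// tf     : word -> frequency of the word in this document
// docFreq: word -> number of training documents containing the word (n_i)
// N      : total number of training documents
std::unordered_map<std::string, double>
tfidfWeights(const std::unordered_map<std::string, double>& tf,
             const std::unordered_map<std::string, std::size_t>& docFreq,
             std::size_t N) {
    std::unordered_map<std::string, double> w;
    double normSq = 0.0;
    for (const auto& [word, f] : tf) {
        auto it = docFreq.find(word);
        double ni = (it != docFreq.end() && it->second > 0)
                        ? static_cast<double>(it->second) : 1.0;
        double x = f * std::log(static_cast<double>(N) / ni);  // tf(t,d) * log(N / n_i)
        w[word] = x;
        normSq += x * x;
    }
    double norm = std::sqrt(normSq);
    if (norm > 0)
        for (auto& kv : w) kv.second /= norm;  // cosine normalization (the denominator)
    return w;
}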
Flow description:
First, the algorithm excludes characters that appear in the stop-word list from being chosen as features;
Then it also excludes characters whose frequency in the document collection is very low. This is done by counting character frequencies over the document collection and choosing a suitable threshold, for example a minimum occurrence count of 5 for a candidate feature string. The feature extraction process scans the document collection several times; the number of passes is set manually and can be determined by the maximum length of the feature strings. The first pass collects into the feature string list every character that is not in the stop-word list and occurs frequently enough. For string lengths from 2 up to the maximum, multiple screening criteria are used to extract candidates. In each pass, every document is checked character by character with a sliding window queue to obtain candidate strings: for a symbol to enter the window it must be a valid character (not a digit or special symbol), must not appear in the stop-word list, and must belong to the current string set; otherwise the window is cleared and reset.
After the text has been segmented by the word segmentation program, stop words are first removed, numerals and names are merged, word frequencies are counted, and the text is finally expressed as the vector described above.
Feature extraction:
The feature vector obtained through the steps above has a very high dimensionality. Not all of these high-dimensional features are important or useful for the classification learning that follows, and a high dimensionality greatly increases the machine's learning time, while a much smaller feature subset can produce comparable classification results. This is the task of feature extraction: construct an evaluation function, score each feature, and select a predetermined number of features with the highest scores as the feature subset. Experiments show that the top 30 features (sorted by weight from high to low) generally account for more than 80% of the total feature magnitude, and features beyond the 80th have very little influence on the whole vector; therefore, for computational efficiency, the system simply selects the features with the top 50 weights to form the final feature vector (a selection sketch follows).
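A minimal sketch of keeping only the highest-weighted features, as described above; the container choice and the name topKFeatures are illustrative assumptions.

#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Keep the k features with the largest weights (k = 50 in the system described here).
std::vector<std::pair<std::string, double>>
topKFeatures(std::vector<std::pair<std::string, double>> features, std::size_t k = 50) {
    auto byWeightDesc = [](const auto& a, const auto& b) { return a.second > b.second; };
    if (features.size() > k) {
        std::partial_sort(features.begin(), features.begin() + k, features.end(), byWeightDesc);
        features.resize(k);
    } else {
        std::sort(features.begin(), features.end(), byWeightDesc);
    }
    return features;
}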
Feature extraction plays an important role in text classification: it reduces the dimensionality of the vector space, simplifies computation, and helps prevent over-fitting. Because the number of possible feature subsets grows exponentially with the number of features, enumerating them is practically impossible; we therefore assume the features are independent, so that subset selection reduces to selecting individual features: compute a score for each feature with a chosen feature evaluation function, sort by score, and select the highest-scoring features as the feature words. This is feature extraction.
The main purpose of feature selection is to reduce the number of words to be processed, and thus the dimensionality of the vector space, as much as possible without harming classification accuracy, thereby improving the speed and efficiency of classification. Feature selection therefore contributes little, if anything, to improving classification accuracy itself, and its effect differs across classifiers.
In text processing, evaluation functions commonly used for feature extraction include document frequency (Document Frequency), information gain (Information Gain), expected cross entropy (Expected Cross Entropy), mutual information (Mutual Information), the χ² statistic (CHI), weight of evidence for text (The Weight of Evidence for Text), and the odds ratio (Odds Ratio).
1. Document frequency (DF)
This is the simplest evaluation function; its value is the number of texts in the training set in which the word occurs. The DF criterion assumes that rare words may carry no useful information, may be too infrequent to affect classification, or may be noise, and can therefore be removed. DF is much cheaper to compute than the other evaluation functions, and in practice it works quite well. Its drawback is that a rare word may be frequent within a particular class of texts and may carry important discriminative information; discarding it outright may hurt classifier precision. Therefore DF is generally not used on its own in practice.
2. Information gain (IG)
Information gain is widely used in machine learning. It measures the information a feature provides by comparing the cases in which the feature does and does not appear in the text, and is defined as the difference in entropy before and after the feature appears in the text.
3. Expected cross entropy of word t in the text
Its only difference from information gain is that it does not consider the case where the word is absent.
4. Mutual information (MI)
In statistics, mutual information characterizes the correlation between two variables and is often used as a criterion in statistical models of text features and their applications.
5. χ² statistic (CHI)
Like mutual information, the χ² statistic characterizes the correlation between two variables; when scoring a feature, it measures the dependence between feature t and class c. The χ² statistic estimates text features better than mutual information because it considers both the presence and the absence of the feature.
If feature t and class c are independent, the χ² value is zero. The key difference from mutual information is that χ² is a normalized value, so the χ² values of features within the same class are comparable. However, the χ² statistic scores low-frequency features inaccurately; when using it for text feature extraction, low-frequency words should first be excluded according to their document frequency, and only the remaining features scored, which yields reasonably good results.
6. Weight of evidence for text
This is a relatively new evaluation function. It measures the difference between the class probability and the class-conditional probability given the feature. In text processing it does not require computing all probability values of t; only the case where t occurs in the text is considered.
7. Odds ratio
The odds ratio applies only to binary classification, and its characteristic is that it cares only about a feature's score with respect to the target class; in its formula, pos denotes the target class and neg the non-target class.
After comparing these evaluation functions, the system here selects the CHI method, which performs well, for text feature extraction (a computational sketch follows).
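A minimal sketch of a CHI score for a feature t and a class c. The patent does not spell out its exact χ² formula, so the common contingency-table formulation χ²(t,c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)) is assumed here, and the function name is illustrative.

// χ²(t, c) from a 2x2 contingency table over the training documents:
//   A: documents in class c containing t       B: documents not in c containing t
//   C: documents in class c not containing t   D: documents not in c not containing t
double chiSquare(double A, double B, double C, double D) {
    double N = A + B + C + D;
    double denom = (A + C) * (B + D) * (A + B) * (C + D);
    if (denom == 0.0) return 0.0;    // degenerate table: treat as independent
    double diff = A * D - C * B;
    return N * diff * diff / denom;  // zero when t and c are independent
}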
Because existing topic monitoring technology mainly focuses on the miss and false-alarm rates measured on fixed, small data sets, it shows defects in topic ranking, topic similarity, report elimination, and topic description when applied to automatic public opinion hotspot monitoring. To address these problems, this work proposes a new public opinion hotspot monitoring method that exploits the characteristics of public opinion hotspots themselves; by introducing steps such as topic ranking, topic merging and adjustment, report elimination, and topic description, it achieves dynamic and efficient hotspot monitoring of a continuous news stream, as shown in the flow chart of Fig. 1.
The system maintains a topic information list and a news report information list:
The topic information list maintains the information of a number of news topics; the structure of each topic's information is as follows:
typedef struct struTopicInfo {
    int sequence;                              // topic sequence number
    int parent;                                // sequence number of the parent topic
    int firstDoc;                              // sequence number of the first report
    int lastDoc;                               // sequence number of the last report
    int docsCount;                             // number of reports
    DocumentFeature feature[FeatureWordsNum];  // topic feature vector
    char title[TopicTitleLength];              // title
    char summary[TopicSummaryLength];          // summary
} TDTTopicInfo;
In addition, the system sets the sizes of the topic window and the document window, the clustering threshold, and the novelty threshold, where the clustering threshold is greater than the novelty threshold.
const int RecentTopicsNum = 25;           // topic window size in the time window strategy
const int WindowSize = 1000;              // document window size, i.e. the number of documents in the window
const double TDTClusterThreshold = 0.10;  // clustering threshold
const double TDTNoveltyThreshold = 0.095; // novelty threshold
The news report information list maintains the information of the most recent reports; the structure of each report's information is as follows:
typedef struct struDocumentInfo {
    int sequence;                              // report sequence number
    int parent;                                // sequence number of the topic it belongs to
    float score;                               // score value
    char URI[URI_Length];                      // path where the report file is stored
    DocumentFeature feature[FeatureWordsNum];  // word feature vector
    int nextDoc;                               // sequence number of the next report in the same topic
} TDTDocumentInfo;
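A minimal sketch of how a new report might be linked into a topic through the fields above (parent, firstDoc, lastDoc, docsCount, nextDoc); the function name appendReportToTopic and the array-based storage are illustrative assumptions, not the patent's actual code.

// Illustrative: link a newly stored report (index docIdx in the report list)
// into the topic stored at topicIdx, using the struct fields defined above.
void appendReportToTopic(TDTTopicInfo topics[], TDTDocumentInfo docs[],
                         int topicIdx, int docIdx) {
    TDTTopicInfo& t = topics[topicIdx];
    TDTDocumentInfo& d = docs[docIdx];
    d.parent = t.sequence;                 // the report now belongs to this topic
    d.nextDoc = -1;                        // it becomes the last report in the topic's chain
    if (t.docsCount == 0) {
        t.firstDoc = docIdx;               // first report of the topic
    } else {
        docs[t.lastDoc].nextDoc = docIdx;  // append after the previous last report
    }
    t.lastDoc = docIdx;
    t.docsCount += 1;
}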
At initialization the system reads the existing topic information and report information, builds the topic information list and the report information list, and finally writes the complete contents of both lists to a file.
The system can also export the monitoring results as an XML file that follows a DTD defined for the topic information.
Based on the above, the advantage of the present invention is that automatic hotspot monitoring is the key link of the system: the automatic discovery of public opinion hotspots allows government users to understand and grasp the current hot topics on the Internet quickly and in a timely manner, which greatly aids comprehensive monitoring of network public opinion.
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention, and various changes and improvements may be made without departing from the spirit and scope of the invention, all of which fall within the claimed scope. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (6)

1. A public opinion hotspot automatic monitoring system, characterized in that the system includes a Chinese automatic word segmentation module and a feature extraction module;
the Chinese automatic word segmentation module includes an automatic word segmentation basic algorithm unit, an unknown-word recognition unit, and a segmentation ambiguity elimination unit;
the feature extraction module includes a feature representation unit and a vector space model unit;
the steps of the automatic monitoring method of the system are as follows:
1) one report is read from the data sources: the system continuously monitors multiple Internet news data sources, automatically captures news reports from the network, and parses out the time, title, body text, and other information of each report; if no time can be found in the report, the crawl time is used instead;
because there is considerable duplication among the multiple data sources, each newly captured news report is de-duplicated according to its text content: if the degree of duplication between the new report and a previously processed news report exceeds the duplication threshold θd, it is treated as a duplicate report; the duplication threshold θd set in this embodiment is 0.9;
because the scope of news reports is very broad, the reports are classified by a method combining source-based rule classification with content-based automatic classification: rule classification uses the news source, author, and similar attributes, while content-based automatic classification uses the vector space model (VSM) and the support vector machine (SVM) algorithm to classify news reports automatically according to their content and title; the processing of steps 2) to 7) is then carried out within the assigned category c;
2) using a centroid comparison strategy, the report is compared with the existing news topics monitored in category c, taking temporal features and content features into account simultaneously; the similarity between the report and each topic is computed, the maximum similarity Smax and the most similar topic Es are recorded, and the topic closest to the current report is thereby determined; a topic is represented by the several feature words with the highest combined weight across all the news it contains; the similarity between a news report and a topic is computed, under the vector space model, as the cosine of the angle between the two vectors, with the title of the report given a higher weight;
3) according to the maximum similarity Smax and the most similar topic Es obtained in step 2), the current report is handled as follows:
A. if Smax is less than the novelty threshold θn (0.25 in this embodiment): create a new topic in the category of the report;
B. if Smax is greater than θn and less than the clustering threshold θc (0.30 in this embodiment): do nothing and return to step 1);
C. if Smax is greater than θc and less than the contribution threshold θt (0.35 in this embodiment): assign the report to the current topic;
D. if Smax is greater than θt: assign the report to topic Es and adjust Es; the values of Smax, θn, θc, and θt are all greater than 0 and less than or equal to 1;
4) after a user-defined fixed number of new reports have been processed in a category, the news topics in that category are compared pairwise; if the similarity of two topics exceeds the merge threshold θu, they are merged; the inter-topic similarity can be computed with the method used for two clusters in traditional clustering algorithms, considering the pairwise similarities between all news reports in the two topics:
Sim(E_1, E_2) = \frac{\sum_{d_i \in E_1} \sum_{d_j \in E_2} \mathrm{sim}(d_i, d_j)}{|E_1| \cdot |E_2|}
where E1 and E2 are the two monitored news topics, di and dj are news reports in E1 and E2 respectively, sim(di, dj) is the similarity between two news reports, and |E1| and |E2| are the numbers of news reports contained in the two topics;
5) After a fixed, user-determined number of new reports has been added to a category, the news reports inside each topic are pruned: the similarity of each news report to its topic is recalculated, and reports whose similarity falls below the clustering threshold θc, or which no longer satisfy the constraints, are removed; the internal representation of the topic and its weights are then recalculated;
6) If the number of topics in the current category exceeds the topic window size, all news topics in the category are ranked: a score is calculated for each news topic by combining its temporal characteristics with its quantitative characteristics, and the topics are sorted by this score; several different rankings are maintained at the same time, covering the most recent 12 hours, 1 day, 3 days, 7 days and 30 days, and a topic is eliminated only when it falls outside the topic window in every ranking; in this way the multiple rankings give the user information references at different granularities, while news topics outside the topic window are eliminated by the system to improve processing efficiency;
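A sketch of the multi-window elimination of step 6: a topic survives as long as it ranks inside the topic window in at least one of the time-sliced rankings (12 hours, 1 day, 3 days, 7 days, 30 days); the concrete scoring function, here simply the number of reports falling inside each window, is an assumption, since the claim only says that temporal and quantitative characteristics are combined.

```python
import time

WINDOWS_HOURS = [12, 24, 72, 168, 720]  # 12 h, 1 day, 3 days, 7 days, 30 days

def topic_score(topic, window_hours, now=None):
    """Illustrative score: number of reports in the topic that fall inside the time window."""
    now = now or time.time()
    cutoff = now - window_hours * 3600
    return sum(1 for report in topic["reports"] if report["timestamp"] >= cutoff)

def surviving_topics(topics, window_size, now=None):
    """A topic survives if it ranks inside the topic window in at least one time-sliced ranking."""
    keep = set()
    for hours in WINDOWS_HOURS:
        ranked = sorted(range(len(topics)),
                        key=lambda i: topic_score(topics[i], hours, now),
                        reverse=True)
        keep.update(ranked[:window_size])
    return [topics[i] for i in sorted(keep)]
```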
7) According to the user's requirements, the monitoring results are exported: a description is calculated for every current topic in the category; at the same time, by combining the temporal characteristics of each topic with the number of news reports it contains, the several highest-scoring news topics in each category are selected as that category's hottest news topics, and the topic description and the list of news reports contained in each topic are output, where the topic description is generated as follows (a sketch follows the list):
A. The several feature words with the highest weights inside the topic are read;
B. Among the reports in the topic whose similarity to the topic is greater than the topic threshold θe, the title of the most recent news report is selected; the topic threshold can also be set proportionally;
C. A and B are combined and the description of the topic is output.
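The description generation of items A-C can be sketched as: take the top-weighted feature words of the topic, pick the most recent title among reports whose similarity to the topic exceeds θe, and combine the two; the value of θe, the number of feature words and the output format are illustrative assumptions.

```python
def topic_description(topic, topic_sim, theta_e=0.4, top_k=5):
    """A: top-k feature words; B: most recent qualifying title; C: combine both.
    theta_e and top_k are illustrative values, not fixed by the claim."""
    # A. feature words with the highest weights inside the topic
    words = [w for w, _ in sorted(topic["weights"].items(),
                                  key=lambda kv: kv[1], reverse=True)[:top_k]]
    # B. most recent title among reports similar enough to the topic
    candidates = [r for r in topic["reports"] if topic_sim(r, topic) > theta_e]
    title = max(candidates, key=lambda r: r["timestamp"])["title"] if candidates else ""
    # C. combined description
    return {"feature_words": words, "latest_title": title}
```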
2. The public opinion hotspot automatic monitoring system according to claim 1, characterized in that the basic automatic word segmentation algorithm unit includes the maximum matching method, the full segmentation algorithm and the probability multiplication algorithm;
The maximum matching method is a mechanical Chinese word segmentation algorithm that does not consider the likelihood of connections between words, but only looks up, by length, the character strings occurring in the sentence in a dictionary; such methods are relatively mature, the maximum matching method (Maximum Matching, MM for short) being the most representative;
The idea of the maximum matching method is as follows: take a character string of maximum length (6 in the present system) from the input stream and look it up in the dictionary; if it matches, output it and take the next string; otherwise backtrack by one character and look up the shorter string, down to length 1, at which point the input stream is advanced by one character; this process continues until the input stream is exhausted;
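A minimal forward maximum matching segmenter following the description above: take up to six characters, back off one character at a time on a dictionary miss, and advance by a single character once only one character is left.

```python
def maximum_matching(sentence, dictionary, max_len=6):
    """Forward maximum matching: greedily match the longest dictionary word (up to max_len)."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                # single characters are emitted even when not in the dictionary
                words.append(candidate)
                i += length
                break
    return words

# Toy example: the greedy match picks '研究生' first, illustrating why ambiguity handling matters
print(maximum_matching("研究生命起源", {"研究", "研究生", "生命", "起源"}))
# ['研究生', '命', '起源']
```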
The full segmentation algorithm is an algorithm with no segmentation blind spots: so-called full segmentation obtains all segmentations of the sentence that are consistent with the dictionary; adopting this algorithm raises the problem of choosing the best segmentation;
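The full segmentation algorithm can be sketched as a recursive enumeration of every way the sentence can be split into dictionary words (single characters are always admitted here, which is a simplifying assumption); choosing among the resulting candidates is then left to a scoring step such as the probability multiplication sketch further below.

```python
def full_segmentation(sentence, dictionary, max_len=6):
    """Enumerate all segmentations consistent with the dictionary (single characters always allowed)."""
    if not sentence:
        return [[]]
    results = []
    for length in range(1, min(max_len, len(sentence)) + 1):
        head = sentence[:length]
        if length == 1 or head in dictionary:
            for rest in full_segmentation(sentence[length:], dictionary, max_len):
                results.append([head] + rest)
    return results

# Toy example: count the dictionary-consistent segmentations of the same sentence
segs = full_segmentation("研究生命起源", {"研究", "研究生", "生命", "起源"})
print(len(segs))
```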
The probability multiplication algorithm is a statistics-based method that uses the co-occurrence of characters and of words as the basis for segmentation; its advantage is that it is neither restricted to a particular application nor confined to the dictionary on which the segmentation is based; the method requires large-scale training text to train the model parameters, and the choice of training text also has a significant influence on the segmentation results;
Let S = s1, s2, ..., sm be the Chinese character string to be segmented, and suppose S has n possible segmentations; let Wi = w1, w2, ..., wk be the i-th segmentation, i = 1, ..., n;
Let P(W|S) be the probability that the character string S is segmented as W; the statistics-based segmentation method then finds, among the n segmentations of S, the one with the highest probability,
i.e. P(W|S) = MAX(P(W1|S), P(W2|S), ..., P(Wn|S)), where P(W|S) is called the evaluation function;
According to the Bayes formula, P(W|S) = P(W)·P(S|W)/P(S); across the different segmentations of S, P(S) is a constant, and P(S|W) is the probability of the sentence S occurring given the word string W, so P(S|W) = 1; maximizing P(W|S) is therefore equivalent to maximizing P(W).
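A minimal sketch of the probability multiplication idea under a unigram independence assumption, P(W) ≈ Π P(wi): every candidate segmentation is scored by the product of its word probabilities estimated from a segmented training corpus, and the highest-scoring one is kept; the unigram assumption and the smoothing constant are simplifications not spelled out in the claim.

```python
import math
from collections import Counter

def unigram_model(training_words, smoothing=1e-8):
    """Estimate P(w) from a training corpus of already segmented words."""
    counts = Counter(training_words)
    total = sum(counts.values())
    return lambda w: counts.get(w, 0) / total + smoothing

def best_segmentation(candidates, word_prob):
    """Pick the segmentation W maximizing P(W) ≈ Π P(wi) (log-sum to avoid underflow)."""
    def log_score(words):
        return sum(math.log(word_prob(w)) for w in words)
    return max(candidates, key=log_score)
```

Combined with the full segmentation sketch above, best_segmentation(full_segmentation(S, dictionary), unigram_model(corpus_words)) yields the maximum-probability segmentation of S.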
3. The public opinion hotspot automatic monitoring system according to claim 1, characterized in that the unregistered word recognition unit uses two performance indices:
1) Recall: the ratio of the number of identified unregistered words of a given type to the total number of unregistered words of that type in the text;
2) Accuracy: the ratio of the number of identified unregistered words that actually belong to the type to the total number of unregistered words identified.
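Expressed directly in code, with gold denoting the unregistered words of the type actually present in the text and found the words the recognizer reports (both names are illustrative), the two indices are:

```python
def unregistered_word_metrics(found, gold):
    """Recall = |found ∩ gold| / |gold|; accuracy (precision) = |found ∩ gold| / |found|."""
    hits = len(set(found) & set(gold))
    recall = hits / len(gold) if gold else 0.0
    accuracy = hits / len(found) if found else 0.0
    return recall, accuracy
```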
4. The public opinion hotspot automatic monitoring system according to claim 1, characterized in that the Chinese word segmentation ambiguity elimination unit is based on the following: Chinese word segmentation is a process of understanding that synthesizes lexical, grammatical, semantic and other information; automatic Chinese word segmentation and the use of this information stand in a complementary relation, interconnected and mutually constraining, and purely mechanical segmentation inevitably introduces segmentation ambiguity;
Segmentation ambiguity means that a span of a Chinese sentence, if matched against the vocabulary by simple string matching alone, may admit several segmentations; a character string containing segmentation ambiguity is called an ambiguous phrase; segmentation ambiguity is a difficult point in Chinese word segmentation research, and it comes in three basic types:
1) Overlapping ambiguity (also called crossing ambiguity): the character string ABC can be segmented either as AB/C or as A/BC, i.e. both AB and BC are words;
2) Combinational ambiguity: the character string AB can be segmented either as AB or as A/B, i.e. AB is a word and A and B are also words;
3) Mixed ambiguity: produced by nesting of either of the first two forms within itself or by their cross-combination;
There are two main kinds of methods for resolving segmentation ambiguity: rule-based methods and statistics-based methods.
5. The public opinion hotspot automatic monitoring system according to claim 1, characterized in that the feature representation unit generally maintains three dictionaries, a main dictionary, a synonym dictionary and a containing-word dictionary, and performs word frequency statistics at the same time; when word frequency statistics and feature extraction are carried out, the subject entry in the main dictionary is taken as the representative entry, and its entry frequency statistic is given by the following formula:
that is, the frequency of a representative entry in a document is obtained by the weighted accumulation of three parts: the word frequency of the subject entry, of its synonym entries, and of its containing-word entries;
where:
Tf: the word frequency of the subject entry f
TMf: the word frequency weight of the subject entry f
TTf_i: the word frequency weight of the synonym entry f_i, with m1 synonyms in total
TIf_i: the word frequency weight of the containing-word entry f_i, with m2 containing-words in total
E: the weighting value
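Because the exact formula appears only as a figure in the original, the sketch below merely illustrates the stated idea, namely that the frequency of a representative main-dictionary entry accumulates, with weights, the frequencies of the entry itself, of its synonyms and of its containing-words; the particular combination used here (subject part scaled by TMf, auxiliary parts scaled by their own weights and by E) is an assumption.

```python
def representative_frequency(doc_counts, entry, weights):
    """Weighted accumulation of subject-entry, synonym and containing-word frequencies.

    doc_counts: word -> raw frequency in the document
    weights: {"TM": ..., "TT": {synonym: weight}, "TI": {containing_word: weight}, "E": ...}
    (the layout and the combination rule are assumptions, not taken from the claim)
    """
    tm, e = weights["TM"], weights["E"]
    total = doc_counts.get(entry, 0) * tm                      # subject-entry part
    total += e * sum(doc_counts.get(s, 0) * w                  # synonym part
                     for s, w in weights["TT"].items())
    total += e * sum(doc_counts.get(c, 0) * w                  # containing-word part
                     for c, w in weights["TI"].items())
    return total
```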
In the VSM, a text document is regarded as being composed of a set of terms (T1, T2, ..., Tn), each term being assigned a weight wi; every document is thus mapped to a vector in the vector space spanned by the set of term vectors, and the text matching problem can be converted into a vector matching problem in that vector space.
6. The public opinion hotspot automatic monitoring system according to claim 1, characterized further in that the basic idea of the vector space model unit is to represent a text as a vector (w1, w2, ..., wn), where wi is the weight of the i-th feature item. As feature items one can generally choose characters, words or phrases; experimental results generally indicate that words are better feature items than characters or phrases. To represent a text as a vector in the vector space, the text must therefore first be segmented into words, and these words become the dimensions of the vector. The initial representation used pure 0/1 values: if a word occurs in the text, the corresponding dimension of the text vector is 1, otherwise it is 0. Since this cannot reflect how important the word is within the text, the 0/1 values were gradually replaced by more precise word frequencies. Word frequency is divided into absolute and relative word frequency: absolute word frequency represents the text by the raw frequency with which a word occurs in it, while relative word frequency is a normalized word frequency, computed mainly with a TF-IDF formula. Several TF-IDF formulas exist; the system adopts a rather common one:
W(t,d) = tf(t,d) × log(N/ni) / sqrt( Σ_{t∈d} [ tf(t,d) × log(N/ni) ]² )
where W(t,d) is the weight of word t in text d, tf(t,d) is the word frequency of word t in text d, N is the total number of training texts, ni is the number of training texts in which t occurs, and the denominator is a normalization factor;
Other TF-IDF formulas also exist, for example:
W(t,d) = (1 + log2 tf(t,d)) × log2(N/ni) / sqrt( Σ_{t∈d} [ (1 + log2 tf(t,d)) × log2(N/ni) ]² )
The parameters in this formula have the same meaning as in the formula above;
The TF-IDF vector reflects the term space of the training document collection: each component of the vector corresponds to one term, and the size of the component characterizes that term's ability to distinguish document content. The more widely a term occurs across the document set, the lower its ability to distinguish documents; conversely, the more frequently it occurs in a specific document, the stronger its ability to distinguish that document's content. This belongs to the bag-of-words representation of documents: all words are extracted from the document, while word order and text structure are disregarded.
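A direct rendering of the normalized TF-IDF weighting described above, tf(t,d) × log(N/ni) divided by the Euclidean norm of the document's weight vector; the logarithm base and the absence of smoothing are incidental choices.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists; returns one normalized {term: weight} vector per document."""
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(doc))  # ni: docs containing term t
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```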

Legal Events

Code Title/Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20150909)