CN105354216B

CN105354216B - A Chinese microblog topic information processing method

Info

Publication number: CN105354216B
Application number: CN201510627783.0A
Authority: CN
Inventors: 赵妍妍; 秦兵; 李泽魁
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.
Priority date: 2015-09-28
Filing date: 2015-09-28
Publication date: 2018-09-07
Anticipated expiration: 2035-09-28
Also published as: CN105354216A

Abstract

A Chinese microblog topic information processing method. The present invention relates to an algorithm for analyzing the causes of the sentiment distribution of microblog events. The present invention aims to solve the problem that the hierarchical clustering algorithm and the correction algorithm used in current microblog topic information processing methods have low accuracy, so that event-related microblogs cannot be assigned to the correct topic. The present invention mines event topics and their related microblogs using two methods, an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, and thereby performs sentiment distribution statistics and analysis on the related microblogs. The present invention processes microblog topic information more accurately and is applied to the field of microblog topic information processing.

Description

A Chinese microblog topic information processing method

Technical field

The present invention relates to microblog topic information processing methods.

Background technology

As an emerging social media platform and the most popular one in China, microblogging has hundreds of millions of active users, and more and more netizens choose to obtain and share information that interests them on microblogs. Faced with millions of new microblog posts every day, analyzing netizens' opinions on and attitudes toward a given event is meaningful work, and a growing number of researchers have begun to study the information behind big data such as microblogs.

Because microblogging has been part of people's lives as a form of social media for only a short time, there has been relatively little research, in China or abroad, on analyzing the causes of the sentiment distribution of microblog events. The main microblog event mining methods at this stage are as follows. In 2011, Weng et al. applied wavelet transform theory to monitor the frequency of certain terms in microblog text, selected bursty words by analyzing their autocorrelation, and clustered them into emergent events (document [1]: Weng J, Lee B S. Event Detection in Twitter[J]. ICWSM, 2011, 11: 401-408). This method is fairly effective for event monitoring but is easily disturbed by noise. To rank hot entries in microblogs, Zhao et al. computed a probability value from information such as the forwarding rate and word frequency of microblogs containing key terms, and derived a ranking formula based on "interestingness" from that probability (document [2]: Zhao W X, Jiang J, He J, et al. Topical keyphrase extraction from twitter[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011: 379-388). Spina et al. surveyed existing text extraction approaches and performed topic extraction on a small annotated microblog corpus; surprisingly, the simplest extraction method, based on word frequency/inverse document frequency, performed best, and the experiments also showed that noun-filtering preprocessing is effective for this task (document [3]: Spina D, Meij E, de Rijke M, et al. Identifying entity aspects in microblog posts[C]//Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012: 1089-1090). Compared with this relatively coarse earlier work, the 2014 work of Das and Kannan (document [4]: Das A, Kannan A. Discovering topical aspects in microblogs[C]//Proceedings of COLING. The 25th International Conference on Computational Linguistics: Technical Papers, 2014: 860-871) is much more thorough. To mine hot topics on Twitter, they observed the common properties of microblog events and derived three evaluation measures, "diversity" (Diversity), "uniqueness" (Uniqueness) and "burstiness" (Burstiness), then fitted the distribution of the data with a Gaussian mixture model on a weakly labeled training corpus to decide whether a candidate aspect is a microblog event. Such supervised topic extraction methods can also achieve good results, but unfortunately this algorithm does not address the clustering and ranking of topics.

Microblog sentiment can be classified in various ways. For example, following conventional classification it can be divided at coarse granularity into two classes, positive and negative, or at fine granularity into five classes, happiness, anger, sadness, fear and surprise (document [5]: Zhao Y, Qin B, Liu T, et al. Social sentiment sensor: a visualization system for topic detection and topic sentiment analysis on micro-blog[J]. Multimedia Tools and Applications, 2014, 22(1): 1-18). At SemEval 2015 (Semantic Evaluation), Rosenthal et al. used a feature-extraction-based microblog sentiment classification algorithm that reached the best international level (document [6]: Rosenthal S, Nakov P, Kiritchenko S, et al. SemEval-2015 Task 10: Sentiment analysis in Twitter[C]//Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval. 2015).

Summary of the invention

The present invention is proposed to solve the problem that the hierarchical clustering algorithm and the correction algorithm used in current microblog topic information processing methods have low accuracy, so that event-related microblogs cannot be assigned to the correct topic; it provides a more accurate microblog topic information processing method.

A Chinese microblog topic information processing method is carried out in the following steps:

Step 1: Determination of microblogs related to a hot event;

The microblogs related to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant;

Step 2: Discovery of key microblog topics;

The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, i.e. the text between two "#" symbols in a microblog;
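Step 2 can be illustrated with a minimal sketch. The function names `extract_hashtags` and `count_topics` and the regular expression are illustrative assumptions, not part of the described method; the sketch only shows one way to pull out the text between two "#" symbols and count how many microblogs carry each Hashtag.

```python
import re
from collections import Counter

# A Hashtag is taken to be the text between two "#" symbols, e.g. "#topic#".
HASHTAG_PATTERN = re.compile(r"#([^#]+)#")


def extract_hashtags(text):
    """Return the non-empty Hashtag strings found in one microblog text."""
    tags = (tag.strip() for tag in HASHTAG_PATTERN.findall(text))
    return [tag for tag in tags if tag]


def count_topics(microblogs):
    """Count, for each Hashtag, the number of microblogs that contain it."""
    counts = Counter()
    for text in microblogs:
        # set() so that a Hashtag repeated inside one microblog counts once.
        counts.update(set(extract_hashtags(text)))
    return counts


posts = ["#MERS# first case confirmed", "worried #MERS# #Guangdong#", "no tag here"]
print(count_topics(posts).most_common())
```

Counting each Hashtag once per microblog is a design choice made here so that the count can be read as "number of microblogs under the topic".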

Step 3: Topic clustering and ranking algorithm;

After the microblogs related to the hot event are obtained, topic extraction and clustering-ranking are performed first, where topic extraction refers to extracting and summarizing the topic information in the microblogs, and topic clustering-ranking refers to first clustering topics that are partially similar;

(1) Hierarchical clustering algorithm

The Hashtag string similarity algorithm is used, i.e. string similarity serves as the basis for the distance calculation in clustering, and the calculation formula is as follows:

where H_A and H_B are the Hashtag strings in S_A and S_B, S_A is microblog text A, S_B is microblog text B, LCS is the longest common subsequence of the two strings, and Edit Distance is the edit distance. The similarity value of the two strings is normalized, i.e. the first and second parts of the formula are divided by min(Length(H_A), Length(H_B)) and max(Length(H_A), Length(H_B)) respectively;
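The two normalized terms described above can be sketched as follows. The text specifies that the LCS term is divided by the shorter length and the edit-distance term by the longer length, but it does not show how the two terms are combined; the subtraction in `hashtag_similarity` is therefore an assumption made only for illustration, as are all function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hashtag_similarity(h_a, h_b):
    """Normalized LCS term minus normalized edit-distance term.

    ASSUMPTION: combining the two terms by subtraction is a guess; only the
    two normalizations (min length, max length) come from the description.
    """
    if not h_a or not h_b:
        return 0.0
    lcs_term = lcs_length(h_a, h_b) / min(len(h_a), len(h_b))
    edit_term = edit_distance(h_a, h_b) / max(len(h_a), len(h_b))
    return lcs_term - edit_term
```

Under this assumed combination, identical Hashtags score 1.0 and Hashtags with no characters in common score -1.0, which matches the stated intuition that a longer common subsequence and a shorter edit distance mean higher similarity.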

(2) Topic cluster result ranking algorithm

A weighted relation between the number of microblogs and the number of topics in the cluster result is used as the ranking formula;

RankingScore(topic) = log(topic_weibonumber) · topic_num (4)

In the formula, RankingScore(topic) is the ranking score of topic topic, topic_weibonumber is the number of microblogs contained under the topic, and topic_num is the number of topics merged in the cluster result; the number of microblogs is log-transformed;
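Formula (4) can be sketched in a few lines. The base of the logarithm is not stated; the natural logarithm is assumed here, which does not affect the ordering because changing the base only multiplies every score by the same positive constant. The names `ranking_score` and `rank_clusters` and the example counts are hypothetical.

```python
import math


def ranking_score(weibo_number, topic_num):
    """Formula (4): log(number of microblogs) times number of merged topics.

    Assumes weibo_number >= 1 and a natural logarithm (base not specified).
    """
    return math.log(weibo_number) * topic_num


def rank_clusters(clusters):
    """Sort cluster names by descending ranking score.

    clusters maps a cluster name to (weibo_number, topic_num).
    """
    return sorted(clusters, key=lambda name: ranking_score(*clusters[name]), reverse=True)


# Hypothetical counts: B has fewer microblogs but merges three Hashtags.
example = {"A": (10000, 1), "B": (1000, 3)}
print(rank_clusters(example))
```

With these made-up counts, cluster B outranks cluster A even though A has ten times as many microblogs, because the logarithm damps the microblog count while the number of merged topics enters linearly.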

Step 4: Microblog topic correction algorithm;

(1) Initial input: the K results of topic clustering and ranking, comprising the first S topics and the last U topics;

(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, according to the ranking of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;

(3) Feature extraction and model training are performed on the corpus of the S topics;

(4) The trained model is used to predict the non-seed to-be-predicted set U1;

(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;

(6) The loop restarts from step (2) until the append rate of microblogs for the S topics falls below a threshold, at which point the loop ends;

(7) Final output: the self-expanded S topics and their related microblogs;
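The loop in steps (1) to (7) follows the general pattern of self-training, which can be sketched as below. This is a simplified illustration under several assumptions: the classifier is a toy character-level naive Bayes model standing in for the unspecified feature extraction and model; the negative-example set U2 and the similarity-based split of the U topics are omitted; the append rate is assumed to be the number of newly added microblogs divided by the number already under the seed topics; and `prob_threshold` is a hypothetical value. Every seed topic is assumed to contain at least one microblog.

```python
import math
from collections import Counter


def train(topics):
    """Train a toy character-level naive Bayes model (stand-in classifier)."""
    vocab = {ch for texts in topics.values() for t in texts for ch in t}
    total_docs = sum(len(texts) for texts in topics.values())
    model = {}
    for topic, texts in topics.items():
        counts = Counter(ch for t in texts for ch in t)
        model[topic] = (math.log(len(texts) / total_docs), counts, sum(counts.values()))
    return model, len(vocab) + 1  # +1 reserves mass for unseen characters


def predict(trained, text):
    """Return (most probable topic, its posterior probability)."""
    model, vocab_size = trained
    log_scores = {}
    for topic, (log_prior, counts, total) in model.items():
        score = log_prior
        for ch in text:
            score += math.log((counts[ch] + 1) / (total + vocab_size))
        log_scores[topic] = score
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_expand(seed_topics, pool, prob_threshold=0.9,
                append_rate_threshold=0.1, max_iter=10):
    """Iteratively move confident microblogs from the pool into seed topics."""
    topics = {t: list(texts) for t, texts in seed_topics.items()}
    pool = list(pool)
    for _ in range(max_iter):
        trained = train(topics)
        size_before = sum(len(texts) for texts in topics.values())
        remaining, added = [], 0
        for text in pool:
            topic, prob = predict(trained, text)
            if prob > prob_threshold:
                topics[topic].append(text)  # step (5): add and remove from U1
                added += 1
            else:
                remaining.append(text)
        pool = remaining
        if added / size_before < append_rate_threshold:  # step (6): stop test
            break
    return topics, pool
```

Any classifier that outputs class probabilities could replace `train` and `predict`; the loop structure (train on seeds, predict the pool, promote confident predictions, stop when few are added) is the part this sketch is meant to show.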

Step 5: Evaluation using the Precision@5 (P@5) index;

The P@5 index is used to reflect the quality of the algorithm's ranking results, and the average append rate of the microblog count and the average hit rate of appended microblogs are used as the evaluation indices of the microblog self-expansion algorithm;

The P@5 index is the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the standard answer, i.e. formula (5):

The average append rate of the microblog count is the average, over topics, of the append rate of the microblogs related to each topic after self-expansion, i.e. formula (6):

The average hit rate of appended microblogs is the ratio of the number of microblogs appended to an existing topic by the algorithm that are correct hits to the number of microblogs of the current topic, i.e. formula (7):
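The three indices can be sketched as follows. The P@5 function follows the verbal definition directly. For the other two, the exact denominators are assumptions: the append rate of a topic is taken as (count after expansion minus count before) divided by the count before, and the hit rate of a topic as correctly appended microblogs divided by all appended microblogs of that topic. All function and parameter names are illustrative.

```python
def precision_at_5(ranked_topics, gold_topics):
    """Correct topics among the top 5 predictions over the top 5 gold topics."""
    gold5 = set(gold_topics[:5])
    hits = sum(1 for topic in ranked_topics[:5] if topic in gold5)
    return hits / len(gold5)


def average_append_rate(count_before, count_after):
    """Mean per-topic append rate (ASSUMED: (after - before) / before)."""
    rates = [(count_after[t] - count_before[t]) / count_before[t] for t in count_before]
    return sum(rates) / len(rates)


def average_hit_rate(correct_appended, total_appended):
    """Mean per-topic hit rate (ASSUMED: correct appended / all appended)."""
    rates = [correct_appended[t] / total_appended[t] for t in total_appended]
    return sum(rates) / len(rates)


print(precision_at_5(["a", "b", "c", "d", "e", "f"], ["a", "c", "x", "y", "z"]))
```

In practice, whether an appended microblog is a correct hit would be decided by manual annotation of a sample, so `correct_appended` is expected to come from annotated data rather than from the algorithm itself.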

Effects of the invention:

Because the public shows different sentiment distributions toward different topics of a single event, the present invention uses two methods, an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, to mine event topics and their related microblogs, and finally uses a mature Chinese microblog sentiment classification algorithm (Zekui Li, Yanyan Zhao, Bing Qin, et al. Feature Engineering for Chinese Micro-blog Sentiment Classification[J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 570-579) to perform sentiment distribution statistics and analysis on the related microblogs.

The comparison of topic clustering distance algorithms shows that the accuracy of the TF/IDF method is 53.3%, far below the 78.7% of the method based on Hashtag string similarity. The comparison of cluster result ranking algorithms shows that ranking by microblog count alone reaches 66.7%, below the 78.7% of ranking by the weighted relation between microblog count and the number of topics in the cluster result. In the microblog event topic correction experiment, the average append rate of the microblog count was traded off against the average hit rate of appended microblogs and the number of iterations was set to 2, which gives the best combined result for the number of self-expanded microblogs and the hit rate of appended microblogs.

Description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a line chart of the change in sentiment distribution of the "MERS invades Guangdong" event from May 22, 2015 to June 4, 2015;

Fig. 3 is a flow chart of the microblog topic information processing algorithm;

Fig. 4 is a flow chart of the topic clustering and ranking algorithm;

Fig. 5 is a flow chart of the microblog topic correction model based on semi-supervised learning;

Fig. 6 is a result chart for the microblog topic correction algorithm: average append rate of the microblog count;

Fig. 7 is a result chart for the microblog topic correction algorithm: average hit rate of appended microblogs.

Detailed description of the embodiments

Specific embodiment one: As shown in Fig. 1, a Chinese microblog topic information processing method comprises the following steps:

Step 1: Determination of microblogs related to a hot event;

The microblogs related to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant; the language technology platform is the Harbin Institute of Technology Language Technology Platform (Che W, Li Z, Liu T. LTP: A Chinese language technology platform[C]//Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010: 13-16);
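The keyword matching part of step 1 can be sketched as a simple substring test. This sketch deliberately leaves out the preprocessing done with the language technology platform, and the keyword list, the case folding and the function names are assumptions made for illustration.

```python
def is_relevant(text, keywords):
    """Return True if the microblog text contains any of the event keywords."""
    normalized = text.lower()
    return any(keyword.lower() in normalized for keyword in keywords)


def filter_relevant(microblogs, keywords):
    """Keep only the microblogs judged relevant to the hot event."""
    return [text for text in microblogs if is_relevant(text, keywords)]


print(filter_relevant(["MERS case in Guangdong", "nice weather"], ["mers"]))
```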

Step 2: Microblog topic discovery;

Step 3: Topic clustering and ranking algorithm;

The microblogs related to the hot event are input to the topic discovery module, which first extracts the corresponding topic titles by Hashtag matching and mines N topics through strategies such as ranking by microblog count and filtering. The topic clustering module then applies the hierarchical clustering (Hierarchical Cluster) algorithm (Kaufman L, Rousseeuw P J. Finding groups in data: An introduction to cluster analysis[M]. John Wiley & Sons, 2009) together with some filtering rules to obtain K candidate merged topics, which are ranked by the ranking algorithm and used as the input for key topic extraction;

(1) Hierarchical clustering algorithm

After comparing the TF/IDF similarity algorithm with the Hashtag string similarity algorithm, the Hashtag string similarity was finally selected, i.e. string similarity serves as the basis for the distance calculation in clustering, where TF/IDF is the word frequency/inverse document frequency computed over microblogs. The two similarity formulas are as follows:

Similarity_TFIDF(S_A, S_B) = cosine(TFIDF(S_A), TFIDF(S_B)) (1)

In the formulas, S_A is microblog text A, S_B is microblog text B, and the Hashtag strings in S_A and S_B are H_A and H_B respectively; cosine is the function that computes the cosine of the angle between two vectors, LCS is the longest common subsequence of two strings, and Edit Distance is the edit distance. In formula (2), given the Hashtag strings H_A and H_B of S_A and S_B, the longer the longest common subsequence of the two strings and the shorter their edit distance, the higher their similarity. To make the formula generally applicable, the similarity value of the two strings is normalized, i.e. the first and second parts of the formula are divided respectively by the length of the shorter of H_A and H_B, min(Length(H_A), Length(H_B)), and by the length of the longer, max(Length(H_A), Length(H_B));
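Formula (1) can be illustrated with a small pure-Python sketch. The input is assumed to be already segmented into words (for Chinese text, by a word segmenter), and the smoothed IDF variant used here, log((1 + N) / (1 + df)) + 1, is one common choice rather than a weighting taken from the description. The function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build one sparse TF/IDF vector (dict: token -> weight) per document."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            tok: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["ctrip", "server", "down"], ["ctrip", "server", "down"], ["mers", "case"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Two microblogs with the same words get a cosine of about 1, and two with no shared words get 0, which is why heavy word overlap among microblogs of one event makes this measure hard to threshold.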

(2) Topic cluster result ranking algorithm

Formula (3), which ranks by microblog count, was compared with formula (4), which ranks by a weighted relation between microblog count and the number of topics in the cluster result; formula (4), which takes more factors into account, was chosen as the ranking formula;

RankingScore(topic) = topic_weibonumber (3)

RankingScore(topic) = log(topic_weibonumber) · topic_num (4)

In the formulas, RankingScore(topic) is the ranking score of topic topic, topic_weibonumber is the number of microblogs contained under the topic, and topic_num is the number of topics merged in the cluster result. To make the indices comparable, the microblog count is log-transformed; formula (4) extends formula (3) by also considering the number of topics under a single cluster after clustering;

Step 4: Microblog topic correction algorithm;

(6) The loop restarts from step (2) until the loop termination condition is reached, i.e. the append rate of microblogs for the S topics falls below a threshold;

(7) Final output: the self-expanded S topics and their related microblogs;

Table 1: Feature template for automatic key topic extraction

Step 5: Evaluation using the Precision@5 (P@5) index;

Finally, the P@5 index is used to reflect the quality of the algorithm's ranking results, and the average append rate of the microblog count and the average hit rate of appended microblogs are used as the evaluation indices of the microblog self-expansion algorithm;

The P@5 index is defined as the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the standard answer, i.e. formula (5):

Specific embodiment two: This embodiment differs from specific embodiment one in that the threshold for the append rate of microblogs for the S topics in step 4 is 0.1.

Specific embodiment three: This embodiment differs from specific embodiments one and two in that, in step 4, after the final output is obtained in step (7), steps (1) to (7) are repeated, with the self-expanded S topics and their related microblogs finally output by step (7) as the initial input.

Example one:

Analysis of real microblog events shows that the sentiment distribution of a microblog hot event changes over time. Take the "MERS invades Guangdong" event, one of the microblog hot events of May 2015: its sentiment distribution line chart (from May 22 to June 4) is shown in Fig. 2, where the five lines represent the changes in five emotions: happiness, anger, sadness, fear and surprise. After the first MERS case appeared in South Korea on May 22, microblogs about the event began to be captured, and in the following days the event kept fermenting. In the figure, the distribution of netizens' emotions changes at May 26, May 29, May 30 and June 1.

Why do netizens show different sentiment distributions toward this event as time passes? Systematic analysis shows that on May 26 the news reported "South Korean MERS virus carrier confirmed in Guangdong", and the prevailing emotions were fear and sadness; on May 29 the rumor "the virus patient under isolation treatment has worsened" spread on microblogs, and netizens' emotions turned mostly to fear; on May 30 the news exposed problems such as "regulatory loopholes and weak technology in South Korea's medical authorities", and emotions turned mostly to anger; on June 1 major websites reported the moving story that "doctors and nurses of the Huizhou hospital ICU drew lots to go on duty so that the epidemic would not spread", a wave of encouragement, support and praise arose across the country, and the proportion of happiness in the sentiment rose.

The analysis of the "MERS invades Guangdong" event shows that the sentiment distribution changing over time in fact represents netizens' different views on different aspects of the event. For example, in this event the "fear" sentiment is directed at the news "the virus patient has worsened", while "anger" is directed at topics such as "regulatory loopholes in South Korea's medical authorities". The present invention defines the different aspects of discussion of an event as the topics of a single event.

As shown in Fig. 3, a Chinese microblog topic information processing method comprises the following steps:

Step 1: Determination of microblogs related to a hot event;

The present invention collected microblog hot events that occurred between May 2015 and July 15, together with their related microblogs, as shown in Table 2;

Table 2: The 15 microblog hot events used in the experiment (partial)

In Table 2, the second column is the name of the microblog hot event and the third column is the corresponding number of microblogs; the total number of microblogs is 1,210,000.

Step 2: Discovery of key microblog topics;

For each event described in step 1, the 5 most popular topics representing different aspects were manually annotated as the standard answer; some examples are shown in Table 3.

Table 3: Annotated experimental data sample

Table 3 lists the "Qingan County shooting incident" and its five annotated topics on different aspects.

Step 3: Topic clustering and ranking algorithm;

After the microblogs related to the hot event are obtained, topic extraction and clustering-ranking are performed first, where topic extraction refers to extracting and summarizing the topic information in the microblogs, and topic clustering-ranking refers to first clustering topics that are partially similar. For example, the two topics "Ctrip official website hacked, possibly a man-made disaster" and "Ctrip server hacked, possibly a man-made disaster" are very similar. The merged topics are then ranked by how hotly they are discussed. The flow chart of this part of the algorithm is shown in Fig. 4.

(1) Hierarchical clustering algorithm

The present invention describes two distance calculation methods for hierarchical clustering, namely the similarity method based on microblog word frequency/inverse document frequency (formula 1) and the method based on Hashtag string similarity (formula 2). The P@5 results of the topic clustering distance algorithm comparison experiment are shown in Table 4:

Table 4: Comparison experiment of topic clustering distance calculation methods

The results in Table 4 show that the accuracy of the TF/IDF method (the similarity method based on microblog word frequency/inverse document frequency) is far below that of the string similarity method. Because the words used in microblogs related to a single event overlap heavily (for example, words such as "Ctrip" and "server" appear frequently in microblogs related to the "Ctrip official website hacked" event), the clustering process is disturbed and the optimal similarity threshold for hierarchical clustering is hard to determine. Using the Hashtag string similarity method instead, together with the iterative process of hierarchical clustering, achieves good results in this task.
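The iterative merging of similar Hashtags can be sketched as bottom-up (agglomerative) clustering with a similarity threshold. Several choices here are assumptions made for illustration: the `similarity` function uses the ratio from Python's `difflib.SequenceMatcher` as a stand-in rather than formula (2), clusters are compared by average linkage, and the default threshold of 0.6 is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in string similarity (difflib ratio), NOT the patent's formula (2)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_hashtags(hashtags, threshold=0.6):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches the similarity threshold."""
    clusters = [[tag] for tag in hashtags]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean pairwise similarity between clusters.
                sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
                average = sum(sims) / len(sims)
                if best is None or average > best[0]:
                    best = (average, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


tags = ["ctrip website down", "ctrip server down", "mers in guangdong"]
print(cluster_hashtags(tags))
```

Because the Hashtag strings are short and topic-specific, a string-level measure separates the two Ctrip Hashtags from the MERS Hashtag cleanly, whereas a word-overlap measure over full microblog text would be dominated by event-wide vocabulary.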

(2) Topic cluster result ranking algorithm

The present invention carried out a comparison experiment of topic cluster result ranking algorithms, i.e. ranking simply by microblog count (formula 3) versus ranking by the weighted relation between microblog count and the number of topics in the cluster result (formula 4); in this experiment the distance calculation of hierarchical clustering was fixed to the string similarity method. The experimental results are shown in Table 5:

Table 5: Comparison experiment of topic cluster result ranking algorithms

The experimental results in Table 5 show that using only the number of microblogs in a cluster as the ranking index performs worse than the ranking method that also uses the number of topics under the cluster. The reason is that the number of topics clustered under a single cluster also largely reflects how hotly the topic is discussed, i.e. the more topics a topic cluster result contains, the hotter the discussion.

Step 4: Microblog topic correction algorithm;

The clustering algorithm initially returns a series of ranked cluster results. Because clustering relies mainly on the surface form of the Hashtag strings, it tends to cluster microblogs with similar Hashtags together. In the actual corpus, however, there are many cases where the Hashtag does not match the microblog content, as shown in Table 6:

Table 6: Examples of mismatch between Hashtag and content in microblogs

The two microblog examples in Table 6 show that if sentiment distribution statistics are computed simply over the microblogs corresponding to a Hashtag, the sentiment distribution of the topic becomes inaccurate, and negative emotions may even appear under topics that the public clearly supports. For example, under "Chai Jing haze video" in Table 6, the netizen in fact holds negative emotions toward "polluting enterprises" rather than toward the haze video.

The work in step 4 addresses this phenomenon. The goal is to perform topic correction on the microblogs whose Hashtag information contradicts the topic discussed in the microblog text (assigning event-related microblogs to the correct topic), thereby further self-expanding the microblogs related to each topic; the present invention defines this task as the microblog topic correction task. How many microblogs the algorithm adds to the data set on average, and what the hit rate of the appended microblogs is, are then the indices for evaluating the quality of this algorithm. The flow chart of this part of the algorithm is shown in Fig. 5.

To evaluate these two aspects, two indices were designed: the average append rate of the microblog count and the average hit rate of appended microblogs. Because the algorithm is an iterative semi-supervised learning algorithm, line charts are used to present the experimental results, with the number of iterations on the horizontal axis and the two indices on the vertical axes, as shown in Fig. 6 and Fig. 7.

The experimental corpus is the microblog event data introduced in step 1. As shown in Fig. 6, during the iterations the append rate of the microblog count rises from an initial 11.6% to a peak of 16.9% and then falls. Comparison of the experimental data shows the reason: in the early iterations, the number of training microblogs added to the classification model grows, which increases the features available to the model, so more samples can be predicted; in later iterations, the number of remaining samples to predict is insufficient, so the append rate falls.

The experimental results in Fig. 7 were obtained by sampling and annotating the appended microblogs. They show that the average hit rate of appended microblogs falls steadily from 88.5% in the first iteration to 44.5% in the ninth. The reason is that as the model is updated iteratively, the training corpus keeps growing on the one hand, while training noise keeps accumulating on the other, which ultimately keeps weakening the classifier.

Finally, the present invention traded off the average append rate of the microblog count against the average hit rate of appended microblogs and set the number of iterations to 2, which gives the best combined result for the number of self-expanded microblogs and the hit rate of appended microblogs.

Claims

1. A Chinese microblog topic information processing method, characterized in that the processing method comprises the following steps:

Step 1: Determination of microblogs related to a hot event;

The microblogs related to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant;

Step 2: Discovery of key microblog topics;

The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, i.e. the text between two "#" symbols in a microblog;

Step 3: Topic clustering and ranking algorithm;

After the microblogs related to the hot event are obtained, topic extraction and clustering-ranking are performed first, where topic extraction refers to extracting and summarizing the topic information in the microblogs, and topic clustering-ranking refers to first clustering topics that are partially similar;

(1) Hierarchical clustering algorithm

The Hashtag string similarity algorithm is used, i.e. string similarity serves as the basis for the distance calculation in clustering, and the calculation formula is as follows:

where H_A and H_B are the Hashtag strings in S_A and S_B, S_A is microblog text A, S_B is microblog text B, LCS is the longest common subsequence of the two strings, and Edit Distance is the edit distance; the similarity value of the two strings is normalized, i.e. the first and second parts of the formula are divided by min(Length(H_A), Length(H_B)) and max(Length(H_A), Length(H_B)) respectively;

(2) Topic cluster result ranking algorithm

RankingScore(topic) = log(topic_weibonumber) · topic_num (4)

In the formula, RankingScore(topic) is the ranking score of topic topic, topic_weibonumber is the number of microblogs contained under the topic, and topic_num is the number of topics merged in the cluster result; the number of microblogs is log-transformed;

Step 4: Microblog topic correction algorithm;

(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, according to the ranking of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;

(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;

(6) The loop restarts from step (2) until the append rate of microblogs for the S topics falls below a threshold, at which point the loop ends;

(7) Final output: the self-expanded S topics and their related microblogs;

Step 5: Evaluation using the Precision@5 (P@5) index;

The P@5 index is used to reflect the quality of the algorithm's ranking results, and the average append rate of the microblog count and the average hit rate of appended microblogs are used as the evaluation indices of the microblog self-expansion algorithm;

The P@5 index is the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the standard answer, i.e. formula (5):

The average hit rate of appended microblogs is the ratio of the number of microblogs appended to an existing topic by the algorithm that are correct hits to the number of microblogs of the current topic, i.e. formula (7):

2. The Chinese microblog topic information processing method according to claim 1, characterized in that the threshold for the append rate of microblogs for the S topics in step 4 is 0.1.

3. The Chinese microblog topic information processing method according to claim 1 or 2, characterized in that, in step 4, after the final output is obtained in step (7), steps (1) to (7) are repeated, with the self-expanded S topics and their related microblogs finally output by step (7) as the initial input.