A Chinese microblog topic information processing method
Technical field
The present invention relates to microblog topic information processing methods.
Background art
Microblogging is an emerging social media platform and the most popular one in China, with hundreds of millions of active users. More and more netizens choose to obtain and share information that interests them on microblogs. Given the millions of microblog posts produced every day, analyzing netizens' opinions and attitudes toward a given event is meaningful work, and more and more scholars have begun to pay attention to the information behind big data such as microblogs.
Because microblogging has been part of people's lives as a form of social media for only a short time, there has been little research, in China or abroad, on the causes of emotion distribution in microblog events. The main microblog event mining methods at this stage are as follows. In 2011, Weng et al. applied wavelet transform theory to monitor the frequency of certain terms in microblog text, analyzed their autocorrelation to screen out bursty words, and clustered these into emergency events (document [1]: Weng J, Lee B S. Event Detection in Twitter[J]. ICWSM, 2011, 11: 401-408). This method is somewhat effective for event monitoring but is easily disturbed by noise. To rank hot terms in microblogs, Zhao et al. computed a probability value from information such as the forwarding rate and term frequency of microblogs containing key terms, and derived from this probability a ranking formula based on "interestingness" (document [2]: Zhao W X, Jiang J, He J, et al. Topical keyphrase extraction from twitter[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011: 379-388). Spina et al. listed the existing text extraction approaches and performed topic extraction on a small annotated microblog corpus. Surprisingly, the simplest extraction method, based on term frequency/inverse document frequency, achieved the best results, and the work also showed that noun-filtering preprocessing is effective for this task (document [3]: Spina D, Meij E, de Rijke M, et al. Identifying entity aspects in microblog posts[C]//Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012: 1089-1090). Compared with this relatively rough earlier work, the 2014 work of Abhimanyu Das and Anitha Kannan (document [4]: Das A, Kannan A. Discovering topical aspects in microblogs[C]//Proceedings of COLING. The 25th International Conference on Computational Linguistics: Technical Papers, 2014: 860-871) is much more thorough. To mine hot topics in Twitter, they observed the common properties of microblog events and derived three evaluation indicators: diversity, uniqueness, and burstiness. Using a weakly annotated training corpus, they fitted the data distribution with a Gaussian mixture model to decide whether a candidate aspect is a microblog event. This supervised topic extraction method also achieves good results, but unfortunately the algorithm does not address the clustering and ranking of topics.
Microblog sentiment can be classified in various ways. Following conventional classification methods, it can be divided at a coarse granularity into two classes, positive and negative, or at a fine granularity into five classes: happiness, anger, sadness, fear, and surprise (document [5]: Zhao Y, Qin B, Liu T, et al. Social sentiment sensor: a visualization system for topic detection and topic sentiment analysis on micro-blog[J]. Multimedia Tools and Applications, 2014, 22(1): 1-18). In SemEval-2015 (Semantic Evaluation), Rosenthal et al. reported a microblog sentiment classification algorithm based on feature extraction that achieved world-leading results (document [6]: Rosenthal S, Nakov P, Kiritchenko S, et al. SemEval-2015 task 10: Sentiment analysis in twitter[C]//Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval. 2015).
Summary of the invention
The present invention aims to solve the problems that the hierarchical clustering algorithm and the correction algorithm used in current microblog topic information processing methods have low accuracy and that event-related microblogs cannot be assigned to the correct topic, and accordingly proposes a more accurate microblog topic information processing method.
A Chinese microblog topic information processing method is implemented according to the following steps:
Step 1: Determination of microblogs relevant to a hot event;
The microblogs relevant to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant;
Step 2: Discovery of key microblog topics;
By counting the Hashtag information in the microblogs, the topic information in the hot-event microblogs is mined, wherein the Hashtag is the topic information, i.e., the text between two "#" symbols in a microblog;
Step 3: Topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first, wherein topic extraction means extracting and summarizing the topic information in the microblogs, and topic clustering-ranking means first clustering topics that are partially similar;
(1) Hierarchical clustering algorithm
The Hashtag string similarity algorithm is used, i.e., string similarity serves as the basis for distance calculation in clustering. In the calculation formula, H_A and H_B are the Hashtag strings in S_A and S_B, S_A is microblog text A, S_B is microblog text B, LCS is the longest common subsequence of the two strings, and Edit Distance is the edit distance. The similarity values of the two strings are normalized, i.e., the first and second parts of the formula are divided by min(Length(H_A), Length(H_B)) and max(Length(H_A), Length(H_B)), respectively;
(2) Topic clustering result ranking algorithm
The weighted relationship between the number of microblogs and the number of topics in the clustering result is used as the ranking formula:
$RankingScore(topic) = \log(topic_{weibonumber}) \cdot topic_{num}$ (4)
where RankingScore(topic) is the ranking score of the topic, topic_{weibonumber} is the number of microblogs contained under the topic, and topic_{num} is the number of topics merged in the result; the number of microblogs is log-transformed;
Step 4: Microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, according to their similarity ranking with respect to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;
(6) The loop restarts from step (2) and ends when the adding rate of microblogs for the S topics falls below a threshold;
(7) Final output: the S self-expanded topics and their relevant microblogs;
Step 5: Evaluation using the precision@5 (P@5) metric;
The P@5 metric reflects the quality of the algorithm's ranking results, and the average microblog adding rate and the average hit rate of added microblogs serve as evaluation metrics for the microblog self-expansion algorithm;
The P@5 metric, given by formula (5), is the ratio of the number of correctly predicted topics among the top 5 ranking results to the number of topics in the top-5 standard answers;
The average microblog adding rate, given by formula (6), is the average over topics of the adding rate of the microblogs relevant to each topic after self-expansion;
The average hit rate of added microblogs, given by formula (7), is the ratio of the number of correctly hit microblogs among those the algorithm added to existing topics to the number of microblogs of the current topic.
Effects of the invention:
Because the public shows different emotion distributions toward different topics of a single event, the present invention uses two methods, an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, to mine event topics and their relevant microblogs. It finally uses a mature Chinese microblog sentiment classification algorithm (Zekui Li, Yanyan Zhao, Bing Qin, et al. Feature Engineering for Chinese Micro-blog Sentiment Classification[J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 570-579) to perform emotion distribution statistics and analysis on the relevant microblogs.
The topic clustering distance algorithm comparison experiment shows that the accuracy of the TF/IDF method, 53.3%, is far below the 78.7% accuracy of the method based on Hashtag string similarity. The topic clustering result ranking algorithm comparison experiment shows that the accuracy of ranking by microblog count alone, 66.7%, is below the 78.7% accuracy of ranking by the weighted relationship between microblog count and the number of topics in the clustering result. In the microblog event topic correction experiments, the average microblog adding rate was traded off against the average hit rate of added microblogs and the number of iterations was set to 2, which gives the best combined effect of the number of self-expanded microblogs and the hit rate of added microblogs.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a line chart of the change in emotion distribution for the "MERS invades Guangdong" event from May 22, 2015 to June 4, 2015;
Fig. 3 is a flowchart of the microblog topic information processing algorithm;
Fig. 4 is a flowchart of the topic clustering and ranking algorithm;
Fig. 5 is a flowchart of the microblog topic correction model based on semi-supervised learning;
Fig. 6 shows the results of the microblog topic correction algorithm: average microblog adding rate;
Fig. 7 shows the results of the microblog topic correction algorithm: average hit rate of added microblogs.
Detailed description of the embodiments
Specific embodiment 1: As shown in Fig. 1, a Chinese microblog topic information processing method comprises the following steps:
Step 1: Determination of microblogs relevant to a hot event;
The microblogs relevant to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant; the language technology platform is the Harbin Institute of Technology Language Technology Platform (Che W, Li Z, Liu T. LTP: A Chinese language technology platform[C]//Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010: 13-16);
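The keyword matching in step 1 can be illustrated with a minimal sketch. It assumes each microblog has already been segmented into tokens by the preprocessing stage (the platform call itself is not shown), and the names `is_relevant`, `filter_relevant` and the `min_hits` parameter are hypothetical, introduced only for illustration.

```python
def is_relevant(tokens, keywords, min_hits=1):
    """Return True if at least `min_hits` event keywords occur in the tokens.

    `tokens` is assumed to be the word list produced by the preprocessing
    step; `min_hits` is an illustrative parameter, not taken from the text.
    """
    return len(set(tokens) & set(keywords)) >= min_hits


def filter_relevant(tokenized_posts, keywords, min_hits=1):
    """Keep only the tokenized microblogs judged relevant to the event."""
    return [tokens for tokens in tokenized_posts
            if is_relevant(tokens, keywords, min_hits)]
```

A stricter filter can be obtained by raising `min_hits`, at the cost of dropping short posts that mention the event only once.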
Step 2: Microblog topic discovery;
By counting the Hashtag information in the microblogs, the topic information in the hot-event microblogs is mined, wherein the Hashtag is the topic information, i.e., the text between two "#" symbols in a microblog;
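As a rough illustration of step 2, the sketch below extracts the text between paired "#" symbols and counts how many microblogs mention each Hashtag. The function names are hypothetical, and counting each Hashtag at most once per microblog is an assumption made here, not something the text specifies.

```python
import re
from collections import Counter

# A Hashtag is the text enclosed by a pair of "#" symbols, e.g. "#topic#".
HASHTAG_RE = re.compile(r"#([^#]+)#")


def extract_hashtags(text):
    """Return the non-empty Hashtag strings found in one microblog."""
    tags = (tag.strip() for tag in HASHTAG_RE.findall(text))
    return [tag for tag in tags if tag]


def count_topics(microblogs):
    """Count the number of microblogs in which each Hashtag appears."""
    counts = Counter()
    for text in microblogs:
        # Assumption: a Hashtag repeated inside one post is counted once.
        counts.update(set(extract_hashtags(text)))
    return counts
```

Calling `count_topics(posts).most_common(n)` then gives the n most frequently used Hashtags as candidate topics.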
Step 3: Topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first, wherein topic extraction means extracting and summarizing the topic information in the microblogs, and topic clustering-ranking means first clustering topics that are partially similar;
The microblogs relevant to the hot event are fed into the topic discovery module as input. The corresponding topic names are first extracted by Hashtag matching, and N topics are mined using strategies such as ranking by microblog count and filtering. The topic clustering module applies a hierarchical clustering (Hierarchical Cluster) algorithm (Kaufman L, Rousseeuw P J. Finding groups in data: An introduction to cluster analysis[M]. John Wiley & Sons, 2009) together with some filtering rules to obtain K candidate merged topics, which pass through the ranking algorithm and serve as the input for key topic extraction;
(1) Hierarchical clustering algorithm
After comparing the TF/IDF similarity algorithm with the Hashtag string similarity algorithm, the Hashtag string similarity was finally selected, i.e., string similarity serves as the basis for distance calculation in clustering, wherein TF/IDF is the term frequency/inverse document frequency computed over the microblogs. The two similarity calculations are formula (1) and formula (2):
$Similarity_{TFIDF}(S_A, S_B) = cosine(TFIDF(S_A), TFIDF(S_B))$ (1)
In the formulas, S_A is microblog text A, S_B is microblog text B, the Hashtag strings in S_A and S_B are H_A and H_B respectively, cosine is the function that computes the cosine of the angle between two vectors, LCS is the longest common subsequence of two strings, and Edit Distance is the edit distance. In formula (2), given that the Hashtag strings in S_A and S_B are H_A and H_B respectively, the longer the longest common subsequence of the two strings and the shorter their edit distance, the higher their similarity. To make the formula general, the similarity values of the two strings are normalized, i.e., the first and second parts of the formula are divided, respectively, by the length of the shorter of H_A and H_B, min(Length(H_A), Length(H_B)), and by the length of the longer string, max(Length(H_A), Length(H_B));
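The two ingredients of the Hashtag string similarity, the longest common subsequence and the edit distance, can be sketched as follows. The text states only that the LCS part is divided by the shorter length and the edit-distance part by the longer length; combining the two parts by subtraction in `hashtag_similarity` is an assumption made here so that a longer LCS and a shorter edit distance both raise the score. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ch in enumerate(a, 1):
        cur = [i]
        for j, bj in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ch != bj)))  # substitution
        prev = cur
    return prev[-1]


def hashtag_similarity(h_a, h_b):
    """Normalized Hashtag similarity (sketch; the combination is assumed)."""
    if not h_a or not h_b:
        return 0.0
    shorter, longer = min(len(h_a), len(h_b)), max(len(h_a), len(h_b))
    lcs_part = lcs_length(h_a, h_b) / shorter       # normalized by min length
    edit_part = edit_distance(h_a, h_b) / longer    # normalized by max length
    # ASSUMPTION: the two normalized parts are combined by subtraction.
    return lcs_part - edit_part
```

Under this assumed combination, identical Hashtags score 1.0 and Hashtags with no character in common score -1.0.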
(2) Topic clustering result ranking algorithm
Formula (3), which ranks by microblog count alone, was compared with formula (4), which ranks by the weighted relationship between microblog count and the number of topics in the clustering result, and formula (4), which takes more factors into account, was chosen as the ranking formula;
$RankingScore(topic) = topic_{weibonumber}$ (3)
$RankingScore(topic) = \log(topic_{weibonumber}) \cdot topic_{num}$ (4)
where RankingScore(topic) is the ranking score of the topic, topic_{weibonumber} is the number of microblogs contained under the topic, and topic_{num} is the number of topics merged in the result. To make the two quantities comparable, the number of microblogs is log-transformed. Formula (4) extends formula (3) by also considering the number of topics within a single cluster after clustering;
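Formula (4) can be tried out with a small sketch. The base of the logarithm is not stated in the text; the natural logarithm is assumed here, which does not affect the resulting order. The tuple layout and function names are illustrative only.

```python
import math


def ranking_score(weibo_number, topic_num):
    """Formula (4): log(number of microblogs) times number of merged topics.

    Assumes weibo_number >= 1; natural log is an assumption (any base
    gives the same ranking order).
    """
    return math.log(weibo_number) * topic_num


def rank_clusters(clusters):
    """Rank clusters given as (name, weibo_number, topic_num) tuples.

    Returns (name, score) pairs sorted from highest to lowest score.
    """
    scored = [(name, ranking_score(n, k)) for name, n, k in clusters]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, a cluster with 100 microblogs merged from 3 topics outranks a cluster with 1000 microblogs from a single topic, whereas formula (3) would order them the other way round.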
Step 4: Microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, according to their similarity ranking with respect to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;
(6) The loop restarts from step (2) until the loop termination condition is reached, i.e., the adding rate of microblogs for the S topics falls below a threshold;
(7) Final output: the S self-expanded topics and their relevant microblogs;
Table 1: Feature templates for automatic key topic extraction
Step 5: Evaluation using the precision@5 (P@5) metric;
The P@5 metric reflects the quality of the algorithm's final ranking results, and the average microblog adding rate and the average hit rate of added microblogs serve as evaluation metrics for the microblog self-expansion algorithm;
The P@5 metric, given by formula (5), is defined as the ratio of the number of correctly predicted topics among the top 5 ranking results to the number of topics in the top-5 standard answers;
The average microblog adding rate, given by formula (6), is the average over topics of the adding rate of the microblogs relevant to each topic after self-expansion;
The average hit rate of added microblogs, given by formula (7), is the ratio of the number of correctly hit microblogs among those the algorithm added to existing topics to the number of microblogs of the current topic.
Specific embodiment 2: This embodiment differs from specific embodiment 1 in that the threshold for the adding rate of microblogs for the S topics in step 4 is 0.1.
Specific embodiment 3: This embodiment differs from specific embodiments 1 and 2 in that, in step 4, after the final output is obtained in step (7), steps (1) to (7) are repeated, with the S self-expanded topics and their relevant microblogs output by step (7) as the initial input.
Example 1:
Analysis of actual microblog events shows that the emotion distribution of a microblog hot event changes over time. Take the "MERS invades Guangdong" event, one of the microblog hot events of May 2015: its emotion distribution line chart (from May 22 to June 4) is shown in Fig. 2, where the five lines represent the changes in five emotions: happiness, anger, sadness, fear, and surprise. It can be seen that microblogs on the event began to be captured from May 22, when the first MERS case appeared in South Korea, and that the event kept fermenting over the following days. In the figure, the netizens' emotion distribution changes at time points such as May 26, May 29, May 30, and June 1.
Why do netizens show different emotion distributions toward this event as time goes by? Systematic analysis shows the following. On May 26, the news reported that "a South Korean MERS virus carrier has been diagnosed in Guangdong", and the public mood was fear mixed with sadness. On May 29, a rumor spread on microblogs that "the condition of the virus patient receiving isolation treatment has worsened", and netizens' mood turned predominantly to fear. On May 30, the news disclosed problems such as "supervision loopholes and weak technology in South Korea's medical departments", and the mood turned predominantly to anger. On June 1, major websites reported the moving deeds of the doctors and nurses of a Huizhou hospital ICU, who drew lots to go on duty so that the epidemic would not spread; a wave of encouragement, support, and praise immediately surged across the country, and the proportion of happiness in the mood rose accordingly.
The analysis of the "MERS invades Guangdong" event shows that the emotion distribution that changes over time in fact represents netizens' different views on different aspects of the event. For example, in this event the "fear" emotion is directed at the news item "virus patient's condition worsens", while "anger" is directed at topics such as "supervision loopholes in South Korea's medical departments". The present invention defines the different aspects of an event that are discussed as the topics of a single event.
As shown in Fig. 3, a Chinese microblog topic information processing method comprises the following steps:
Step 1: Determination of microblogs relevant to a hot event;
The microblogs relevant to a single hot event are input, the text is preprocessed using a language technology platform, and a keyword matching method is used to judge whether each microblog is relevant;
The present invention collected the microblog hot events that occurred between May 2015 and July 15 together with their relevant microblogs, as shown in Table 2;
Table 2: The 15 microblog hot events used in the experiments (partial)
The second column of Table 2 gives the name of the microblog hot event and the third column gives the corresponding number of microblogs; the total number of microblogs is 1,210,000.
Step 2: Discovery of key microblog topics;
By counting the Hashtag information in the microblogs, the topic information in the hot-event microblogs is mined, wherein the Hashtag is the topic information, i.e., the text between two "#" symbols in a microblog;
For each event described in step 1, the 5 most popular topics representing different aspects were manually annotated as the standard answers; some examples are shown in Table 3.
Table 3: Annotated sample of experimental data
Table 3 lists the "Qing'an County shooting incident" and its five annotated topics on different aspects.
Step 3: Topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first, wherein topic extraction means extracting and summarizing the topic information in the microblogs, and topic clustering-ranking means first clustering topics that are partially similar. For example, the two topics "Ctrip official website hacked, possibly a man-made disaster" and "Ctrip server hacked, possibly a man-made disaster" are very similar. The merged topics are then ranked by how hotly they are discussed. The flowchart of this part of the algorithm is shown in Fig. 4.
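Merging similar Hashtags can be pictured with a small agglomerative clustering sketch. The text specifies hierarchical clustering with a similarity threshold but not the linkage rule or the threshold value; single linkage and `threshold=0.5` are assumptions here, and `stand_in_similarity` (based on Python's `difflib`) is only a placeholder for the Hashtag string similarity of formula (2).

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Placeholder similarity in [0, 1]; NOT the patent's formula (2)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_hashtags(tags, threshold=0.5, similarity=stand_in_similarity):
    """Agglomerative clustering: repeatedly merge the two most similar
    clusters until no pair reaches `threshold`.

    ASSUMPTIONS: single linkage (cluster similarity = best pairwise
    similarity) and the default threshold are illustrative choices.
    """
    clusters = [[tag] for tag in tags]
    while len(clusters) > 1:
        best_sim, best_i, best_j = -1.0, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_sim, best_i, best_j = sim, i, j
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        # best_j > best_i, so popping best_j leaves index best_i valid.
        clusters[best_i].extend(clusters.pop(best_j))
    return clusters
```

The number of Hashtags in each returned cluster corresponds to the merged-topic count used in the ranking formula, and the sum of their microblog counts gives the cluster's microblog count.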
(1) Hierarchical clustering algorithm
The invention describes two distance calculation methods for hierarchical clustering: a similarity method based on microblog term frequency/inverse document frequency (formula 1) and a method based on Hashtag string similarity (formula 2). The P@5 results of the topic clustering distance algorithm comparison experiment are shown in Table 4:
Table 4: Comparison experiment of topic clustering distance calculation methods
The results in Table 4 show that the accuracy of the TF/IDF method (the similarity method based on microblog term frequency/inverse document frequency) is far below that of the string similarity method. Because the words used in microblogs about a single event overlap heavily (for example, words such as "Ctrip" and "server" occur frequently in microblogs about the "Ctrip official website hacked" event), the clustering process is disturbed and the optimal similarity threshold for hierarchical clustering is difficult to determine. Switching to the Hashtag string similarity method, combined with the iterative process of hierarchical clustering, achieved good results on this task.
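For comparison, the TF/IDF similarity of formula (1) can be sketched as a cosine between sparse TF-IDF vectors. The text does not give the exact TF-IDF weighting; the smoothed inverse document frequency used below is an assumption, and the input is assumed to be microblogs already segmented into tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict term -> weight) per tokenized doc.

    ASSUMPTION: idf = log((1 + N) / (1 + df)) + 1, a common smoothed
    variant; the text does not specify the weighting scheme.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Microblogs about the same event share many tokens, so their pairwise cosine values tend to be uniformly high, which is consistent with the difficulty of choosing a clustering threshold described above.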
(2) Topic clustering result ranking algorithm
The present invention carried out a comparison experiment of topic clustering result ranking algorithms, comparing ranking simply by microblog count (formula 3) with ranking by the weighted relationship between microblog count and the number of topics in the clustering result (formula 4). In this experiment the distance calculation for hierarchical clustering was fixed to the string similarity method. The experimental results are shown in Table 5:
Table 5: Comparison experiment of topic clustering result ranking algorithms
The experimental results in Table 5 show that using only the number of microblogs in each cluster as the ranking indicator is less effective than the ranking method that also takes into account the number of topics within the cluster. The reason is that the number of topics merged into a single cluster also largely reflects how hotly the topic is discussed: the more topics a topic clustering result contains, the hotter the discussion.
Step 4: Microblog topic correction algorithm;
The clustering algorithm initially returns a series of ranked clustering results. Because clustering works mainly on the surface form of the microblogs' Hashtag strings, it readily groups microblogs with similar Hashtags together. In the actual corpus, however, there are many cases in which the Hashtag does not match the microblog content, as shown in Table 6:
Table 6: Examples of mismatch between Hashtag and content in microblogs
The two microblog examples in Table 6 show that if emotion distribution statistics are computed directly on the microblogs attached to each Hashtag, the emotion distribution of the topic becomes distorted, and negative emotion can even appear under topics toward which the public clearly holds a supportive attitude. For example, for the "Chai Jing haze video" topic in Table 6, netizens in fact hold negative emotion toward "polluting enterprises" rather than toward the haze video.
The work of step 4 addresses this phenomenon. The goal is to perform topic correction on those microblogs whose Hashtag information contradicts the topic actually discussed in the microblog text (that is, to assign event-related microblogs to the correct topic), thereby further self-expanding the microblogs relevant to each topic. The present invention defines this task as the microblog topic correction task. How many microblogs the algorithm adds to the data set on average, and what the hit rate of the added microblogs is, are the indicators for evaluating the quality of the algorithm. The flowchart of this part of the algorithm is shown in Fig. 5.
To evaluate these two aspects, two metrics were designed: the average microblog adding rate and the average hit rate of added microblogs. Because the algorithm is an iterative semi-supervised learning algorithm, the experimental results are presented as line charts in which the horizontal axis is the number of iterations and the vertical axis is each of the two metrics, as shown in Fig. 6 and Fig. 7.
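The evaluation metrics can be written down as a short sketch following the verbal definitions in the text. The per-topic adding rate is assumed to be (count after expansion minus count before) divided by the count before, and the denominator of the hit rate is left to the caller because the text describes it only as "the number of microblogs of the current topic". All names are hypothetical.

```python
def precision_at_5(ranked_topics, gold_topics):
    """P@5: correctly predicted topics in the top 5 / topics in the
    top-5 standard answers."""
    top = ranked_topics[:5]
    gold = set(gold_topics[:5])
    if not gold:
        return 0.0
    return sum(1 for topic in top if topic in gold) / len(gold)


def average_adding_rate(before, after):
    """Mean over topics of the relative growth in microblog count.

    ASSUMPTION: per-topic adding rate = (after - before) / before.
    `before` and `after` map topic -> microblog count.
    """
    rates = [(after[t] - before[t]) / before[t] for t in before if before[t] > 0]
    return sum(rates) / len(rates) if rates else 0.0


def average_hit_rate(hits, topic_counts):
    """Mean over topics of (correctly hit added microblogs / topic count).

    `topic_counts` is whatever count is taken as the denominator for each
    topic; which count that is must be decided by the caller.
    """
    rates = [hits.get(t, 0) / n for t, n in topic_counts.items() if n > 0]
    return sum(rates) / len(rates) if rates else 0.0
```

Plotting `average_adding_rate` and `average_hit_rate` against the iteration number gives the two kinds of curve described for Fig. 6 and Fig. 7.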
The experimental corpus is the microblog event data introduced in step 1. As shown in Fig. 6, over the iterations of the algorithm the microblog adding rate rises from 11.6% at the start to a peak of 16.9% and then falls. Comparison of the experimental data shows the reason: in the early iterations, the number of training microblogs added to the classification model increases, which enriches the model's features so that more samples can be predicted; in the later iterations, the number of remaining samples to be predicted becomes insufficient, so the adding rate falls.
The experimental results in Fig. 7 were obtained by sampling and annotating the added microblogs. They show that the average hit rate of added microblogs falls steadily, from 88.5% in the first iteration to 44.5% in the ninth. The reason is that, as the model is updated iteratively, the training corpus keeps growing on the one hand, while training noise keeps accumulating on the other, which ultimately weakens the classifier's ability to classify.
Finally, the present invention trades off the average microblog adding rate against the average hit rate of added microblogs and sets the number of iterations to 2, which gives the best combined effect of the number of self-expanded microblogs and the hit rate of added microblogs.