A Chinese microblog topic information processing method
Technical field
The present invention relates to a microblog topic information processing method.
Background art
Microblogging is an emerging social media platform and one of the most popular social media platforms in China, with hundreds of millions of active users. More and more netizens choose to obtain and share information that interests them on microblogs. Given the millions of microblogs posted every day, analyzing netizens' views on and attitudes toward a given event is meaningful work, and a growing number of scholars have begun to study the information behind this large volume of microblog data.
Because microblogging has been part of people's lives as a form of social media for only a short time, there has been relatively little research, in China or abroad, on analyzing the causes of the emotion distribution of microblog events. The microblog event mining methods of the present stage mainly include the following. In 2011, Weng et al. applied the theory of the wavelet transform to monitoring the frequencies of terms in microblog text, selected bursty words by analyzing their autocorrelation, and clustered them into emergent events (document [1]: Weng J, Lee B S. Event Detection in Twitter [J]. ICWSM, 2011, 11: 401-408). This method is effective to some extent for event detection but is susceptible to noise. To rank the hot terms in microblogs, Zhao et al. computed a probability value from information such as the retweet rate of the microblogs containing a key term and the term frequency, and derived from this probability a ranking formula based on "interestingness" (document [2]: Zhao W X, Jiang J, He J, et al. Topical keyphrase extraction from Twitter [C] // Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011: 379-388). Spina et al. surveyed the existing text extraction approaches and performed topic extraction on a small annotated microblog corpus; surprisingly, the simplest extraction method, based on term frequency/inverse document frequency, achieved the best result, and the work also showed that noun-filtering preprocessing is effective for this task (document [3]: Spina D, Meij E, de Rijke M, et al. Identifying entity aspects in microblog posts [C] // Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2012: 1089-1090).
Compared with this relatively coarse earlier work, the 2014 work of Abhimanyu Das and Anitha Kannan (document [4]: Das A, Kannan A. Discovering topical aspects in microblogs [C] // Proceedings of COLING. The 25th International Conference on Computational Linguistics: Technical Papers, 2014: 860-871) is considerably more thorough. To mine hot topics on Twitter, they observed the common characteristics of microblog events and derived three evaluation indices, namely diversity, uniqueness, and burstiness. They fitted the data distribution with a Gaussian mixture model on a weakly labeled corpus and thereby output whether a candidate aspect is a microblog event. This supervised topic extraction method also achieves good results, but the algorithm does not address the clustering and ranking of topics.
Microblog sentiment classification takes many forms. For example, the traditional coarse-grained approach divides sentiment into two classes, positive and negative, while a fine-grained approach divides it into five classes, "happiness, anger, sadness, fear, and surprise" (document [5]: Zhao Y, Qin B, Liu T, et al. Social sentiment sensor: a visualization system for topic detection and topic sentiment analysis on micro-blog [J]. Multimedia Tools and Applications, 2014, 22(1): 1-18). The feature-extraction-based microblog sentiment classification algorithm used by Rosenthal et al. in SemEval 2015 (Semantic Evaluation) achieved the best results in the world at the time (document [6]: Rosenthal S, Nakov P, Kiritchenko S, et al. SemEval-2015 Task 10: Sentiment analysis in Twitter [C] // Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval. 2015).
Summary of the invention
The present invention is proposed to solve the problem that the hierarchical clustering algorithms and correction algorithms used in current microblog topic information processing methods have low accuracy and cannot assign event-related microblogs to the correct topics, and it provides a more accurate microblog topic information processing method.
A Chinese microblog topic information processing method is implemented according to the following steps:
Step one: determination of the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to determine whether each microblog is relevant;
Step two: discovery of key microblog topics;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
Step three: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-based ranking are carried out first, where topic extraction means extracting and summarizing the topic information described by the microblogs, and clustering-based ranking means first clustering the topics that are partly similar;
(1) hierarchical clustering algorithm
The Hashtag string similarity algorithm is adopted, that is, string similarity serves as the basis for the distance calculation in clustering. The calculation formula is as follows:
where $H_A$ and $H_B$ are the Hashtag strings in $S_A$ and $S_B$, $S_A$ is microblog text A, $S_B$ is microblog text B, LCS is the longest common subsequence of the two strings, and EditDistance is the edit distance. The two string similarity values are normalized, that is, the first and second parts of the formula are divided by $\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$ and $\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$, respectively;
(2) topic cluster result sort algorithm
A weighted combination of the number of microblogs and the number of topics in the cluster result is adopted as the ranking formula:
$$\mathrm{RankingScore}(topic) = \log(topic_{weibonumber}) \cdot topic_{num} \quad (4)$$
In the formula, $\mathrm{RankingScore}(topic)$ is the ranking score of the topic $topic$, $topic_{weibonumber}$ is the number of microblogs contained under the topic, and $topic_{num}$ is the number of topics merged into the result; the logarithm is applied to the number of microblogs;
Step four: microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics are designated "non-seed topics"; the U topics are divided, in order of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) The microblogs in U1 whose classification probability is greater than a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1 at the same time;
(6) The loop repeats from step (2) until the adding rate of the microblogs corresponding to the S topics is less than a threshold, at which point the loop ends;
(7) Final output: the S self-expanded topics and their related microblogs;
Step five: evaluation with the precision-at-5 (P@5) index;
The P@5 index is used to reflect the quality of the ranking results of the algorithm, and the average adding rate of the number of microblogs and the average hit rate of the added microblogs are used as the evaluation indices of the microblog self-expansion algorithm;
The P@5 index is the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 gold-standard answers, i.e. formula (5):
The average adding rate of the number of microblogs is the mean, over topics, of the adding rate of the microblogs related to each topic after expansion, i.e. formula (6):
The average hit rate of the added microblogs is the ratio of the number of microblogs correctly appended to an existing topic by the algorithm to the number of microblogs of the current topic, i.e. formula (7):
Effects of the invention:
Different topics of a single event elicit different emotion distributions among the public. To address this phenomenon, the present invention uses two methods, an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, to mine the topics of an event and their related microblogs, and finally applies a mature Chinese microblog sentiment classification algorithm (Zekui Li, Yanyan Zhao, Bing Qin, et al. Feature Engineering for Chinese Micro-blog Sentiment Classification [J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 570-579) to perform emotion distribution statistics and analysis on the related microblogs.
The comparison experiment on topic clustering distance algorithms shows that the accuracy of the TF/IDF method, 53.3%, is far below the 78.7% accuracy of the method based on Hashtag string similarity. The comparison experiment on topic cluster result ranking algorithms shows that the accuracy of ranking by the number of microblogs, 66.7%, is below the 78.7% accuracy of ranking by the weighted combination of the number of microblogs and the number of topics in the cluster result. In the microblog event topic correction experiment, a compromise is made between the average adding rate of the number of microblogs and the average hit rate of the added microblogs: the number of iterations is set to 2, which gives the best combined effect of the number of self-expanded microblogs and the hit rate of the added microblogs.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a line chart of the changes in the emotion distribution of the "MERS enters Guangdong" event from May 22, 2015 to June 4, 2015;
Fig. 3 is a flowchart of the microblog topic information processing algorithm;
Fig. 4 is a flowchart of the topic clustering and ranking algorithm;
Fig. 5 is a flowchart of the microblog topic correction model based on semi-supervised learning;
Fig. 6 shows the results of the microblog topic correction algorithm: average adding rate of the number of microblogs;
Fig. 7 shows the results of the microblog topic correction algorithm: average hit rate of the added microblogs.
Detailed description of the embodiments
Embodiment one: as shown in Fig. 1, a Chinese microblog topic information processing method comprises the following steps:
Step one: determination of the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to determine whether each microblog is relevant. The language technology platform is the Harbin Institute of Technology Language Technology Platform (Che W, Li Z, Liu T. LTP: A Chinese language technology platform [C] // Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010: 13-16);
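The relevance check of step one can be sketched as follows. This is only a minimal illustration, assuming plain substring matching against a hand-picked keyword list; the segmentation performed by the language technology platform is not reproduced, and the names `is_relevant`, `filter_relevant`, and `min_hits` are hypothetical.

```python
from typing import Iterable, List


def is_relevant(text: str, keywords: Iterable[str], min_hits: int = 1) -> bool:
    """Return True if at least `min_hits` distinct keywords occur in `text`.

    Assumption: substring matching stands in for matching on segmented words.
    """
    hits = sum(1 for kw in set(keywords) if kw and kw in text)
    return hits >= min_hits


def filter_relevant(microblogs: Iterable[str], keywords: Iterable[str],
                    min_hits: int = 1) -> List[str]:
    """Keep only the microblogs judged relevant to the event."""
    kws = list(keywords)
    return [m for m in microblogs if is_relevant(m, kws, min_hits)]
```

Raising `min_hits` trades recall for precision when the keyword list contains generic words.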
Step two: microblog topic discovery;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
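Hashtag counting as described in step two can be sketched with a regular expression. This is an illustrative sketch only: the function names are hypothetical, and counting each Hashtag at most once per microblog is an assumption, not something the text specifies.

```python
import re
from collections import Counter
from typing import Iterable, List

# A Hashtag is the text between two "#" symbols, e.g. "#topic#".
HASHTAG_RE = re.compile(r"#([^#]+)#")


def extract_hashtags(text: str) -> List[str]:
    """Return the Hashtag strings found in one microblog, in order."""
    return [tag.strip() for tag in HASHTAG_RE.findall(text) if tag.strip()]


def count_hashtags(microblogs: Iterable[str]) -> Counter:
    """Count, for each Hashtag, the number of microblogs that contain it."""
    counts: Counter = Counter()
    for text in microblogs:
        counts.update(set(extract_hashtags(text)))  # once per microblog
    return counts
```

Calling `count_hashtags(posts).most_common(n)` then gives candidate topics ordered by the number of microblogs.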
Step three: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-based ranking are carried out first, where topic extraction means extracting and summarizing the topic information described by the microblogs, and clustering-based ranking means first clustering the topics that are partly similar;
The microblogs relevant to the hot event are taken as input to the topic discovery module. The corresponding topic names are first extracted by Hashtag matching, and N topics are mined through strategies such as ranking by the number of microblogs and filtering. The topic clustering module then applies the hierarchical clustering (Hierarchical Cluster) algorithm (Kaufman L, Rousseeuw P J. Finding groups in data: an introduction to cluster analysis [M]. John Wiley & Sons, 2009) together with some filtering rules to obtain K merged candidate topics, which pass through the ranking algorithm and serve as the input to key topic extraction;
(1) hierarchical clustering algorithm
By comparing the TF/IDF similarity algorithm with the Hashtag string similarity algorithm, Hashtag string similarity was finally selected, that is, string similarity serves as the basis for the distance calculation in clustering, where TF/IDF refers to microblog term frequency/inverse document frequency. The two similarity formulas are as follows:
$$\mathrm{Similarity}_{TFIDF}(S_A, S_B) = \mathrm{cosine}(\mathrm{TFIDF}(S_A), \mathrm{TFIDF}(S_B)) \quad (1)$$
In the formulas, $S_A$ is microblog text A, $S_B$ is microblog text B, the Hashtag strings in $S_A$ and $S_B$ are $H_A$ and $H_B$ respectively, cosine is the function that computes the cosine of the angle between two vectors, LCS is the longest common subsequence of two strings, and EditDistance is the edit distance. Formula (2) follows the convention that the longer the longest common subsequence of the two strings and the shorter their edit distance, the higher their similarity. To make the formula general, the two string similarity values are normalized, that is, the first and second parts of the formula are divided, respectively, by the length of the shorter string, $\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$, and the length of the longer string, $\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$;
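The Hashtag string similarity can be sketched as below. Formula (2) itself is not legible in this text, so the way the two normalized parts are combined is an assumption: the sketch subtracts the normalized edit distance from the normalized LCS length, which matches the stated convention (longer LCS and shorter edit distance give higher similarity) but may differ from the actual formula. All function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def hashtag_similarity(h_a: str, h_b: str) -> float:
    """Assumed form: LCS / min(len) - EditDistance / max(len)."""
    if not h_a or not h_b:
        return 0.0
    shorter = min(len(h_a), len(h_b))
    longer = max(len(h_a), len(h_b))
    return lcs_length(h_a, h_b) / shorter - edit_distance(h_a, h_b) / longer
```

Under this assumed form the score ranges from -1 (no shared characters) to 1 (identical strings).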
(2) topic cluster result sort algorithm
By comparing formula (3), which ranks by the number of microblogs, with formula (4), which ranks by a weighted combination of the number of microblogs and the number of topics in the cluster result, formula (4), which takes more factors into account, was chosen as the ranking formula;
$$\mathrm{RankingScore}(topic) = topic_{weibonumber} \quad (3)$$
$$\mathrm{RankingScore}(topic) = \log(topic_{weibonumber}) \cdot topic_{num} \quad (4)$$
In the formulas, $\mathrm{RankingScore}(topic)$ is the ranking score of the topic $topic$, $topic_{weibonumber}$ is the number of microblogs contained under the topic, and $topic_{num}$ is the number of topics merged into the result. To make the indices comparable, the logarithm is applied to the number of microblogs. On the basis of formula (3), formula (4) additionally considers the number of topics under a single cluster after clustering;
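Formula (4) can be sketched in a few lines. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the tuple layout and function names are hypothetical.

```python
import math
from typing import List, Tuple

# (cluster name, number of microblogs, number of merged topics)
Cluster = Tuple[str, int, int]


def ranking_score(weibo_number: int, topic_num: int) -> float:
    """Formula (4): log(topic_weibonumber) * topic_num (natural log assumed)."""
    if weibo_number < 1:
        raise ValueError("a cluster must contain at least one microblog")
    return math.log(weibo_number) * topic_num


def rank_clusters(clusters: List[Cluster]) -> List[Cluster]:
    """Sort clusters by descending ranking score."""
    return sorted(clusters, key=lambda c: ranking_score(c[1], c[2]), reverse=True)
```

With this score, a cluster of 100 microblogs merged from 3 Hashtags outranks a cluster of 1000 microblogs under a single Hashtag, which is the effect formula (4) adds over formula (3).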
Step four: microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics are designated "non-seed topics"; the U topics are divided, in order of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) The microblogs in U1 whose classification probability is greater than a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1 at the same time;
(6) The loop repeats from step (2) until the loop termination condition is reached, namely that the adding rate of the microblogs corresponding to the S topics is less than a threshold;
(7) Final output: the S self-expanded topics and their related microblogs;
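The loop in steps (1) to (7) can be sketched as a self-training procedure. This is a sketch under stated assumptions, not the classifier of the invention: the character-frequency scorer in `train` and `predict` is a toy stand-in for the feature extraction and model of step (3), the adding rate is assumed to be the number of newly added microblogs divided by the number already held by the seed topics, and all names and default thresholds other than the 0.1 adding-rate threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

NEG = "__negative__"  # label for the training negative-example set U2


def train(labeled: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Toy stand-in model: character counts per label."""
    return {lab: Counter(ch for t in texts for ch in t) for lab, texts in labeled.items()}


def predict(model: Dict[str, Counter], text: str) -> Tuple[Optional[str], float]:
    """Return (best label, normalized score used as a pseudo-probability)."""
    scores = {}
    for lab, counts in model.items():
        total = sum(counts.values()) or 1
        scores[lab] = sum(counts[ch] for ch in set(text)) / total
    norm = sum(scores.values())
    if norm == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / norm


def self_train(seed: Dict[str, List[str]], u1: List[str], u2: List[str],
               prob_threshold: float = 0.8, rate_threshold: float = 0.1,
               max_iter: int = 10) -> Tuple[Dict[str, List[str]], List[str]]:
    """Expand the seed topics with confidently classified microblogs from U1."""
    topics = {t: list(v) for t, v in seed.items()}
    pool = list(u1)
    for _ in range(max_iter):
        before = sum(len(v) for v in topics.values())
        model = train({**topics, NEG: u2})        # steps (2)-(3)
        remaining, added = [], 0
        for text in pool:                         # step (4)
            label, prob = predict(model, text)
            if label is not None and label != NEG and prob > prob_threshold:
                topics[label].append(text)        # step (5): add and remove from U1
                added += 1
            else:
                remaining.append(text)
        pool = remaining
        if before == 0 or added / before < rate_threshold:
            break                                 # step (6): adding rate below threshold
    return topics, pool                           # step (7)
```

Swapping `train` and `predict` for a real probabilistic classifier leaves the loop unchanged; `max_iter` caps the number of rounds independently of the adding-rate condition.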
Table 1 Feature templates for automatic key topic extraction
Step five: evaluation with the precision-at-5 (P@5) index;
The P@5 index is finally used to reflect the quality of the ranking results of the algorithm, and the average adding rate of the number of microblogs and the average hit rate of the added microblogs are used as the evaluation indices of the microblog self-expansion algorithm;
The P@5 index is defined as the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 gold-standard answers, i.e. formula (5):
The average adding rate of the number of microblogs is the mean, over topics, of the adding rate of the microblogs related to each topic after expansion, i.e. formula (6):
The average hit rate of the added microblogs is the ratio of the number of microblogs correctly appended to an existing topic by the algorithm to the number of microblogs of the current topic, i.e. formula (7):
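The three indices can be sketched as follows. Formulas (5) to (7) are not legible in this text, so the definitions below are assumptions read from the prose: P@5 counts the top-5 predictions that appear among the top-5 gold-standard topics, the adding rate of a topic is the number of added microblogs divided by its original number of microblogs, and the hit rate of a topic is the number of correctly added microblogs divided by the number added. All names are hypothetical.

```python
from typing import Dict, List


def precision_at_5(predicted: List[str], gold: List[str]) -> float:
    """Assumed formula (5): correct topics in the top 5 / gold topics in the top 5."""
    top_pred, top_gold = predicted[:5], gold[:5]
    if not top_gold:
        return 0.0
    return sum(1 for t in top_pred if t in top_gold) / len(top_gold)


def _mean_ratio(numer: Dict[str, int], denom: Dict[str, int]) -> float:
    """Mean over topics of numer[t] / denom[t], skipping empty denominators."""
    rates = [numer.get(t, 0) / n for t, n in denom.items() if n > 0]
    return sum(rates) / len(rates) if rates else 0.0


def average_adding_rate(original: Dict[str, int], added: Dict[str, int]) -> float:
    """Assumed formula (6): mean of added / original per topic."""
    return _mean_ratio(added, original)


def average_hit_rate(added: Dict[str, int], correct: Dict[str, int]) -> float:
    """Assumed formula (7): mean of correctly added / added per topic."""
    return _mean_ratio(correct, added)
```

The hit rate needs manual judgments of the added microblogs, which is why the experiments below estimate it from an annotated sample.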
Embodiment two: this embodiment differs from embodiment one in that, in step four, the threshold for the adding rate of the microblogs corresponding to the S topics is 0.1.
Embodiment three: this embodiment differs from embodiment one or two in that, in step four, after the final output is obtained in step (7), steps (1) to (7) are executed again, with the S self-expanded topics and their related microblogs finally output by step (7) serving as the initial input.
Example one:
When microblog events are analyzed in practice, it is found that the emotion distribution of a hot microblog event changes over time. Take the "MERS enters Guangdong" event of May 2015, one of the hot microblog events: its emotion distribution curves (from May 22 to June 4) are shown in Fig. 2, where five polylines represent the changes in five emotions, namely happiness, anger, sadness, fear, and surprise. It can be seen that from May 22, when the MERS virus first appeared in South Korea, microblogs about the event began to attract attention, and the event continued to ferment over the following days. In the figure, the distribution of netizens' emotions changes at time points such as May 26, May 29, May 30, and June 1.
Why do netizens show different emotion distributions toward this event as time passes? Systematic analysis shows the following. On May 26, the news reported "South Korean MERS virus carrier diagnosed in Guangdong", and the prevailing mood was fear mixed with sadness. On May 29, a rumor spread on microblogs that "the condition of the virus patient under isolation treatment has worsened", and netizens' mood shifted to mostly fear. On May 30, the news disclosed problems such as "regulatory loopholes and weak technology in South Korea's medical authorities", and the mood shifted to mostly anger. On June 1, major websites reported the touching story that "doctors and nurses of the Huizhou hospital ICU drew lots for duty to keep the epidemic from spreading"; a wave of encouragement, support, and praise surged across the country, and the proportion of happiness in the mood rose accordingly.
The analysis of the "MERS enters Guangdong" event shows that the emotion distribution that changes over time in fact represents netizens' different views on different aspects of the event. For example, the "fear" in the emotion distribution of the event is directed at the news "virus patient's condition worsens", while the "anger" is directed at the topic "regulatory loopholes in South Korea's medical authorities". The present invention defines the different aspects of discussion of an event as the topics of a single event.
As shown in Fig. 3, a Chinese microblog topic information processing method comprises the following steps:
Step one: determination of the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to determine whether each microblog is relevant;
The present invention collected 15 hot microblog events that occurred between May and July 2015, together with their related microblogs, as shown in Table 2;
Table 2 Some of the 15 hot microblog events used in the experiments
In Table 2, the second column is the name of the hot microblog event and the third column is the corresponding number of microblogs; the total number of microblogs is 1,210,000.
Step two: discovery of key microblog topics;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
For each event described in step one, the 5 most popular topics representing different aspects of the event are annotated manually as the gold-standard answers; some examples are shown in Table 3.
Table 3 Annotated samples of the experimental data
Table 3 lists the five annotated topics representing different aspects of the "Qing'an County shooting incident".
Step three: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-based ranking are carried out first, where topic extraction means extracting and summarizing the topic information described by the microblogs, and clustering-based ranking means first clustering the topics that are partly similar. For example, the two topics "Ctrip official website hacked, possibly man-made" and "Ctrip server hacked, possibly man-made" are very similar. The merged topics are then ranked by how hotly they are discussed. The flowchart of this part of the algorithm is shown in Fig. 4.
(1) hierarchical clustering algorithm
The present invention considers two distance calculation methods for hierarchical clustering: the similarity method based on microblog term frequency/inverse document frequency (formula 1) and the method based on Hashtag string similarity (formula 2). The P@5 results of the comparison experiment on topic clustering distance algorithms are shown in Table 4:
Table 4 Comparison experiment on topic clustering distance calculation methods
The results in Table 4 show that the accuracy of the TF/IDF method (that is, the similarity method based on microblog term frequency/inverse document frequency) is far below that of the string similarity method. Because the words used in the microblogs related to a single event overlap heavily (for example, words such as "Ctrip" and "server" appear frequently in the microblogs related to the "Ctrip official website hacked" event), the clustering process is disturbed and the optimal similarity threshold for hierarchical clustering is difficult to determine. By contrast, the Hashtag string similarity method, through the iterative process of hierarchical clustering, achieves good results in this task.
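The iterative merging described above can be sketched as a threshold-based agglomerative clustering over Hashtag strings. Several points are assumptions: average linkage is used because the text does not name a linkage, the threshold value 0.6 is arbitrary, and `difflib.SequenceMatcher` from the Python standard library is only a stand-in for the Hashtag string similarity of formula (2). The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List


def default_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not formula (2))."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_similarity(c1: List[str], c2: List[str],
                       sim: Callable[[str, str], float]) -> float:
    """Average linkage: mean pairwise similarity between two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def hierarchical_cluster(hashtags: List[str], threshold: float = 0.6,
                         sim: Callable[[str, str], float] = default_similarity
                         ) -> List[List[str]]:
    """Merge the most similar pair of clusters until none reaches the threshold."""
    clusters = [[h] for h in hashtags]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_similarity(clusters[i], clusters[j], sim))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters
```

Each pass recomputes all pairwise cluster similarities, which is quadratic per merge but adequate for the small number of candidate Hashtags of a single event.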
(2) topic cluster result sort algorithm
The present invention carried out a comparison experiment on topic cluster result ranking algorithms, namely ranking simply by the number of microblogs (formula 3) and ranking by the weighted combination of the number of microblogs and the number of topics in the cluster result (formula 4). In this experiment the distance formula of the hierarchical clustering was fixed to the string similarity method. The experimental results are shown in Table 5:
Table 5 Comparison experiment on topic cluster result ranking algorithms
The experimental results in Table 5 show that ranking solely by the number of microblogs corresponding to each cluster performs worse than the ranking method that also takes into account the number of topics under the cluster. The reason is that the number of topics clustered under a single cluster also largely represents how hotly the topic is discussed: the more topics a topic cluster result contains, the more hotly it is discussed.
Step four: microblog topic correction algorithm;
The clustering algorithm initially returns a series of ranked cluster results. Because the clustering mainly relies on the surface form of the Hashtag strings of the microblogs, it tends to group together microblogs whose Hashtags are similar. In the actual corpus, however, there are many cases in which the Hashtag does not match the microblog content, as shown in Table 6:
Table 6 Examples of mismatch between Hashtag and content in microblogs
The two microblog examples in Table 6 show that if emotion distribution statistics are simply computed over the microblogs corresponding to a Hashtag, the emotion distribution of the topic becomes distorted, and negative emotion can even appear in topics that the public clearly supports. For example, in "Chai Jing haze video" in Table 6, netizens hold negative emotion toward "polluting enterprises" rather than toward the haze video.
The work of step four addresses this phenomenon. The goal is to perform topic correction on those microblogs whose Hashtag information contradicts the theme discussed in the microblog text (that is, to assign event-related microblogs to the correct topics), and thereby to self-expand the microblogs related to each topic. The present invention defines this task as the microblog topic correction task. How many microblogs the algorithm adds to the data set on average, and how high the hit rate of the added microblogs is, are therefore the indices for evaluating the quality of the algorithm. The flowchart of this part of the algorithm is shown in Fig. 5.
To assess these two aspects, two indices are designed: the average adding rate of the number of microblogs and the average hit rate of the added microblogs. Because the algorithm is an iterative semi-supervised learning algorithm, line charts are used to present the experimental results, with the number of iterations on the horizontal axis and the two indices on the vertical axes, as shown in Fig. 6 and Fig. 7.
The experimental corpus is the microblog event data introduced in step one. As shown in Fig. 6, during the iterations of the algorithm the adding rate of the number of microblogs rises from 11.6% to a peak of 16.9% and then falls. Comparison of the experimental data indicates the reason: in the early iterations, the growing number of training microblogs added to the classification model increases the features available to the model, so that more samples can be predicted; in the later iterations, the number of remaining samples to be predicted becomes insufficient, so the adding rate falls.
The experimental results in Fig. 7 are obtained by annotating a sample of the added microblogs. The results show that the average hit rate of the added microblogs decreases continuously, from 88.5% in the first iteration to 44.5% in the ninth. The reason is that as the model is updated iteratively, the training corpus keeps growing on the one hand, while the training noise keeps accumulating on the other, which continuously weakens the classification ability of the classifier.
Finally, the present invention makes a compromise between the average adding rate of the number of microblogs and the average hit rate of the added microblogs and sets the number of iterations to 2, which gives the best combined effect of the number of self-expanded microblogs and the hit rate of the added microblogs.