CN105354216A - Chinese microblog topic information processing method - Google Patents

Publication number
CN105354216A
Authority
CN
China
Prior art keywords
topic
microblogging
microblog
algorithm
information processing
Prior art date
Legal status
Granted
Application number
CN201510627783.0A
Other languages
Chinese (zh)
Other versions
CN105354216B (en)
Inventor
赵妍妍
秦兵
李泽魁
Current Assignee
Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.
Original Assignee
Harbin Institute of Technology
Application filed by Harbin Institute of Technology
Priority to CN201510627783.0A
Publication of CN105354216A
Application granted
Publication of CN105354216B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing

Abstract

The invention discloses a Chinese microblog topic information processing method and relates to algorithms for analyzing the causes of the sentiment distribution of microblog events. The invention aims to solve the problem that the hierarchical clustering algorithm and the correction algorithm used in existing microblog topic information processing methods have low accuracy and cannot assign event-related microblogs to the correct topics. The method mines event topics and their relevant microblogs with an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, so that sentiment distribution statistics and analysis can finally be performed on the relevant microblogs. The method processes microblog topic information more accurately. The invention is applied to the field of microblog topic information processing.

Description

A Chinese microblog topic information processing method
Technical field
The present invention relates to a microblog topic information processing method.
Background art
Microblogs are an emerging social media platform and one of the most popular in China, with hundreds of millions of active users. A growing number of internet users obtain and share information that interests them on microblogs. Given the millions of microblogs posted every day, analyzing users' opinions of and attitudes toward a given event is meaningful work, and a growing number of researchers have begun to study the information behind this large volume of microblog data.
Because microblogs have been part of everyday life as a form of social media for only a short time, there has been little research, in China or abroad, on analyzing the causes of the sentiment distribution of microblog events. The main microblog event mining methods at the present stage are as follows. In 2011, Weng et al. applied wavelet transform theory to monitoring the frequencies of terms in microblog text, selected bursty words by analyzing their autocorrelation, and clustered them into emergent events (document [1]: Weng J, Lee B S. Event Detection in Twitter [J]. ICWSM, 2011, 11: 401-408). This method is reasonably effective for event monitoring but is susceptible to noise. To rank the hot terms in microblogs, Zhao et al. computed a probability value from information such as the forwarding rate of the microblogs containing a key term and the term frequency, and derived from it a ranking formula based on "interestingness" (document [2]: Zhao W X, Jiang J, He J, et al. Topical keyphrase extraction from Twitter [C] // Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011: 379-388). Spina et al. surveyed existing text extraction methods and performed topic extraction on a small annotated microblog corpus; surprisingly, the simplest extraction method, based on term frequency/inverse document frequency, achieved the best results, and the work also showed that noun filtering as a preprocessing step is effective for this task (document [3]: Spina D, Meij E, de Rijke M, et al. Identifying entity aspects in microblog posts [C] // Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2012: 1089-1090).
Compared with these coarser earlier efforts, the 2014 work of Abhimanyu Das and Anitha Kannan (document [4]: Das A, Kannan A. Discovering topical aspects in microblogs [C] // Proceedings of COLING, the 25th International Conference on Computational Linguistics: Technical Papers, 2014: 860-871) is considerably more thorough. To mine hot topics on Twitter, they observed the common properties of microblog events and derived three evaluation indices: "diversity", "uniqueness" and "burstiness". They fitted the data distribution with a Gaussian mixture model on a weakly labeled corpus and output whether a candidate aspect is a microblog event. This supervised topic extraction method also achieves good results, but the algorithm does not address the clustering and ranking of topics.
Microblog sentiment can be classified in various ways. Traditionally, classification methods divide sentiment coarsely into two classes, positive and negative, or finely into five classes, happiness, anger, sadness, fear and surprise (document [5]: Zhao Y, Qin B, Liu T, et al. Social sentiment sensor: a visualization system for topic detection and topic sentiment analysis on micro-blog [J]. Multimedia Tools and Applications, 2014, 22(1): 1-18). According to Rosenthal et al., a feature-extraction-based microblog sentiment classification algorithm used at SemEval 2015 (Semantic Evaluation) achieved world-leading results (document [6]: Rosenthal S, Nakov P, Kiritchenko S, et al. SemEval-2015 task 10: Sentiment analysis in Twitter [C] // Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval. 2015).
Summary of the invention
The present invention aims to solve the problem that the hierarchical clustering algorithm and the correction algorithm used in existing microblog topic information processing methods have low accuracy and cannot assign event-related microblogs to the correct topics, and accordingly proposes a more accurate microblog topic information processing method.
A Chinese microblog topic information processing method is implemented in the following steps:
Step 1: identifying the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to judge whether each microblog is relevant;
Step 2: discovering the key topics of the microblogs;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
Step 3: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first. Topic extraction means extracting and summarizing the topic information described by the microblogs; topic clustering and ranking means first clustering similar topics together;
(1) Hierarchical clustering algorithm
The Hashtag string similarity algorithm is adopted, that is, string similarity serves as the basis for computing distance during clustering. The formula is as follows:
$$\mathrm{Similarity}_{\mathrm{Hashtag}}(H_A, H_B) = \frac{\mathrm{Length}(\mathrm{LCS}(H_A, H_B))}{\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))} + \left(1 - \frac{\mathrm{EditDistance}(H_A, H_B)}{\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))}\right) \tag{2}$$
Here $H_A$ and $H_B$ are the Hashtag strings in $S_A$ and $S_B$, $S_A$ is microblog text A, $S_B$ is microblog text B, LCS is the longest common subsequence of the two strings, and EditDistance is the edit distance. The two string similarity values are normalized: the first and second parts of the formula are divided by $\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$ and $\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$ respectively;
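As a worked illustration of formula (2), consider two hypothetical strings that are not taken from the patent's data: $H_A$ = "abc" and $H_B$ = "abcde". Their longest common subsequence is "abc" (length 3) and their edit distance is 2 (two insertions), so:

$$\mathrm{Similarity}_{\mathrm{Hashtag}}(H_A, H_B) = \frac{3}{\min(3, 5)} + \left(1 - \frac{2}{\max(3, 5)}\right) = 1 + 0.6 = 1.6$$

Two identical non-empty strings score 2 and two strings with no character in common score 0, so under this reading the similarity ranges over [0, 2].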
(2) Topic cluster result ranking algorithm
A weighted combination of the microblog count and the number of topics in the cluster result is used as the ranking formula;
$$\mathrm{RankingScore}(\mathit{topic}) = \log(\mathit{topic}_{\mathit{weibonumber}}) \cdot \mathit{topic}_{\mathit{num}} \tag{4}$$
In the formula, RankingScore(topic) is the ranking score of the topic, $\mathit{topic}_{\mathit{weibonumber}}$ is the number of microblogs contained under the topic, and $\mathit{topic}_{\mathit{num}}$ is the number of topics merged into the cluster result; the logarithm of the microblog count is taken;
Step 4: microblog topic correction algorithm;
(1) Initial input: the K topics resulting from topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, in order of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;
(6) The loop repeats from step (2) until the adding rate of microblogs for the S topics falls below a threshold, at which point the loop ends;
(7) Final output: the S self-expanded topics and their relevant microblogs;
Step 5: evaluation with the precision-at-5 (P@5) index;
The P@5 index is used to reflect the quality of the algorithm's ranking results, and the average adding rate of the microblog count and the average hit rate of added microblogs are used as the evaluation indices of the microblog self-expansion algorithm;
The P@5 index is the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the reference answer; this is formula (5);
The average adding rate of the microblog count is the mean, over all topics, of the adding rate of each topic's relevant microblogs after self-expansion; this is formula (6);
The average hit rate of added microblogs is the ratio of the number of microblogs appended to an existing topic by the algorithm that are correct hits to the number of microblogs of the actual topic; this is formula (7);
Effects of the invention:
Members of the public show different sentiment distributions toward different topics of a single event. To address this, the present invention uses two methods, an unsupervised hierarchical clustering and ranking method and a semi-supervised microblog topic correction algorithm, to mine event topics and their relevant microblogs, and finally applies a mature Chinese microblog sentiment classification algorithm (Zekui Li, Yanyan Zhao, Bing Qin, et al. Feature Engineering for Chinese Micro-blog Sentiment Classification [J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 570-579) to compute and analyze the sentiment distribution of the relevant microblogs.
The topic clustering distance comparison experiment shows that the TF/IDF method's accuracy of 53.3% is far below the 78.7% of the Hashtag string similarity method. The topic cluster ranking comparison experiment shows that ranking by microblog count alone gives an accuracy of 66.7%, below the 78.7% obtained by ranking on the weighted combination of the microblog count and the number of topics in the cluster result. In the microblog event topic correction experiment, the average adding rate of the microblog count and the average hit rate of added microblogs were traded off and the number of iterations was set to 2, which gives the best combined result for the number of self-expanded microblogs and the hit rate of added microblogs.
Brief description of the drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a line chart of the change in sentiment distribution for the "MERS invades Guangdong" event from May 22, 2015 to June 4, 2015;
Fig. 3 is a flowchart of the microblog topic information processing algorithm;
Fig. 4 is a flowchart of the topic clustering and ranking algorithm;
Fig. 5 is a flowchart of the microblog topic correction model based on semi-supervised learning;
Fig. 6 shows the result of the microblog topic correction algorithm: average adding rate of the microblog count;
Fig. 7 shows the result of the microblog topic correction algorithm: average hit rate of added microblogs.
Detailed description of the embodiments
Embodiment 1: as shown in Fig. 1, a Chinese microblog topic information processing method comprises the following steps:
Step 1: identifying the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to judge whether each microblog is relevant. The language technology platform is the Harbin Institute of Technology Language Technology Platform (Che W, Li Z, Liu T. LTP: A Chinese language technology platform [C] // Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010: 13-16);
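The keyword-matching judgment of Step 1 can be illustrated with a minimal sketch. The patent does not state the matching rule, so plain substring matching is an assumption here, and the LTP preprocessing is omitted; the function names `is_relevant` and `filter_relevant` are hypothetical.

```python
def is_relevant(text, keywords):
    """Return True if the microblog text contains at least one event keyword.

    Assumption: relevance is decided by simple substring matching.
    """
    return any(keyword in text for keyword in keywords)


def filter_relevant(microblogs, keywords):
    """Keep only the microblogs judged relevant to the event."""
    return [text for text in microblogs if is_relevant(text, keywords)]
```

Substring matching works for Chinese text without word segmentation, which is one reason it is a plausible baseline, though the patented method may well match on segmented words instead.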
Step 2: discovering the microblog topics;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
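Hashtag counting as described in Step 2 can be sketched as follows. This is a minimal illustration, assuming a Hashtag is any non-empty run of characters enclosed by a pair of "#" symbols and that each microblog contributes at most one count per topic; the names `extract_hashtags` and `count_topics` are hypothetical.

```python
import re
from collections import Counter

# Text between two '#' symbols, e.g. "#topic#" on Chinese microblog platforms.
HASHTAG_RE = re.compile(r"#([^#]+)#")


def extract_hashtags(text):
    """Return the Hashtag strings found in one microblog, without the '#' marks."""
    return [tag.strip() for tag in HASHTAG_RE.findall(text) if tag.strip()]


def count_topics(microblogs):
    """Count, for each Hashtag, how many microblogs mention it."""
    counts = Counter()
    for text in microblogs:
        # set() so that a Hashtag repeated inside one microblog counts once.
        counts.update(set(extract_hashtags(text)))
    return counts
```

Calling `count_topics(posts).most_common(n)` then gives the n Hashtags with the most microblogs, which corresponds to ranking candidate topics by microblog count.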
Step 3: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first. Topic extraction means extracting and summarizing the topic information described by the microblogs; topic clustering and ranking means first clustering similar topics together;
The microblogs relevant to the hot event are input to the topic discovery module, which first extracts the corresponding topic titles by Hashtag matching and, through strategies such as ranking by microblog count and filtering, mines N topics. The topic clustering module then applies a hierarchical clustering algorithm (Kaufman L, Rousseeuw P J. Finding groups in data: an introduction to cluster analysis [M]. John Wiley & Sons, 2009) together with some filtering rules to merge the candidates into K topics, which, after passing through the ranking algorithm, serve as the input to key topic extraction;
(1) Hierarchical clustering algorithm
The TF/IDF similarity algorithm and the Hashtag string similarity algorithm were compared, and Hashtag string similarity was finally selected, that is, string similarity serves as the basis for computing distance during clustering. TF/IDF here is based on microblog term frequency/inverse document frequency. The two similarity formulas are as follows:
$$\mathrm{Similarity}_{\mathrm{TFIDF}}(S_A, S_B) = \mathrm{cosine}(\mathrm{TFIDF}(S_A), \mathrm{TFIDF}(S_B)) \tag{1}$$
$$\mathrm{Similarity}_{\mathrm{Hashtag}}(H_A, H_B) = \frac{\mathrm{Length}(\mathrm{LCS}(H_A, H_B))}{\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))} + \left(1 - \frac{\mathrm{EditDistance}(H_A, H_B)}{\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))}\right) \tag{2}$$
In the formulas, $S_A$ is microblog text A, $S_B$ is microblog text B, and the Hashtag strings in $S_A$ and $S_B$ are $H_A$ and $H_B$ respectively. cosine is the function that computes the cosine of the angle between two vectors, LCS is the longest common subsequence of two strings, and EditDistance is the edit distance. Formula (2) rests on the convention that the longer the longest common subsequence of the two Hashtag strings and the shorter their edit distance, the higher their similarity. To make the formula generally applicable, the two string similarity values are normalized: the first part of the formula is divided by the length of the shorter string, $\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$, and the second part by the length of the longer string, $\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$;
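Formula (2) can be sketched in code with standard dynamic-programming routines for the longest common subsequence and the Levenshtein edit distance. The patent does not say which edit operations are counted, so insertion, deletion and substitution at unit cost are assumed; returning 0.0 for an empty string is also an assumption made to avoid division by zero. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hashtag_similarity(h_a, h_b):
    """Formula (2): normalized LCS term plus normalized edit-distance term."""
    if not h_a or not h_b:
        return 0.0  # assumption: an empty Hashtag is similar to nothing
    lcs_term = lcs_length(h_a, h_b) / min(len(h_a), len(h_b))
    edit_term = 1 - edit_distance(h_a, h_b) / max(len(h_a), len(h_b))
    return lcs_term + edit_term
```

Both routines keep only two rows of the dynamic-programming table, which is sufficient because Hashtags are short and only the final value is needed.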
(2) Topic cluster result ranking algorithm
Formula (3), which ranks by microblog count alone, was compared with formula (4), which ranks by a weighted combination of the microblog count and the number of topics in the cluster result; formula (4), which takes more factors into account, was chosen as the ranking formula;
$$\mathrm{RankingScore}(\mathit{topic}) = \mathit{topic}_{\mathit{weibonumber}} \tag{3}$$
$$\mathrm{RankingScore}(\mathit{topic}) = \log(\mathit{topic}_{\mathit{weibonumber}}) \cdot \mathit{topic}_{\mathit{num}} \tag{4}$$
In the formulas, RankingScore(topic) is the ranking score of the topic, $\mathit{topic}_{\mathit{weibonumber}}$ is the number of microblogs contained under the topic, and $\mathit{topic}_{\mathit{num}}$ is the number of topics merged into the cluster result. To make the indices comparable, the logarithm of the microblog count is taken; formula (4) extends formula (3) by also considering the number of topics in a single cluster after clustering;
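The difference between formulas (3) and (4) can be seen in a short sketch. The base of the logarithm is not given in the text, so the natural logarithm is assumed; the tuple layout `(name, weibo_count, merged_topic_count)` and the function names are hypothetical.

```python
import math


def ranking_score(weibo_count, merged_topic_count):
    """Formula (4): log of the microblog count times the number of merged topics.

    Assumptions: natural logarithm, and weibo_count >= 1.
    """
    return math.log(weibo_count) * merged_topic_count


def rank_clusters(clusters):
    """Sort (name, weibo_count, merged_topic_count) tuples by formula (4), best first."""
    return sorted(clusters, key=lambda c: ranking_score(c[1], c[2]), reverse=True)
```

For example, a cluster with 400 microblogs merged from 3 topics scores about 17.97 and outranks a cluster with 1000 microblogs from a single topic (about 6.91), whereas formula (3) would rank them the other way round.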
Step 4: microblog topic correction algorithm;
(1) Initial input: the K topics resulting from topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are divided, in order of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;
(6) The loop repeats from step (2) until the loop termination condition is reached, namely that the adding rate of microblogs for the S topics falls below a threshold;
(7) Final output: the S self-expanded topics and their relevant microblogs;
Table 1 Feature templates for automatic key topic extraction
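The iterative structure of steps (1) to (7) can be sketched as a self-training loop. This is a simplified illustration and not the patented model: the classifier is a toy bag-of-words scorer standing in for the feature templates of Table 1, the negative-example set U2 is omitted, and the adding rate is assumed to be the number of newly added microblogs divided by the number of seed microblogs before the iteration. All names and default thresholds are hypothetical.

```python
from collections import Counter


def train(seed_topics):
    """Toy model: one bag-of-words profile per seed topic."""
    return {topic: Counter(w for post in posts for w in post.split())
            for topic, posts in seed_topics.items()}


def predict(model, post):
    """Return (best_topic, confidence): share of the post's words seen in that topic."""
    words = post.split()
    best_topic, best_conf = None, 0.0
    for topic, profile in model.items():
        conf = sum(1 for w in words if w in profile) / len(words) if words else 0.0
        if conf > best_conf:
            best_topic, best_conf = topic, conf
    return best_topic, best_conf


def self_expand(seed_topics, candidates, prob_threshold=0.6,
                rate_threshold=0.1, max_iter=10):
    """Grow the seed topics from the candidate pool until the adding rate drops."""
    seeds = {topic: list(posts) for topic, posts in seed_topics.items()}
    pool = list(candidates)  # plays the role of the to-be-predicted set U1
    for _ in range(max_iter):
        model = train(seeds)  # step (3): retrain on the current seed corpus
        total_before = sum(len(posts) for posts in seeds.values())
        remaining, added = [], 0
        for post in pool:  # steps (4) and (5): predict, then move confident posts
            topic, conf = predict(model, post)
            if topic is not None and conf > prob_threshold:
                seeds[topic].append(post)
                added += 1
            else:
                remaining.append(post)
        pool = remaining
        # step (6): stop once the adding rate falls below the threshold
        if total_before == 0 or added / total_before < rate_threshold:
            break
    return seeds, pool
```

The `max_iter` cap is a safeguard added for the sketch; it also makes it easy to fix the number of iterations when the adding rate and the hit rate need to be traded off.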
Step 5: evaluation with the precision-at-5 (P@5) index;
Finally, the P@5 index is used to reflect the quality of the algorithm's ranking results, and the average adding rate of the microblog count and the average hit rate of added microblogs are used as the evaluation indices of the microblog self-expansion algorithm;
The P@5 index is defined as the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the reference answer; this is formula (5);
The average adding rate of the microblog count is the mean, over all topics, of the adding rate of each topic's relevant microblogs after self-expansion; this is formula (6);
The average hit rate of added microblogs is the ratio of the number of microblogs appended to an existing topic by the algorithm that are correct hits to the number of microblogs of the actual topic; this is formula (7);
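The three evaluation indices can be sketched from their prose definitions. Formulas (5) to (7) are not reproduced in this text, so the code below is one possible reading rather than the patent's exact formulas: in particular, the hit rate is assumed to be the share of appended microblogs judged correct, averaged over topics, and the per-topic adding rate is assumed to be the relative growth in microblog count. All names are hypothetical.

```python
def precision_at_5(predicted, gold):
    """Share of the top-5 reference topics that appear in the top-5 predictions."""
    top_pred, top_gold = predicted[:5], gold[:5]
    if not top_gold:
        return 0.0
    return len(set(top_pred) & set(top_gold)) / len(top_gold)


def average_adding_rate(before_counts, after_counts):
    """Mean relative growth in microblog count per topic after self-expansion."""
    rates = [(after_counts[t] - before_counts[t]) / before_counts[t]
             for t in before_counts]
    return sum(rates) / len(rates)


def average_hit_rate(per_topic_hits):
    """Mean over topics of correct_added / total_added.

    per_topic_hits: list of (correct_added, total_added) pairs with total_added > 0.
    """
    rates = [correct / total for correct, total in per_topic_hits]
    return sum(rates) / len(rates)
```

Averaging per topic rather than pooling all microblogs keeps a single very large topic from dominating either rate, which matches the "average" wording of the two indices.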
Embodiment 2: this embodiment differs from Embodiment 1 in that, in Step 4, the threshold for the adding rate of microblogs for the S topics is 0.1.
Embodiment 3: this embodiment differs from Embodiment 1 or 2 in that, in Step 4, after the final output is obtained in step (7), steps (1) to (7) are executed again, with the S self-expanded topics and their relevant microblogs output by step (7) as the initial input.
Example 1:
When analyzing actual microblog events, it was found that the sentiment distribution of a hot microblog event changes over time. Take the "MERS invades Guangdong" event of May 2015, one of the hot microblog events. Its sentiment distribution line chart (from May 22 to June 4) is shown in Fig. 2, where five lines represent the changes in the five emotions of happiness, anger, sadness, fear and surprise. After the MERS virus first appeared in South Korea on May 22, microblogs about the event began to attract attention, and the event continued to ferment over the following days. In the figure, the distribution of users' emotions changes at time points such as May 26, May 29, May 30 and June 1.
Why do users show different sentiment distributions toward this event as time passes? Systematic analysis found the following. On May 26, the news reported "Korean MERS virus carrier confirmed in Guangdong", and the prevailing mood was fear plus sadness. On May 29, the rumor "condition of the virus patient in isolation treatment worsens" spread on microblogs, and users' mood shifted to predominantly fear. On May 30, the news exposed problems such as "supervision loopholes and weak technology in South Korea's medical authorities", and the mood shifted to predominantly anger. On June 1, major websites reported the moving story of "doctors and nurses of the Huizhou hospital ICU drawing lots for duty to keep the epidemic from spreading"; a wave of encouragement, support and praise rose across the country, and with it the proportion of happiness in the mood increased.
The analysis of the "MERS invades Guangdong" event shows that the sentiment distribution changing over time in fact represents users' different views on different aspects of the event. For example, the "fear" in the event's sentiment distribution is directed at the news "condition of the virus patient worsens", while the "anger" is directed at the topic "supervision loopholes in South Korea's medical authorities". The present invention defines the different discussion aspects of an event as the topics of a single event.
As shown in Fig. 3, a Chinese microblog topic information processing method comprises the following steps:
Step 1: identifying the microblogs relevant to a hot event;
The microblogs related to a single hot event are input, the text is preprocessed with a language technology platform, and keyword matching is used to judge whether each microblog is relevant;
The present invention collected 15 hot microblog events that occurred between May and July 2015 together with their relevant microblogs, as shown in Table 2;
Table 2 Part of the 15 hot microblog events used in the experiments
The second column of Table 2 is the name of the hot microblog event and the third column is the corresponding number of microblogs; the microblogs total 1,210,000.
Step 2: discovering the key topics of the microblogs;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
For each event described in Step 1, the 5 hottest topics representing different aspects of the event were manually annotated as the reference answer; some examples are shown in Table 3.
Table 3 Sample of annotated experimental data
Table 3 lists the "Qing'an County shooting incident" and its five annotated topics on different aspects.
Step 3: topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and clustering-ranking are performed first. Topic extraction means extracting and summarizing the topic information described by the microblogs; topic clustering and ranking means first clustering similar topics together. For example, the two topics "Ctrip official website hacked or man-made disaster" and "Ctrip server hacked or man-made disaster" are very similar. The merged topics are then ranked by how hotly they are discussed. The flowchart of this part of the algorithm is shown in Fig. 4.
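Merging similar topics such as the two Ctrip topics above can be sketched as threshold-based agglomerative clustering. The text does not specify the linkage criterion or the similarity threshold, so single linkage and the threshold value are assumptions; `word_jaccard` is only a stand-in similarity for demonstration and is not formula (2). All names are hypothetical.

```python
def word_jaccard(a, b):
    """Stand-in similarity for the demo: Jaccard overlap of whitespace-separated words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def cluster_topics(topics, similarity, threshold):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    Assumption: single linkage, i.e. cluster similarity is the highest
    pairwise similarity between their members.
    """
    clusters = [[topic] for topic in topics]
    while len(clusters) > 1:
        best = None  # (score, i, j) of the most similar pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = max(similarity(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Any callable that takes two strings and returns a number can be passed as `similarity`, so a Hashtag string similarity function can replace the stand-in without changing the clustering code. The size of each returned cluster corresponds to the number of merged topics used by the ranking formula.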
(1) Hierarchical clustering algorithm
The invention describes two distance calculation methods for hierarchical clustering: a similarity method based on microblog term frequency/inverse document frequency (formula 1) and a method based on Hashtag string similarity (formula 2). The P@5 results of the topic clustering distance comparison experiment are shown in Table 4:
Table 4 Comparison experiment of topic clustering distance calculation methods
As the results in Table 4 show, the accuracy of the TF/IDF method (the similarity method based on microblog term frequency/inverse document frequency) is far below that of the string similarity method. The microblogs relevant to a single event share much of their vocabulary (for example, words such as "Ctrip" and "server" appear frequently in the microblogs relevant to the "Ctrip official website hacked" event), which disturbs the clustering process and makes the optimal similarity threshold for hierarchical clustering hard to determine. In contrast, the Hashtag string similarity method, through the iterative process of hierarchical clustering, achieves good results in this task.
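The TF/IDF baseline of formula (1) can be sketched with plain dictionaries. The exact term weighting is not given in the text, so relative term frequency multiplied by log(N / document frequency) is an assumption, and the microblogs are assumed to be already segmented into tokens. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a word that occurs in every microblog of an event receives weight log(1) = 0, so documents that share only such words have zero similarity. This illustrates why heavy vocabulary overlap within one event leaves little signal for separating its topics.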
(2) Topic cluster result ranking algorithm
The invention carried out a comparison experiment on topic cluster result ranking algorithms, namely ranking by microblog count alone (formula 3) versus ranking by a weighted combination of the microblog count and the number of topics in the cluster result (formula 4). In this experiment the distance formula for hierarchical clustering was fixed to the string similarity method. The experimental results are shown in Table 5:
Table 5 Comparison experiment of topic cluster result ranking algorithms
As the results in Table 5 show, using only the number of microblogs in each cluster as the ranking index performs worse than the ranking method that also incorporates the number of topics in the cluster. The reason is that the number of topics merged into a single cluster also largely reflects how hotly the topic is discussed: the more topics a cluster result contains, the hotter its discussion.
Step 4: microblog topic correct algorithm;
Clustering algorithm tentatively can return the complete cluster result of a series of sequence, because cluster result mainly concentrates on the Hashtag character string case shell of microblogging, so the microblogging cluster that cluster easily makes Hashtag similar together.But in actual language material, there is a lot of Hashtag and the unmatched phenomenon of content of microblog, as shown in table 6:
In table 6 microblogging, Hashtag does not mate example with content
As can be seen from two microblogging examples of table 6, if simply microblogging corresponding for Hashtag is carried out emotion distribution statistics, the emotion maldistribution that topic can be caused corresponding is true, even occur that some obvious masses has and support there is negative emotions in the topic of mood, in such as, " the quiet haze video of bavin " in table 6, netizen holds negative emotions to " contaminating enterprises " but not to haze video.
The work that step 4 is done launched for above-mentioned phenomenon, target is that Hashtag information and the part microblogging of the theme contradiction that microblogging text is talked about are carried out topic correction (under being divided into correct theme by the microblogging that event is relevant), reach microblogging that topic is correlated with further from the effect expanded, this task definition is that microblog topic corrects task by the present invention, so algorithm on average can increase how many microbloggings to data set, the hit rate of the microblogging simultaneously added how about, it is the index evaluating this algorithm quality, the process flow diagram of this part algorithm as shown in Figure 5.
For assessing these two, devise the mean hit rate two indices of the average adding rate of microblogging number and additional microblogging herein.Due to the semi-supervised learning algorithm that algorithm is an iteration, so adopt broken line graph to carry out visual representation experimental result, transverse axis is iterations, and the longitudinal axis is respectively two indexs, as shown in Figure 6, Figure 7.
Experiment language material employs the microblogging event data introduced in step one.As shown in Figure 6, in algorithm iteration process, microblogging number adding rate from 11.6% increase to peak value 16.9% and reduce subsequently.By contrast experiment's data, its reason is the iteration initial stage, and the training microblogging increasing number added along with disaggregated model, causes the feature making model to increase, thus can predict more sample; Later stage, due to remaining forecast sample number deficiency, causes adding rate to reduce.
By carrying out sampling mark to the microblogging added, draw the experimental result of Fig. 7.Result show, add microblogging mean hit rate take turns from 88.5% to the 9 during first round iteration 44.5%, reduce always.Cause the reason of this result to be along with model iteration upgrades, corpus is constantly in increase on the one hand, and training noise is also constantly cumulative on the other hand, finally causes sorter classification capacity constantly to weaken.
Finally, the present invention trades off the average adding rate of the microblog number against the mean hit rate of the added microblogs and sets the iteration number to 2, which gives the best combined result for the number of self-expanded microblogs and the hit rate of the added microblogs.

Claims (3)

1. A Chinese microblog topic information processing method, characterized in that the processing method comprises the following steps:
Step one: determining the microblogs relevant to a hot event;
The microblogs relevant to a single hot event are input; the Language Technology Platform is used to preprocess the text, and keyword matching is used to judge whether each microblog is relevant;
Step two: discovering the key topics of the microblogs;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, where a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
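As an illustration of step two, the following minimal Python sketch extracts the text between two "#" symbols and counts how many microblogs carry each Hashtag. The function names `extract_hashtags` and `count_hashtags` and the sample posts are hypothetical and are not taken from the patent; counting each Hashtag once per microblog is also an assumption.

```python
import re
from collections import Counter

# A Chinese microblog Hashtag is the text enclosed by a pair of "#" symbols.
HASHTAG_PATTERN = re.compile(r"#([^#]+)#")


def extract_hashtags(text):
    """Return the non-empty Hashtags found in one microblog text."""
    return [tag.strip() for tag in HASHTAG_PATTERN.findall(text) if tag.strip()]


def count_hashtags(microblogs):
    """Count, for each Hashtag, the number of microblogs that contain it."""
    counts = Counter()
    for text in microblogs:
        # set() so that a Hashtag repeated inside one microblog counts once
        # (an assumption of this sketch).
        counts.update(set(extract_hashtags(text)))
    return counts


if __name__ == "__main__":
    posts = [
        "#MH370# praying for everyone on board",
        "repost #MH370# #search update# latest news",
        "a microblog without any topic",
    ]
    for tag, n in count_hashtags(posts).most_common():
        print(tag, n)
```

Sorting the counts with `most_common()` gives a simple first view of which topics dominate the event.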
Step three: the topic clustering and ranking algorithm;
After the microblogs relevant to the hot event are obtained, topic extraction and cluster ranking are performed first, where topic extraction means extracting and summarizing the topic information described by the microblogs, and cluster ranking of topics means first clustering the topics that are partly similar;
(1) hierarchical clustering algorithm
The Hashtag string similarity algorithm is adopted, that is, string similarity serves as the basis for computing distance during clustering. The formula is as follows:
$$\mathrm{Similarity}_{Hashtag}(H_A, H_B) = \frac{\mathrm{Length}(\mathrm{LCS}(H_A, H_B))}{\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))} + \left(1 - \frac{\mathrm{EditDistance}(H_A, H_B)}{\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))}\right) \qquad (2)$$
where $H_A$ and $H_B$ are the Hashtag strings in $S_A$ and $S_B$, $S_A$ is microblog text A, $S_B$ is microblog text B, LCS is the longest common subsequence of the two strings, and EditDistance is the edit distance. Both string-similarity terms are normalized: the first and second parts of the formula are divided by $\min(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$ and $\max(\mathrm{Length}(H_A), \mathrm{Length}(H_B))$, respectively;
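Formula (2) can be sketched in Python as follows. The claim does not say which edit distance is meant; this sketch assumes the Levenshtein distance with unit costs, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: insert, delete and substitute each cost 1
    (assumed; the source only says "edit distance")."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hashtag_similarity(h_a, h_b):
    """Formula (2): normalized LCS term plus normalized edit-distance term."""
    if not h_a or not h_b:
        raise ValueError("Hashtag strings must be non-empty")
    lcs_term = lcs_length(h_a, h_b) / min(len(h_a), len(h_b))
    edit_term = 1 - edit_distance(h_a, h_b) / max(len(h_a), len(h_b))
    return lcs_term + edit_term
```

Under these assumptions the score lies between 0 (no characters in common) and 2 (identical strings), so a hierarchical clustering step would need to turn it into a distance, for example by subtracting it from 2; that conversion is not specified in the source.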
(2) topic cluster result sort algorithm
A weighted combination of the microblog number and the number of topics in the cluster result is adopted as the ranking formula;
$$\mathrm{RankingScore}(topic) = \log(topic_{weibonumber}) \cdot topic_{num} \qquad (4)$$
In the formula, RankingScore(topic) is the ranking score of the topic, $topic_{weibonumber}$ is the number of microblogs contained under the topic, and $topic_{num}$ is the number of topics merged in the cluster result; the microblog number is log-transformed;
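A minimal Python sketch of formula (4) follows. The base of the logarithm is not stated in the source; the natural logarithm is assumed here, and the tuple layout `(label, weibo_number, topic_num)` is a hypothetical choice for illustration.

```python
import math


def ranking_score(weibo_number, topic_num):
    """Formula (4): log of the microblog count times the merged-topic count.

    The natural logarithm is an assumption; the source only writes "log".
    """
    if weibo_number < 1:
        raise ValueError("a topic cluster must contain at least one microblog")
    return math.log(weibo_number) * topic_num


def rank_clusters(clusters):
    """Sort (label, weibo_number, topic_num) tuples by descending score."""
    return sorted(clusters, key=lambda c: ranking_score(c[1], c[2]), reverse=True)
```

Because of the logarithm, a cluster that merges several Hashtags can outrank a single Hashtag with many more microblogs, which matches the stated intent of dampening the raw microblog count.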
Step four: the microblog topic correction algorithm;
(1) Initial input: the K results of topic cluster ranking in total, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; according to the rank of their similarity to the S topics, the U topics are divided into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are at the same time deleted from the to-be-predicted set U1;
(6) The procedure loops from step (2) until the adding rate of the microblogs of the S topics falls below a threshold, at which point the loop ends;
(7) Final output: the S self-expanded topics and their relevant microblogs;
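The loop in steps (1) to (7) can be sketched as a self-training procedure. The sketch below is illustrative only: the word-count classifier is a toy stand-in for the unspecified feature extraction and model, texts are assumed to be already segmented into space-separated words, the probability threshold 0.8 is hypothetical, and the division of U into U1 and U2 is taken as given input. The stopping threshold 0.1 and the iteration cap of 2 follow the values stated in the document.

```python
from collections import Counter

NEGATIVE = "__negative__"  # hypothetical label for the negative-example set U2


def train(seeds, negatives):
    """Toy stand-in model: one word-count profile per class."""
    classes = dict(seeds)
    classes[NEGATIVE] = negatives
    return {label: Counter(w for text in texts for w in text.split())
            for label, texts in classes.items()}


def predict(model, text):
    """Return (best_label, probability) from normalized overlap scores."""
    words = text.split()
    scores = {label: sum(profile[w] for w in words)
              for label, profile in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def expand_topics(seeds, to_predict, negatives,
                  prob_threshold=0.8, rate_threshold=0.1, max_iterations=2):
    """Self-expand the seed topics with confidently classified microblogs."""
    seeds = {topic: list(texts) for topic, texts in seeds.items()}  # copy
    pending = list(to_predict)                                      # set U1
    for _ in range(max_iterations):
        model = train(seeds, negatives)                             # step (3)
        before = {topic: len(texts) for topic, texts in seeds.items()}
        added = {topic: 0 for topic in seeds}
        remaining = []
        for text in pending:                                        # step (4)
            label, prob = predict(model, text)
            if label in seeds and prob > prob_threshold:            # step (5)
                seeds[label].append(text)
                added[label] += 1
            else:
                remaining.append(text)
        pending = remaining
        rates = [added[t] / before[t] for t in seeds if before[t] > 0]
        mean_rate = sum(rates) / len(rates) if rates else 0.0
        if mean_rate < rate_threshold:                              # step (6)
            break
    return seeds, pending                                           # step (7)
```

Any classifier that returns a class probability could replace `train` and `predict`; the control flow of the loop would stay the same.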
Step five: evaluation with the accuracy rate 5 index;
The accuracy rate 5 index is used to reflect the quality of the ranking results of the algorithm, and the average adding rate of the microblog number and the mean hit rate of the added microblogs are used as the evaluation indices of the microblog self-expansion algorithm;
The accuracy rate 5 index is the ratio of the number of correctly predicted topics among the top 5 ranking results to the number of topics in the top 5 standard answers, as given by formula (5);
The average adding rate of the microblog number is the mean, over the topics, of the adding rate of each topic's relevant microblogs after expansion, as given by formula (6);
The mean hit rate of the added microblogs is the ratio of the number of microblogs that the algorithm correctly appends to an existing topic to the number of microblogs of the current topic, as given by formula (7).
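The three indices can be sketched in Python from their verbal definitions. Formulas (5) to (7) themselves are not reproduced in this text, so the code below is one reading of the wording rather than the patent's exact formulas. In particular, the hit rate is interpreted here as correctly added microblogs divided by all microblogs added to that topic, averaged over topics; the function names and the dictionary inputs are hypothetical.

```python
def accuracy_at_5(ranked_topics, gold_topics):
    """Correct topics among the top 5 predictions, divided by the number
    of topics in the top 5 standard answers."""
    top_pred = ranked_topics[:5]
    top_gold = set(gold_topics[:5])
    if not top_gold:
        raise ValueError("the standard answer list must not be empty")
    correct = sum(1 for topic in top_pred if topic in top_gold)
    return correct / len(top_gold)


def average_adding_rate(before_counts, added_counts):
    """Mean over topics of (microblogs added) / (microblogs before expansion)."""
    rates = [added_counts.get(topic, 0) / n
             for topic, n in before_counts.items() if n > 0]
    if not rates:
        raise ValueError("no topic has any microblogs")
    return sum(rates) / len(rates)


def mean_hit_rate(added_counts, correct_counts):
    """Mean over topics of (correctly added) / (added); topics that
    received no microblogs are skipped (an assumption of this sketch)."""
    rates = [correct_counts.get(topic, 0) / n
             for topic, n in added_counts.items() if n > 0]
    if not rates:
        raise ValueError("no microblogs were added")
    return sum(rates) / len(rates)
```

In practice the correct counts would come from manual annotation of a sample of the added microblogs, as the description of Figure 7 indicates.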
2. The Chinese microblog topic information processing method according to claim 1, characterized in that the threshold on the adding rate of the microblogs of the S topics in step four is 0.1.
3. The Chinese microblog topic information processing method according to claim 1 or 2, characterized in that, in step four, after step (7) produces the final output, steps (1) to (7) are executed again, with the S self-expanded topics and their relevant microblogs output by step (7) as the initial input.
CN201510627783.0A 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method Active CN105354216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510627783.0A CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510627783.0A CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Publications (2)

Publication Number Publication Date
CN105354216A true CN105354216A (en) 2016-02-24
CN105354216B CN105354216B (en) 2018-09-07

Family

ID=55330189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510627783.0A Active CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Country Status (1)

Country Link
CN (1) CN105354216B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740466A (en) * 2016-03-04 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for excavating incidence relation between hotspot concepts
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs
CN106503064A (en) * 2016-09-29 2017-03-15 中国国防科技信息中心 A kind of generation method of self adaptation microblog topic summary
CN109255121A (en) * 2018-07-27 2019-01-22 中山大学 A kind of across language biomedicine class academic paper information recommendation method based on theme class
CN109299280A (en) * 2018-12-12 2019-02-01 河北工程大学 Short text clustering analysis method, device and terminal device
CN110020147A (en) * 2017-11-29 2019-07-16 北京京东尚科信息技术有限公司 Model generates, method for distinguishing, system, equipment and storage medium are known in comment
CN110795943A (en) * 2019-09-25 2020-02-14 中国科学院计算技术研究所 Topic representation generation method and system for event
CN110852076A (en) * 2019-10-12 2020-02-28 云知声智能科技股份有限公司 Method and device for automatic disease code conversion
CN111046282A (en) * 2019-12-06 2020-04-21 贝壳技术有限公司 Text label setting method, device, medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706791A (en) * 2009-09-17 2010-05-12 成都康赛电子科大信息技术有限责任公司 User preference based data cleaning method
US8180775B2 (en) * 2010-06-23 2012-05-15 National Central University Computer-implemented method for clustering data and computer-readable medium encoded with computer program to execute thereof
CN103365910A (en) * 2012-04-06 2013-10-23 腾讯科技(深圳)有限公司 Method and system for information retrieval
CN104516947A (en) * 2014-12-03 2015-04-15 浙江工业大学 Chinese microblog emotion analysis method fused with dominant and recessive characters

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
朱玺等: ""基于半监督学习的微博情感倾向性分析"", 《山东大学学报(理学版)》 *
李泽魁等: ""中文微博情感倾向性分析特征工程"", 《山西大学学报(自然科学版)》 *
赵妍妍等: ""文本情感分析"", 《JOURNAL OF SOFTWARE》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740466A (en) * 2016-03-04 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for excavating incidence relation between hotspot concepts
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs
CN105868415B (en) * 2016-05-06 2019-08-09 黑龙江工程学院 A kind of microblogging real time filtering model based on historical weibo
CN106503064A (en) * 2016-09-29 2017-03-15 中国国防科技信息中心 A kind of generation method of self adaptation microblog topic summary
CN106503064B (en) * 2016-09-29 2019-07-02 中国国防科技信息中心 A kind of generation method of adaptive microblog topic abstract
CN110020147A (en) * 2017-11-29 2019-07-16 北京京东尚科信息技术有限公司 Model generates, method for distinguishing, system, equipment and storage medium are known in comment
CN109255121A (en) * 2018-07-27 2019-01-22 中山大学 A kind of across language biomedicine class academic paper information recommendation method based on theme class
CN109299280A (en) * 2018-12-12 2019-02-01 河北工程大学 Short text clustering analysis method, device and terminal device
CN109299280B (en) * 2018-12-12 2020-09-29 河北工程大学 Short text clustering analysis method and device and terminal equipment
CN110795943A (en) * 2019-09-25 2020-02-14 中国科学院计算技术研究所 Topic representation generation method and system for event
CN110795943B (en) * 2019-09-25 2021-10-08 中国科学院计算技术研究所 Topic representation generation method and system for event
CN110852076A (en) * 2019-10-12 2020-02-28 云知声智能科技股份有限公司 Method and device for automatic disease code conversion
CN110852076B (en) * 2019-10-12 2023-05-30 云知声智能科技股份有限公司 Method and device for automatic disease code conversion
CN111046282A (en) * 2019-12-06 2020-04-21 贝壳技术有限公司 Text label setting method, device, medium and electronic equipment
CN111046282B (en) * 2019-12-06 2021-04-16 北京房江湖科技有限公司 Text label setting method, device, medium and electronic equipment

Also Published As

Publication number Publication date
CN105354216B (en) 2018-09-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210425

Address after: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin jizuo technology partnership (L.P.)

Patentee after: Harbin Institute of Technology Asset Management Co.,Ltd.

Address before: 150001 Harbin, Nangang, West District, large straight street, No. 92

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20210611

Address after: Room 206-12, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.

Address before: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee before: Harbin jizuo technology partnership (L.P.)

Patentee before: Harbin Institute of Technology Asset Management Co.,Ltd.