CN105354216B - A Chinese microblog topic information processing method - Google Patents

Info

Publication number
CN105354216B
Authority
CN
China
Prior art keywords
topic
microblogging
microblog
algorithm
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510627783.0A
Other languages
Chinese (zh)
Other versions
CN105354216A (en)
Inventor
赵妍妍
秦兵
李泽魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201510627783.0A
Publication of CN105354216A
Application granted
Publication of CN105354216B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Abstract

A Chinese microblog topic information processing method. The present invention relates to algorithms for analyzing the causes of the emotion distribution of microblog events. It addresses the low accuracy of the hierarchical clustering and correction algorithms used in current microblog topic information processing methods, which prevents event-related microblog posts from being assigned to the correct topic. The invention mines event topics and their related microblog posts using two methods, a hierarchical clustering and ranking method based on unsupervised learning and a microblog topic correction algorithm based on semi-supervised learning, and finally carries out emotion distribution statistics and analysis on the related posts. The invention can process microblog topic information more accurately and is applied to the field of microblog topic information processing.

Description

A Chinese microblog topic information processing method
Technical field
The present invention relates to microblog topic information processing methods.
Background art
Microblogging, as an emerging social media platform and the most popular one in China, has hundreds of millions of active users, and a growing number of netizens choose to obtain and share information of interest on it. Faced with millions of new microblog posts every day, analyzing netizens' opinions and attitudes toward a given event is meaningful work, and more and more researchers have begun to study the information behind big data such as microblogs.
Because microblogging entered people's lives as a form of social media only recently, there is not much research, in China or abroad, on analyzing the causes of the emotion distribution of microblog events. Current microblog event mining methods mainly include the following. In 2011, Weng et al. applied wavelet-transform theory to monitor the frequency of certain terms in microblog text, selected bursty words by analyzing their autocorrelation, and clustered them into emergent events (document [1]: Weng J, Lee B S. Event Detection in Twitter[J]. ICWSM, 2011, 11: 401-408). This method is somewhat effective for event monitoring but is easily disturbed by noise. Zhao et al., in order to rank hot terms in microblogs, computed a probability value from information such as the retweet rate and term frequency of the posts containing key terms, and derived from it a ranking formula based on "interestingness" (document [2]: Zhao W X, Jiang J, He J, et al. Topical keyphrase extraction from twitter[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011: 379-388). Spina et al. enumerated existing text extraction approaches and performed topic extraction on a small annotated microblog corpus; surprisingly, the simplest extraction method, based on term frequency/inverse document frequency, obtained the best results, and they also showed that noun-filtering preprocessing is effective for this task (document [3]: Spina D, Meij E, de Rijke M, et al. Identifying entity aspects in microblog posts[C]//Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2012: 1089-1090). Compared with this relatively coarse earlier work, the 2014 work of Abhimanyu and Anitha (document [4]: Das A, Kannan A. Discovering topical aspects in microblogs[C]//Proceedings of COLING. The 25th International Conference on Computational Linguistics: Technical Papers, 2014: 860-871) is much more thorough. To mine hot topics on Twitter, they observed the common properties of microblog events and derived three evaluation indicators, "diversity", "uniqueness" and "burstiness". Using a weakly labeled training corpus, they fitted the data distribution with a Gaussian mixture model that outputs whether a candidate aspect is a microblog event. This supervised topic extraction method also achieves good results, but the algorithm does not address the clustering and ranking of topics.
Microblog sentiment can be classified in various ways. For example, conventional coarse-grained classification divides it into two classes, positive and negative, while fine-grained classification divides it into five classes: happiness, anger, sadness, fear and surprise (document [5]: Zhao Y, Qin B, Liu T, et al. Social sentiment sensor: a visualization system for topic detection and topic sentiment analysis on micro-blog[J]. Multimedia Tools and Applications, 2014, 22(1): 1-18). At SemEval 2015 (Semantic Evaluation), a feature-extraction-based microblog sentiment classification algorithm reported by Rosenthal et al. reached the international state of the art (document [6]: Rosenthal S, Nakov P, Kiritchenko S, et al. Semeval-2015 task 10: Sentiment analysis in twitter[C]//Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval. 2015).
Summary of the invention
The present invention addresses the low accuracy of the hierarchical clustering and correction algorithms used in current microblog topic information processing methods, which prevents event-related microblog posts from being assigned to the correct topic, and proposes a more accurate microblog topic information processing method.
A Chinese microblog topic information processing method is implemented according to the following steps:
Step 1: Determine the microblog posts related to a hot event;
Input the microblog posts related to a single hot event, preprocess the text using a language technology platform, and judge whether each post is relevant by a keyword matching method;
Step 2: Discover the key topics of the microblog posts;
Mine the topic information in the hot-event microblog posts by counting the Hashtag information in them, where a Hashtag is topic information, i.e. the text between two "#" symbols in a microblog post;
Step 3: Topic clustering and ranking algorithm;
After the microblog posts related to the hot event are obtained, topic extraction and clustering-ranking are carried out first. Topic extraction means extracting and summarizing the topic information in the posts; topic clustering-ranking means first clustering topics that are partly similar;
(1) Hierarchical clustering algorithm
The Hashtag string similarity algorithm is used, i.e. string similarity serves as the basis for distance calculation in clustering. The calculation formula is as follows:
where $H_A$ and $H_B$ are the Hashtag strings in $S_A$ and $S_B$, $S_A$ is microblog text A, $S_B$ is microblog text B, LCS is the longest common subsequence of the two strings, and Edit Distance is the edit distance. The similarity value is normalized: the first and second parts of the formula are divided by $\min(Length(H_A), Length(H_B))$ and $\max(Length(H_A), Length(H_B))$ respectively;
(2) Topic cluster ranking algorithm
The weighted relation between the microblog count and the number of topics in the cluster result is used as the ranking formula;
$$RankingScore(topic) = \log(topic_{weibonumber}) \cdot topic_{num} \quad (4)$$
In the formula, $RankingScore(topic)$ is the ranking score of the topic, $topic_{weibonumber}$ is the number of microblog posts contained under the topic, and $topic_{num}$ is the number of topics merged into the result; the logarithm of the microblog count is taken;
Step 4: Microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, consisting of the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are ranked by their similarity to the S topics and split into a to-be-predicted set U1 and a training negative-example set U2;
(3) Perform feature extraction and model training on the corpus of the S topics;
(4) Use the trained model to predict on the non-seed to-be-predicted set U1;
(5) Microblog posts in U1 whose classification probability exceeds a threshold are added directly to the corresponding seed topic and removed from U1;
(6) Loop back to step (2) until the termination condition is met, namely that the append rate of microblog posts for the S topics falls below a threshold;
(7) Final output: the S topics and their self-expanded related microblog posts;
Step 5: Evaluate using the precision@5 (P@5) index;
The P@5 index is used to reflect the quality of the algorithm's ranking results, and the average append rate of microblog count and the average hit rate of appended microblog posts are used as evaluation indices for the microblog self-expansion algorithm;
P@5 is the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the standard answer, i.e. formula (5):
The average append rate of microblog count is the mean, over topics, of the append rate of each topic's related microblog posts after self-expansion, i.e. formula (6):
The average hit rate of appended microblog posts is the ratio of the number of posts that the algorithm appended to an existing topic and that are correct hits to the number of microblog posts of the current topic, i.e. formula (7):
Effects of the invention:
In view of the phenomenon that the public shows different emotion distributions toward different topics of a single event, the present invention uses two methods, a hierarchical clustering and ranking method based on unsupervised learning and a microblog topic correction algorithm based on semi-supervised learning, to mine event topics and their related microblog posts. It then uses a mature Chinese microblog sentiment classification algorithm (Zekui Li, Yanyan Zhao, Bing Qin, et al. Feature Engineering for Chinese Micro-blog Sentiment Classification[J]. Journal of Shanxi University (Natural Science Edition), 2014, 37(4): 570-579) to carry out emotion distribution statistics and analysis on the related posts.
The topic clustering distance comparison experiment shows that the accuracy of the TF/IDF method, 53.3%, is far below the 78.7% of the method based on Hashtag string similarity. The topic cluster ranking comparison experiment shows that ranking by microblog count alone gives an accuracy of 66.7%, below the 78.7% obtained by ranking on the weighted relation between microblog count and the number of topics in the cluster result. In the microblog event topic correction experiment, the invention trades off the average append rate of microblog count against the average hit rate of appended posts and sets the number of iterations to 2, which gives the best combined result for the number of self-expanded posts and the hit rate of appended posts.
Description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a line chart of the emotion distribution of the "MERS invades Guangdong" event from May 22, 2015 to June 4, 2015;
Fig. 3 is a flow chart of the microblog topic information processing algorithm;
Fig. 4 is a flow chart of the topic clustering and ranking algorithm;
Fig. 5 is a flow chart of the microblog topic correction model based on semi-supervised learning;
Fig. 6 shows the results of the microblog topic correction algorithm: average append rate of microblog count;
Fig. 7 shows the results of the microblog topic correction algorithm: average hit rate of appended microblog posts.
Detailed description of the embodiments
Specific embodiment one: As shown in Fig. 1, a Chinese microblog topic information processing method comprises the following steps:
Step 1: Determine the microblog posts related to a hot event;
Input the microblog posts related to a single hot event, preprocess the text using a language technology platform, and judge whether each post is relevant by a keyword matching method. The language technology platform is the Harbin Institute of Technology Language Technology Platform (Che W, Li Z, Liu T. LTP: A Chinese language technology platform[C]//Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, 2010: 13-16);
Step 2: Microblog topic discovery;
Mine the topic information in the hot-event microblog posts by counting the Hashtag information in them, where a Hashtag is topic information, i.e. the text between two "#" symbols in a microblog post;
Step 3: Topic clustering and ranking algorithm;
After the microblog posts related to the hot event are obtained, topic extraction and clustering-ranking are carried out first. Topic extraction means extracting and summarizing the topic information in the posts; topic clustering-ranking means first clustering topics that are partly similar;
The hot-event-related microblog posts are input to the topic discovery module, which first extracts the corresponding topic titles by Hashtag matching and mines N topics using strategies such as ranking and filtering by microblog count. The topic clustering module then applies a hierarchical clustering (Hierarchical Cluster) algorithm (Kaufman L, Rousseeuw P J. Finding groups in data: An introduction to cluster analysis[M]. John Wiley & Sons, 2009) together with some filtering rules to obtain K candidate merged topics, which are passed through the ranking algorithm and serve as the input to key-topic extraction;
(1) Hierarchical clustering algorithm
After comparing the TF/IDF similarity algorithm with the Hashtag string similarity algorithm, the Hashtag string similarity was selected as the basis for distance calculation in clustering, where TF/IDF is the term frequency/inverse document frequency computed over the microblog posts. The two similarity formulas are as follows:
$$Similarity_{TFIDF}(S_A, S_B) = cosine(TFIDF(S_A), TFIDF(S_B)) \quad (1)$$
In the formulas, $S_A$ is microblog text A, $S_B$ is microblog text B, and $H_A$ and $H_B$ are the Hashtag strings in $S_A$ and $S_B$ respectively; cosine is the function that computes the cosine of the angle between two vectors, LCS is the longest common subsequence of two strings, and Edit Distance is the edit distance. Formula (2) rests on the convention that the longer the longest common subsequence of the two Hashtag strings $H_A$ and $H_B$ and the shorter their edit distance, the higher their similarity. To make the formula general, the similarity value is normalized: the first and second parts of the formula are divided, respectively, by the length of the shorter string, $\min(Length(H_A), Length(H_B))$, and by the length of the longer string, $\max(Length(H_A), Length(H_B))$;
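Formula (2) itself does not survive in this text. One form consistent with the description above is sketched here as an assumption: the LCS length is normalized by the shorter string length, the edit distance by the longer string length, and the two parts are combined by subtraction, so that a longer common subsequence raises the score and a larger edit distance lowers it. The subtraction and the absence of extra weights are guesses, not a confirmed reading of the patent.

```latex
Similarity_{Hashtag}(S_A, S_B) =
  \frac{\left| LCS(H_A, H_B) \right|}{\min(Length(H_A), Length(H_B))}
  - \frac{EditDistance(H_A, H_B)}{\max(Length(H_A), Length(H_B))}
  \quad (2)
```

Under this assumed form, two identical Hashtag strings score 1, and the score falls as the strings diverge.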
(2) Topic cluster ranking algorithm
Formula (3), which ranks by microblog count alone, was compared with formula (4), which ranks by a weighted relation between the microblog count and the number of topics in the cluster result; formula (4), which takes more factors into account, was chosen as the ranking formula;
$$RankingScore(topic) = topic_{weibonumber} \quad (3)$$
$$RankingScore(topic) = \log(topic_{weibonumber}) \cdot topic_{num} \quad (4)$$
In the formulas, $RankingScore(topic)$ is the ranking score of the topic, $topic_{weibonumber}$ is the number of microblog posts contained under the topic, and $topic_{num}$ is the number of topics merged into the result. To make the two quantities comparable, the logarithm of the microblog count is taken; formula (4) extends formula (3) by also considering the number of topics within a single cluster after clustering;
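The two ranking formulas can be compared with a short sketch. The function names and the sample clusters are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from typing import Dict, List, Tuple


def score_by_count(weibo_number: int) -> float:
    """Formula (3): rank by the number of posts under the topic."""
    return float(weibo_number)


def score_weighted(weibo_number: int, topic_num: int) -> float:
    """Formula (4): log of the post count times the number of merged topics.

    Assumes weibo_number >= 1; the log base (natural) is an assumption.
    """
    return math.log(weibo_number) * topic_num


def rank(clusters: Dict[str, Tuple[int, int]], weighted: bool = True) -> List[str]:
    """Return cluster names sorted by descending score.

    clusters maps a name to (weibo_number, topic_num).
    """
    def key(name: str) -> float:
        weibo_number, topic_num = clusters[name]
        if weighted:
            return score_weighted(weibo_number, topic_num)
        return score_by_count(weibo_number)

    return sorted(clusters, key=key, reverse=True)


# Hypothetical clusters: A has more posts, B merges more hashtags.
sample = {"A": (1000, 1), "B": (400, 3)}
print(rank(sample, weighted=False))  # count-only ranking
print(rank(sample, weighted=True))   # weighted ranking
```

With these sample values the count-only ranking puts A first, while the weighted ranking puts B first because three merged topics outweigh the smaller post count once the count is log-scaled.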
Step 4: Microblog topic correction algorithm;
(1) Initial input: the K results of topic clustering and ranking, consisting of the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics "non-seed topics"; the U topics are ranked by their similarity to the S topics and split into a to-be-predicted set U1 and a training negative-example set U2;
(3) Perform feature extraction and model training on the corpus of the S topics;
(4) Use the trained model to predict on the non-seed to-be-predicted set U1;
(5) Microblog posts in U1 whose classification probability exceeds a threshold are added directly to the corresponding seed topic and removed from U1;
(6) Loop back to step (2) until the loop termination condition is met, namely that the append rate of microblog posts for the S topics falls below a threshold;
(7) Final output: the S topics and their self-expanded related microblog posts;
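The loop in steps (1) to (7) can be sketched as a self-training procedure. This is a minimal illustration under stated assumptions, not the patent's implementation: the word-overlap "model" stands in for the unspecified feature extraction and classifier, the negative-example set U2 is omitted, the per-iteration append rate is taken as posts added divided by seed posts before the iteration, and `prob_threshold` is a hypothetical value. The default `rate_threshold=0.1` follows the append-rate threshold given in specific embodiment two.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train(seed_posts: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Toy stand-in for feature extraction + training: per-topic word counts."""
    return {topic: Counter(w for post in posts for w in post.split())
            for topic, posts in seed_posts.items()}


def predict(model: Dict[str, Counter], post: str) -> Tuple[str, float]:
    """Return the best topic and a pseudo-probability (share of known words)."""
    words = post.split()
    best_topic, best_score = "", 0.0
    for topic, counts in model.items():
        score = sum(1 for w in words if w in counts) / max(len(words), 1)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic, best_score


def self_expand(seed_posts: Dict[str, List[str]], u1: List[str],
                prob_threshold: float = 0.6, rate_threshold: float = 0.1,
                max_iter: int = 10) -> Tuple[Dict[str, List[str]], List[str]]:
    """Iteratively move confident posts from U1 into the seed topics."""
    seeds = {topic: list(posts) for topic, posts in seed_posts.items()}
    pending = list(u1)
    for _ in range(max_iter):
        model = train(seeds)                       # step (3)
        before = sum(len(p) for p in seeds.values())
        remaining, added = [], 0
        for post in pending:                       # step (4)
            topic, prob = predict(model, post)
            if topic and prob > prob_threshold:    # step (5)
                seeds[topic].append(post)
                added += 1
            else:
                remaining.append(post)
        pending = remaining
        if before == 0 or added / before < rate_threshold:  # step (6)
            break
    return seeds, pending                          # step (7)


# Hypothetical toy data.
seeds = {"hack": ["site hacked today", "server hacked again"],
         "haze": ["haze video released"]}
expanded, left = self_expand(seeds, ["site hacked again", "haze video",
                                     "totally unrelated words"])
print({t: len(p) for t, p in expanded.items()}, left)
```

A real system would replace `train` and `predict` with a proper classifier that returns calibrated probabilities and would use U2 as negative training examples.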
Table 1: Feature templates for automatic key-topic extraction
Step 5: Evaluate using the precision@5 (P@5) index;
Finally, the P@5 index is used to reflect the quality of the algorithm's ranking results, and the average append rate of microblog count and the average hit rate of appended microblog posts are used as evaluation indices for the microblog self-expansion algorithm;
P@5 is defined as the ratio of the number of correctly predicted topics among the top 5 of the ranking result to the number of topics in the top 5 of the standard answer, i.e. formula (5):
The average append rate of microblog count is the mean, over topics, of the append rate of each topic's related microblog posts after self-expansion, i.e. formula (6):
The average hit rate of appended microblog posts is the ratio of the number of posts that the algorithm appended to an existing topic and that are correct hits to the number of microblog posts of the current topic, i.e. formula (7):
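Formulas (5) to (7) are not reproduced in this text, so the sketch below only illustrates one reading of the verbal definitions. All function names are hypothetical. Two points are assumptions: a topic's append rate is taken as appended posts divided by the topic's posts before expansion, and the hit-rate denominator is taken as the number of appended posts checked for that topic.

```python
from typing import Dict, Sequence


def precision_at_5(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of the gold top-5 topics that appear in the predicted top 5."""
    top_pred, top_gold = predicted[:5], gold[:5]
    if not top_gold:
        return 0.0
    return len(set(top_pred) & set(top_gold)) / len(top_gold)


def mean_ratio(numerators: Dict[str, int], denominators: Dict[str, int]) -> float:
    """Mean over topics of numerator/denominator, skipping empty topics."""
    ratios = [numerators.get(topic, 0) / count
              for topic, count in denominators.items() if count > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def average_append_rate(appended: Dict[str, int], original: Dict[str, int]) -> float:
    """Assumed reading of (6): mean of appended/original per topic."""
    return mean_ratio(appended, original)


def average_hit_rate(correct: Dict[str, int], appended: Dict[str, int]) -> float:
    """Assumed reading of (7): mean of correct/appended per topic."""
    return mean_ratio(correct, appended)


# Hypothetical counts for two topics.
print(precision_at_5(list("abcde"), list("abxyz")))
print(average_append_rate({"a": 10, "b": 10}, {"a": 100, "b": 50}))
print(average_hit_rate({"a": 8, "b": 5}, {"a": 10, "b": 10}))
```

If the intended denominator in formula (7) is instead the total number of posts under the topic, only the second argument passed to `average_hit_rate` changes.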
Specific embodiment two: This embodiment differs from specific embodiment one in that, in step 4, the threshold for the append rate of microblog posts corresponding to the S topics is 0.1.
Specific embodiment three: This embodiment differs from specific embodiments one and two in that, in step 4, after the final output is obtained in step (7), steps (1) to (7) are repeated, with the S topics and their self-expanded related microblog posts output by step (7) as the initial input.
Example one:
Analysis of real microblog events shows that the emotion distribution of a microblog hot event changes over time. Take the "MERS invades Guangdong" event, one of the microblog hot events of May 2015: its emotion distribution line chart (from May 22 to June 4) is shown in Fig. 2, in which the five lines represent the changes of five emotions: happiness, anger, sadness, fear and surprise. Starting on May 22, after the first MERS case appeared in South Korea, microblog posts on the event began to be captured, and the event kept fermenting over the following days. The netizens' emotion distribution in the chart changes at time points such as May 26, May 29, May 30 and June 1.
Why do netizens show different emotion distributions toward this event as time passes? Systematic analysis shows the following. On May 26 the news reported "South Korean MERS carrier confirmed in Guangdong", and the prevailing emotions were fear and sadness. On May 29 a rumor spread on microblogs that "the condition of the virus patient under isolation treatment has worsened", and netizens' emotion shifted to mostly fear. On May 30 the news disclosed problems such as "regulatory loopholes and weak technology in South Korea's medical authorities", and the emotion shifted again to mostly anger. On June 1 major websites reported "the moving deeds of the doctors and nurses of the Huizhou hospital ICU, who drew lots for duty to keep the epidemic from spreading"; a wave of encouragement, support and praise surged across the country, and the share of happiness among the emotions rose.
The analysis of the "MERS invades Guangdong" event shows that the emotion distribution that changes over time actually represents netizens' different views on different aspects of the event. For example, in this event the "fear" distribution is directed at the news "virus patient's condition worsens", while "anger" is directed at the topic "regulatory loopholes in South Korea's medical authorities". The present invention defines the different aspects of discussion of an event as the topics of a single event.
As shown in Fig. 3, a Chinese microblog topic information processing method comprises the following steps:
Step 1: Determine the microblog posts related to a hot event;
Input the microblog posts related to a single hot event, preprocess the text using a language technology platform, and judge whether each post is relevant by a keyword matching method;
The present invention collected the microblog hot events that occurred between May 2015 and July 15 and their related microblog posts, as shown in Table 2;
Table 2: Part of the 15 microblog hot events used in the experiments
The second column of Table 2 is the name of the microblog hot event and the third column is the corresponding number of microblog posts; the total number of posts is 1,210,000.
Step 2: Discover the key topics of the microblog posts;
Mine the topic information in the hot-event microblog posts by counting the Hashtag information in them, where a Hashtag is topic information, i.e. the text between two "#" symbols in a microblog post;
For each event described in step 1, the 5 most popular topics representing different aspects were manually annotated as the standard answer; some examples are shown in Table 3.
Table 3: Annotation samples of the experimental data
Table 3 lists the "Qing'an County shooting incident" and its five annotated topics covering different aspects.
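The Hashtag counting described in step 2 can be sketched as follows. The regular expression and function names are illustrative assumptions: a Hashtag is taken to be any non-empty text between two "#" characters, and each post counts at most once per Hashtag.

```python
import re
from collections import Counter
from typing import Iterable, List

# Text enclosed by a pair of '#' characters, e.g. "#some topic#".
HASHTAG_RE = re.compile(r"#([^#]+)#")


def extract_hashtags(post: str) -> List[str]:
    """Return the hashtag strings found in one post, whitespace-trimmed."""
    return [tag.strip() for tag in HASHTAG_RE.findall(post) if tag.strip()]


def count_topics(posts: Iterable[str]) -> Counter:
    """Count, for each hashtag, the number of posts that contain it."""
    counts: Counter = Counter()
    for post in posts:
        counts.update(set(extract_hashtags(post)))
    return counts


# Hypothetical posts.
posts = ["#A# first post", "#A# second post #B#", "post without a tag"]
print(count_topics(posts).most_common())
```

Sorting the resulting counts gives the candidate topics ordered by the number of posts that carry them.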
Step 3: Topic clustering and ranking algorithm;
After the microblog posts related to the hot event are obtained, topic extraction and clustering-ranking are carried out first. Topic extraction means extracting and summarizing the topic information in the posts; topic clustering-ranking means first clustering topics that are partly similar. For example, the two topics "Ctrip official website hacked, or was it man-made" and "Ctrip server hacked, or was it man-made" are very similar. The merged topics are then ranked by how heatedly they are discussed. The flow chart of this part of the algorithm is shown in Fig. 4.
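Merging similar Hashtags can be sketched as a simple agglomerative procedure. This is an illustration under assumptions: the linkage criterion (single linkage) and the `threshold` value are hypothetical, since the text does not specify them, and `difflib.SequenceMatcher` from the Python standard library is used as a stand-in similarity rather than the patent's Hashtag similarity formula.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the patent's formula."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_hashtags(tags: List[str],
                     similarity: Callable[[str, str], float] = stand_in_similarity,
                     threshold: float = 0.5) -> List[List[str]]:
    """Repeatedly merge the two most similar clusters above the threshold."""
    clusters = [[tag] for tag in tags]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: best pairwise similarity between members.
                score = max(similarity(a, b)
                            for a in clusters[i] for b in clusters[j])
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]


# Hypothetical hashtags, loosely modeled on the example in the text.
tags = ["Ctrip official website hacked", "Ctrip server hacked", "haze documentary"]
print(cluster_hashtags(tags))
```

Any function with the same two-string signature can be passed as `similarity`, so a different similarity measure can be swapped in without changing the merging loop.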
(1) Hierarchical clustering algorithm
The invention describes two distance calculation methods for hierarchical clustering: the similarity method based on microblog term frequency/inverse document frequency (formula 1) and the method based on Hashtag string similarity (formula 2). The P@5 results of the topic clustering distance comparison experiment are shown in Table 4:
Table 4: Comparison experiment of topic clustering distance calculation methods
The results in Table 4 show that the accuracy of the TF/IDF method (the similarity method based on microblog term frequency/inverse document frequency) is far below that of the string-similarity method. The words used in the microblog posts related to a single event overlap heavily (for example, words such as "Ctrip" and "server" occur frequently in posts about the "Ctrip official website hacked" event), which interferes with the clustering process and makes the optimal similarity threshold for hierarchical clustering hard to determine. Using the Hashtag string similarity method instead, through the iterative process of hierarchical clustering, achieves good results on this task.
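A Hashtag string similarity built from the two ingredients named in the text, longest common subsequence and edit distance, can be sketched as below. The way the two normalized parts are combined (subtraction, no weights) is an assumption, because the formula itself is not reproduced in this text; the helper names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hashtag_similarity(h_a: str, h_b: str) -> float:
    """Assumed form: LCS/min(length) minus edit distance/max(length)."""
    if not h_a or not h_b:
        return 0.0
    shorter = min(len(h_a), len(h_b))
    longer = max(len(h_a), len(h_b))
    return lcs_length(h_a, h_b) / shorter - edit_distance(h_a, h_b) / longer


# Hypothetical hashtags.
print(hashtag_similarity("Ctrip website hacked", "Ctrip server hacked"))
print(hashtag_similarity("Ctrip website hacked", "haze documentary"))
```

Because it compares only the Hashtag strings, this measure is not affected by the heavy vocabulary overlap among posts of the same event that the paragraph above identifies as the weakness of TF/IDF.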
(2) Topic cluster ranking algorithm
The present invention carried out a comparison experiment on topic cluster ranking algorithms: ranking simply by microblog count (formula 3) versus ranking by the weighted relation between microblog count and the number of topics in the cluster result (formula 4). In this experiment the distance calculation for hierarchical clustering was fixed to the string similarity method. The experimental results are shown in Table 5:
Table 5: Comparison experiment of topic cluster ranking algorithms
The results in Table 5 show that using only the number of microblog posts in each cluster as the ranking index performs worse than the ranking method that also includes the number of topics in the cluster. The reason is that the number of topics merged into a single cluster also largely reflects how heatedly the topic is discussed: the more topics a cluster contains, the hotter the discussion.
Step 4: Microblog topic correction algorithm;
The clustering algorithm initially returns a series of ranked cluster results. Because clustering relies mainly on the surface form of the Hashtag strings in the posts, it tends to group together posts with similar Hashtags. In real corpora, however, there are many cases where the Hashtag does not match the post content, as shown in Table 6:
Table 6: Examples of mismatch between Hashtag and content in microblog posts
The two examples in Table 6 show that simply computing emotion distribution statistics over the posts carrying a Hashtag makes the emotion distribution of the topic inaccurate, and negative emotions can even appear in topics that the public clearly supports. For example, for "Chai Jing's haze video" in Table 6, netizens actually hold negative emotions toward "polluting enterprises" rather than toward the haze video.
The work of step 4 addresses this phenomenon. The goal is to perform topic correction on those posts whose Hashtag information contradicts the topic actually discussed in the post text (assigning event-related posts to the correct topic), thereby further self-expanding the posts related to each topic. The present invention defines this task as the microblog topic correction task. How many posts the algorithm adds to the data set on average, and what the hit rate of the appended posts is, are the indices for evaluating the quality of the algorithm. The flow chart of this part of the algorithm is shown in Fig. 5.
To assess these two aspects, two indices were designed: the average append rate of microblog count and the average hit rate of appended microblog posts. Because the algorithm is an iterative semi-supervised learning algorithm, line charts are used to present the experimental results visually, with the number of iterations on the horizontal axis and the two indices on the vertical axes, as shown in Fig. 6 and Fig. 7.
The experimental corpus is the microblog event data introduced in step 1. As shown in Fig. 6, during iteration the append rate rises from 11.6% at the start to a peak of 16.9% and then falls. Comparison of the experimental data shows the reason: in the early iterations, the number of training posts added to the classification model increases, which enlarges the model's feature set so that it can predict more samples; in the later iterations, too few samples remain to be predicted, so the append rate falls.
Fig. 7 was obtained by sampling and annotating the appended posts. The results show that the average hit rate of appended posts decreases steadily, from 88.5% in the first iteration to 44.5% in the ninth. The reason is that, as the model is updated iteratively, the training corpus keeps growing but training noise also keeps accumulating, which steadily weakens the classifier.
Finally, the present invention trades off the average append rate of microblog count against the average hit rate of appended posts and sets the number of iterations to 2, which gives the best combined result for the number of self-expanded posts and the hit rate of appended posts.

Claims (3)

1. A Chinese microblog topic information processing method, characterized in that the processing method comprises the following steps:
Step 1: Determination of microblogs related to a hot event;
Microblogs related to a single hot event are input, the text is preprocessed using the language technology platform, and a keyword matching method is used to judge whether each microblog is related;
Step 2: Discovery of key microblog topics;
The topic information in the hot-event microblogs is mined by counting the Hashtag information in the microblogs, wherein a Hashtag is topic information, namely the text between two "#" symbols in a microblog;
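Step 2 can be illustrated with a short Python sketch that extracts the text between two "#" symbols and counts how many microblogs mention each Hashtag. The regular expression, the function names and the choice to count each Hashtag at most once per microblog are assumptions made for this illustration.

```python
import re
from collections import Counter

# Text enclosed by a pair of "#" symbols, e.g. "#topic#".
HASHTAG_RE = re.compile(r"#([^#]+)#")


def extract_hashtags(text):
    """Return the non-empty Hashtag strings found in one microblog."""
    return [h.strip() for h in HASHTAG_RE.findall(text) if h.strip()]


def count_hashtags(microblogs):
    """Count, for each Hashtag, the number of microblogs that contain it."""
    counts = Counter()
    for text in microblogs:
        # A set is used so that a Hashtag repeated in one microblog counts once.
        counts.update(set(extract_hashtags(text)))
    return counts


posts = ["#topic one# some text", "more text #topic one# #topic two#", "no tag here"]
print(count_hashtags(posts).most_common())
```

A microblog with an unpaired "#" can produce a spurious match between unrelated "#" symbols; a production version would need an additional filter for that case.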
Step 3: Topic clustering and ranking algorithm;
After the microblogs related to the hot event are obtained, topic extraction and clustering-based ranking are performed first, wherein topic extraction means extracting and summarizing the topic information in the microblogs, and topic clustering and ranking means first clustering the topics that are partially similar;
(1) Hierarchical clustering algorithm
A Hashtag string similarity algorithm is used, i.e., string similarity serves as the basis for computing distance in clustering; the calculation formula is as follows:
where H_A and H_B are the Hashtag strings in S_A and S_B, S_A is microblog text A, S_B is microblog text B, LCS is the longest common subsequence of the two strings, and Edit Distance is the edit distance; the two string similarity values are normalized, i.e., the first and second parts of the formula are divided by min(Length(H_A), Length(H_B)) and max(Length(H_A), Length(H_B)) respectively;
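The two ingredients named above, the longest common subsequence and the edit distance with their min-length and max-length normalization, can be sketched in Python as follows. The formula itself is not reproduced in this text, so the way the two normalized terms are combined in hashtag_similarity (an equal-weight average of the LCS term and one minus the edit-distance term) is purely an assumption, as are all function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hashtag_similarity(h_a, h_b):
    """Similarity in [0, 1] built from the two normalized terms.

    The LCS term is divided by the shorter length and the edit-distance
    term by the longer length, as the text states. Averaging the two is
    an ASSUMPTION; the source formula is not available here.
    """
    if not h_a or not h_b:
        return 0.0
    lcs_term = lcs_length(h_a, h_b) / min(len(h_a), len(h_b))
    edit_term = edit_distance(h_a, h_b) / max(len(h_a), len(h_b))
    return 0.5 * (lcs_term + (1.0 - edit_term))


print(hashtag_similarity("MH370", "MH370失联"))
```

A hierarchical clustering step could then use 1 - hashtag_similarity(h_a, h_b) as the distance between two Hashtags.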
(2) Topic clustering result ranking algorithm
A weighted combination of the number of microblogs and the number of topics in the clustering result is used as the ranking formula;
RankingScore(topic) = log(topic_weibonumber) · topic_num (4)
where RankingScore(topic) is the ranking score of the topic, topic_weibonumber is the number of microblogs contained under the topic, and topic_num is the number of topics merged into the result; the microblog count is log-transformed;
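Formula (4) can be evaluated directly, as in the following Python sketch. The base of the logarithm is not stated in the text, so the natural logarithm used here is an assumption, and the cluster names and counts are made up for illustration.

```python
import math


def ranking_score(microblog_count, merged_topic_count):
    """Formula (4): log(microblog count) times the number of merged topics.

    The natural logarithm is an assumption; microblog_count must be positive.
    """
    return math.log(microblog_count) * merged_topic_count


# Hypothetical clusters: name -> (microblog count, number of merged topics).
clusters = {"topic_a": (1200, 4), "topic_b": (5000, 1), "topic_c": (300, 6)}

ranked = sorted(clusters, key=lambda name: ranking_score(*clusters[name]), reverse=True)
print(ranked)
```

Taking the logarithm of the microblog count keeps a single very large topic from outranking a cluster that merges many related topics.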
Step 4: Microblog topic correction algorithm;
(1) Initial input: the K topics resulting from topic clustering and ranking, comprising the first S topics and the last U topics;
(2) The first S topics are designated "seed topics" and the last U topics are designated "non-seed topics"; the U topics are divided, in order of their similarity to the S topics, into a to-be-predicted set U1 and a training negative-example set U2;
(3) Feature extraction and model training are performed on the corpus of the S topics;
(4) The trained model is used to predict the non-seed to-be-predicted set U1;
(5) Microblogs in U1 whose classification probability exceeds a threshold are added directly to the corresponding one of the S topics and are deleted from the to-be-predicted set U1;
(6) The loop restarts from step (2) and ends when the adding rate of the microblogs corresponding to the S topics falls below a threshold;
(7) Final output: the self-expanded S topics and their related microblogs;
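The iterative loop of steps (1) to (7) can be sketched in Python as a simple self-training procedure. This is only an illustration under several assumptions: train and predict are a toy word-overlap stand-in for the real feature extraction and classifier, the text is assumed to be already segmented into whitespace-separated words, the negative-example set U2 is omitted, the adding rate is computed over all seed topics together, and prob_threshold and max_rounds are hypothetical parameters. Only the adding-rate threshold of 0.1 comes from claim 2.

```python
from collections import Counter


def train(topics):
    """Toy model: one word-frequency profile per seed topic."""
    return {t: Counter(w for text in texts for w in text.split())
            for t, texts in topics.items()}


def predict(model, text):
    """Return (best topic, confidence); confidence is the winner's score share."""
    words = set(text.split())
    scores = {t: sum(profile[w] for w in words) for t, profile in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def self_expand(seed_topics, pending, prob_threshold=0.8,
                rate_threshold=0.1, max_rounds=10):
    """Move confidently classified microblogs from `pending` into seed topics."""
    topics = {t: list(texts) for t, texts in seed_topics.items()}
    pending = list(pending)
    for _ in range(max_rounds):
        model = train(topics)                      # step (3)
        size_before = sum(len(v) for v in topics.values())
        remaining = []
        for text in pending:                       # step (4)
            topic, prob = predict(model, text)
            if topic is not None and prob > prob_threshold:
                topics[topic].append(text)         # step (5)
            else:
                remaining.append(text)
        added = len(pending) - len(remaining)
        pending = remaining
        # Step (6): stop once the adding rate drops below the threshold.
        if size_before == 0 or added / size_before < rate_threshold:
            break
    return topics, pending                         # step (7)


seeds = {"quake": ["earthquake rescue", "quake rescue team"],
         "match": ["football match goal", "goal scored match"]}
expanded, left = self_expand(seeds, ["rescue quake", "zzz"])
print(expanded["quake"], left)
```

Because the model is retrained on the enlarged topics in every round, any misclassified microblog becomes training data for the next round, which is one way to read the falling hit rate reported in the description.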
Step 5: Evaluation using the Precision@5 indicator;
The Precision@5 indicator reflects the quality of the algorithm's ranking results, and the average adding rate of the number of microblogs and the average hit rate of the added microblogs are used as the evaluation indicators of the microblog self-expansion algorithm;
The Precision@5 indicator is the ratio of the number of correctly predicted topics among the top 5 of the ranking results to the number of top-5 standard-answer topics, i.e., formula (5):
The average adding rate of the number of microblogs is the average, over topics, of the adding rate of the microblogs related to each topic after self-expansion, i.e., formula (6):
The average hit rate of the added microblogs is the ratio of the number of correctly hit microblogs among those the algorithm appended to existing topics to the number of microblogs of the actual topic, i.e., formula (7):
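The Precision@5 indicator described for formula (5) can be sketched in Python as follows. The function name and the example lists are hypothetical, and the sketch assumes that a predicted topic counts as correct when it also appears among the top 5 topics of the standard answer.

```python
def precision_at_k(predicted, gold, k=5):
    """Share of the top-k standard-answer topics found in the top-k prediction.

    `predicted` and `gold` are ranked lists of topic names, best first.
    """
    top_gold = set(gold[:k])
    if not top_gold:
        return 0.0
    hits = len(set(predicted[:k]) & top_gold)
    return hits / len(top_gold)


predicted = ["t1", "t2", "t3", "t4", "t5", "t6"]
gold = ["t1", "t7", "t3", "t8", "t5", "t2"]
print(precision_at_k(predicted, gold))
```

Only membership in the top 5 is checked here; the order of topics inside the top 5 does not affect the score.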
2. The Chinese microblog topic information processing method according to claim 1, characterized in that in step 4 the threshold for the adding rate of the microblogs corresponding to the S topics is 0.1.
3. The Chinese microblog topic information processing method according to claim 1 or 2, characterized in that in step 4, after the final output is obtained in step (7), steps (1) to (7) are repeated, with the initial input being the self-expanded S topics and their related microblogs finally output by step (7).
CN201510627783.0A 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method Active CN105354216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510627783.0A CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510627783.0A CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Publications (2)

Publication Number Publication Date
CN105354216A CN105354216A (en) 2016-02-24
CN105354216B true CN105354216B (en) 2018-09-07

Family

ID=55330189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510627783.0A Active CN105354216B (en) 2015-09-28 2015-09-28 A kind of Chinese microblog topic information processing method

Country Status (1)

Country Link
CN (1) CN105354216B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740466A (en) * 2016-03-04 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for excavating incidence relation between hotspot concepts
CN105868415B (en) * 2016-05-06 2019-08-09 黑龙江工程学院 A kind of microblogging real time filtering model based on historical weibo
CN106503064B (en) * 2016-09-29 2019-07-02 中国国防科技信息中心 A kind of generation method of adaptive microblog topic abstract
CN110020147A (en) * 2017-11-29 2019-07-16 北京京东尚科信息技术有限公司 Model generates, method for distinguishing, system, equipment and storage medium are known in comment
CN109255121A (en) * 2018-07-27 2019-01-22 中山大学 A kind of across language biomedicine class academic paper information recommendation method based on theme class
CN109299280B (en) * 2018-12-12 2020-09-29 河北工程大学 Short text clustering analysis method and device and terminal equipment
CN110795943B (en) * 2019-09-25 2021-10-08 中国科学院计算技术研究所 Topic representation generation method and system for event
CN110852076B (en) * 2019-10-12 2023-05-30 云知声智能科技股份有限公司 Method and device for automatic disease code conversion
CN111046282B (en) * 2019-12-06 2021-04-16 北京房江湖科技有限公司 Text label setting method, device, medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706791A (en) * 2009-09-17 2010-05-12 成都康赛电子科大信息技术有限责任公司 User preference based data cleaning method
US8180775B2 (en) * 2010-06-23 2012-05-15 National Central University Computer-implemented method for clustering data and computer-readable medium encoded with computer program to execute thereof
CN103365910A (en) * 2012-04-06 2013-10-23 腾讯科技(深圳)有限公司 Method and system for information retrieval
CN104516947A (en) * 2014-12-03 2015-04-15 浙江工业大学 Chinese microblog emotion analysis method fused with dominant and recessive characters

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"中文微博情感倾向性分析特征工程";李泽魁等;《山西大学学报(自然科学版)》;20141115;第37卷(第4期);第570-579页 *
"基于半监督学习的微博情感倾向性分析";朱玺等;《山东大学学报(理学版)》;20141130;第49卷(第11期);第37-42页 *
"文本情感分析";赵妍妍等;《journal of software》;20100831;第21卷(第8期);第1834-1848页 *

Also Published As

Publication number Publication date
CN105354216A (en) 2016-02-24

Similar Documents

Publication Publication Date Title
CN105354216B (en) A kind of Chinese microblog topic information processing method
Unankard et al. Emerging event detection in social networks with location sensitivity
Yang et al. Emerging rumor identification for social media with hot topic detection
CN103177024A (en) Method and device of topic information show
CN110232149A (en) A kind of focus incident detection method and system
CN102929873A (en) Method and device for extracting searching value terms based on context search
Petkos et al. Two-level Message Clustering for Topic Detection in Twitter.
Berendsen et al. Pseudo test collections for training and tuning microblog rankers
CN108280057A (en) A kind of microblogging rumour detection method based on BLSTM
CN107180087B (en) A kind of searching method and device
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
Liang et al. Expert finding for microblog misinformation identification
CN113032557A (en) Microblog hot topic discovery method based on frequent word set and BERT semantics
Vu et al. Rumor detection by propagation embedding based on graph convolutional network
Ke et al. A novel approach for cantonese rumor detection based on deep neural network
CN113449204A (en) Social event classification method and device based on local aggregation graph attention network
Trummer et al. Mining subjective properties on the web
Liang et al. Enhancing content marketing article detection with graph analysis
Lin et al. Multiplex anti-Asian sentiment before and during the pandemic: introducing new datasets from Twitter mining
Campbell et al. Content+ context networks for user classification in twitter
CN111008285B (en) Author disambiguation method based on thesis key attribute network
Tang et al. Text semantic understanding based on knowledge enhancement and multi-granular feature extraction
Liu et al. An improved topic detection method for chinese microblog based on incremental clustering.
Zhou et al. Keyword extraction based on random forest and XGBoost-an example of fraud judgment document
Chen Improving the performance of Wikipedia based on the entry relationship between articles

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210425

Address after: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin jizuo technology partnership (L.P.)

Patentee after: Harbin Institute of Technology Asset Management Co.,Ltd.

Address before: 150001 Harbin, Nangang, West District, large straight street, No. 92

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20210611

Address after: Room 206-12, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.

Address before: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee before: Harbin jizuo technology partnership (L.P.)

Patentee before: Harbin Institute of Technology Asset Management Co.,Ltd.