CN107291886A - Microblog topic detection method and system based on an incremental clustering algorithm - Google Patents


Info

Publication number
CN107291886A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710473108.6A
Other languages
Chinese (zh)
Inventor
王萌
王晓荣
梁伟鄯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University of Science and Technology
Lushan College of Guangxi University of Science and Technology
Original Assignee
Guangxi University of Science and Technology
Lushan College of Guangxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University of Science and Technology and Lushan College of Guangxi University of Science and Technology
Priority to CN201710473108.6A
Publication of CN107291886A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a microblog topic detection method based on an incremental clustering algorithm. The method comprises the following steps: S1, obtaining a set of microblog information; S2, preprocessing the microblog information set; S3, after preprocessing, extracting feature words according to word occurrence frequency, the distribution of words across microblog texts, and the distribution of words across time windows; S4, assigning weights to the feature words and vectorizing each microblog text according to the feature words and their weights; S5, merging topics using a similarity test based on the distance between vectors. The microblog topic detection system used by the method comprises a microblog information collection module, a microblog preprocessing module, a feature word extraction module, a microblog text vectorization module, and a microblog topic analysis and merging module. The invention achieves good results in recall, accuracy and other measures, and runs markedly faster than the k-means method.

Description

A microblog topic detection method and system based on an incremental clustering algorithm
Technical field
The invention belongs to the technical field of topic detection, and in particular relates to a microblog topic detection method and system based on an incremental clustering algorithm.
Background
With the rapid development of Internet technology and its applications, particularly since the rise of Web 2.0, microblogs have attracted the attention and affection of more and more Internet users because they publish news promptly, spread it quickly and offer varied channels of dissemination. A microblog is a platform for sharing, disseminating and obtaining information based on user relationships; information can be shared and disseminated in real time over the Internet, the mobile Internet or various clients. A microblog post carries at most 140 characters and may be accompanied by picture, audio and video files, giving users rich and diverse ways to share and spread information. Microblogs have become an important platform on which Internet users express all kinds of sentiment. Today, as the state steadily steps up its crackdown on online rumors, how to manage microblogs effectively has become an important subject in the information security field, and the discovery and screening of microblog topics on the network has become a hot research problem.
In the field of natural language understanding, topic detection and tracking (TDT) has been studied for many years. Its goal is to detect relevant information and track how events develop and change. Its research covers two main parts. The first is topic detection, which clusters documents on the same topic across multiple document collections. The second is topic tracking, which follows the related events under a given topic in chronological order. With the rapid development of microblogs, some researchers have carried TDT research over from traditional text to microblogs, using topic detection and tracking to discover hot microblog topics promptly and to follow their progress in real time. Compared with traditional text, microblog text is short, written by users of widely varying backgrounds, non-standard in vocabulary, loose in format and strongly colloquial, and these characteristics make microblog topic detection very difficult. For these reasons, although topic detection has been studied for many years, the diversity of data acquisition and the uncertainty of feature extraction mean that current work concentrates mainly on related areas such as news reports, and research on microblog topic detection is comparatively scarce.
As microblogs have spread rapidly and come to play an increasingly important role in online life, scholars in China and abroad have begun to study microblog data, particularly the detection of hot microblog topics. Rui Long et al. proposed a method for effective event detection and tracking on microblog data, in which topic words are determined by cluster analysis of microblog data features. Ramage et al. used a labeled latent semantic model to map microblog text onto four latent dimensions, ranked microblogs according to the analysis results, and obtained hot topics from the ranking. Ma Bin et al. labeled microblog data features using a thread tree, clustered the microblog data with a bilateral clustering method, and derived microblog topics from the clustering result. Zheng Feiran et al. detected online the keywords that occur in large numbers in microblog messages and obtained hot microblog topics by clustering those keywords. Xue Suzhi et al. exploited regularities inherent in microblogs, using the growth rates of different topic words within the same time window to mine the hot topic words of that period, and clustered the hot topic words to produce hot topics.
Traditional topic detection models perform poorly in the noisy environment of microblog discussion. The main reason is that a microblog post consists of no more than 140 characters and therefore carries far less content than traditional text, while also containing special forms such as "#topic#" and "@user". In addition, as a social networking tool, microblogs contain a large amount of Internet slang that rarely appears in traditional text, such as "童鞋" (slang for "classmates"), "老鸟" ("old bird", a veteran) and "妹纸" (slang for "girl"). Microblog text also differs greatly from traditional text in structure: because it is so short, modeling it with the vector space model (VSM) inevitably produces sparse feature vectors. Consequently, hot microblog topic detection differs significantly from traditional topic detection models in text preprocessing, in feature extraction, and in the clustering of hot topics.
The related documents we found on microblog topic detection are as follows:
1. Application number 201110164560.7, title: Microblog topic detection method and system. The method includes the steps: S1, segmenting microblog text into words; S2, constructing microblog text threads and a microblog text forest; S3, for a specific microblog text thread, performing microblog topic analysis to find the main topic and the noise topics in the thread; S4, for each microblog text thread, merging the microblog texts in its main topic to generate one thread text per thread; S5, performing global microblog topic analysis to detect global microblog topics and form a microblog topic library. The shortcoming of that invention is that it must first construct microblog text threads, form a microblog text forest, and carry out a large amount of microblog topic analysis to form the topic library. Its effect may be clear in specific domains, but its processing is likely to be comparatively slow on the mobile Internet, where microblog volume grows explosively, and it may not be effective at discovering timely hot topics.
2. Application number 201310017814.1, title: A real-time microblog hot event detection method and system based on a monitoring subnet. The method includes: 1) building a microblog monitoring subnet containing a small number of key users, based on user activity, influence and response time; 2) collecting microblog data in real time, gathering at fixed intervals the new microblogs posted by all users of the monitoring subnet; 3) segmenting the collected new microblogs into words and merging topics; 4) building, querying and updating a topic list; 5) within a certain time window, deciding hot events according to the change in the number of participants in a topic in the topic list. The shortcoming of that invention is that its main work concentrates on monitoring microblog events and users; in discovering microblog topics, it judges mainly by the different characteristics of the monitored sample users.
3. Application number 201310177797.8, title: Method and device for extracting hot information based on microblogs. The method includes: obtaining a microblog data set; extracting feature information from the set, the feature information comprising text features, temporal features and social-network features; clustering into one or more topics according to the text, temporal and social-network features; and extracting the key event factors of each topic and composing hot information from those factors. The shortcoming of that invention is that it mainly extracts hot information from microblogs rather than detecting microblog topics; hot information extraction can be regarded as one part of microblog topic detection.
It is therefore particularly necessary to provide a fast and accurate microblog topic detection method and system that improves the microblog search hit rate.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a fast and accurate microblog topic detection method and system, thereby improving the microblog search hit rate.
The invention is achieved through the following technical solutions:
A microblog topic detection method based on an incremental clustering algorithm, comprising the following steps:
S1, obtaining a set of microblog information;
S2, preprocessing the microblog information set;
S3, after preprocessing, extracting feature words according to word occurrence frequency, the distribution of words across microblog texts, and the distribution of words across time windows;
S4, assigning weights to the feature words and vectorizing each microblog text according to the feature words and their weights;
S5, merging topics using a similarity test based on the distance between vectors.
Further, the microblog information in steps S1 and S2 includes user information and microblog text.
Further, the preprocessing in S2 is carried out as follows:
(1) delete the information of users whose follower count is below a threshold F;
(2) ignore directed conversational interaction (posts addressed to a specific user);
(3) segment the microblog text into words and retain verbs, nouns and adjectives.
Further, the threshold F is set to 30.
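The three preprocessing rules above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the `Post` record, the `preprocess` function, the use of "@" as the marker of directed interaction, and the tag prefixes "v", "n" and "a" for verbs, nouns and adjectives are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical record layout: the text only says that microblog information
# contains user information and microblog text.
@dataclass
class Post:
    text: str
    followers: int
    tokens: list  # (word, pos_tag) pairs produced by some word segmenter

# Assumed tag prefixes for verbs, nouns and adjectives.
KEEP_TAGS = ("v", "n", "a")

def preprocess(posts, f_threshold=30):
    """Apply rules (1)-(3) and return one word list per surviving post."""
    kept = []
    for post in posts:
        # Rule (1): drop posts from users with fewer than F followers.
        if post.followers < f_threshold:
            continue
        # Rule (2): drop directed interaction, detected here by an "@" mention.
        if "@" in post.text:
            continue
        # Rule (3): keep only verbs, nouns and adjectives.
        words = [w for w, tag in post.tokens if tag.startswith(KEEP_TAGS)]
        if words:  # extra guard (an assumption): skip posts left with no words
            kept.append(words)
    return kept
```

A prefix match is used for the tags so that subtypes such as a place-name noun tag are retained along with plain nouns; a stricter whitelist of exact tags would work equally well.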
Further, the extraction of feature words in step S3 specifically includes:
1. Establish the word occurrence frequency model df_i, formula (1),
where |D| is the total number of microblog texts after preprocessing, |{d : t_i ∈ d}| is the number of texts in which word i appears, d is a document in the document collection, and t_i is the number of times a word appears across the whole document collection.
Determine the average word occurrence frequency E, which is the average of the occurrence frequencies of the words in the preprocessed microblog texts:
E = (1/k_1) Σ_k df_k, (2)
where k_1 is the number of words to be processed after preprocessing.
Words whose occurrence frequency is not less than the average E proceed to the next extraction step.
2. Establish the distribution model GT(i) of words over the microblog text collection, computed with maximum information entropy by formula (3), in which:
H(d|t_i) = -Σ_j p(j|i) log p(j|i), (4)
p(j|i) = tf_ij / gf_i, (5)
H(d) = log N_t, (6)
where tf_ij is the number of times word i appears in microblog text j, gf_i is the total number of times word i appears in the microblog text collection remaining after step 1 of S3, and N_t is the total number of microblog texts remaining after step 1 of S3. Determine the average G of the maximum information entropy:
G = (1/k_2) Σ_k GT(k), (7)
where k_2 is the number of words to be processed after step 1 of S3. Words whose text-distribution maximum information entropy is less than the average G proceed to the next extraction step.
3. Establish the distribution model of words over time windows, computed with maximum information entropy:
TT(i) = -Σ p_i log(p_i), (8)
p_i = tf_i / gf_i, (9)
where tf_i is the frequency of word i in a given time window and gf_i is the total number of times word i appears in the microblog text collection remaining after step 2 of S3. Determine the average H of the maximum information entropy:
H = (1/k_3) Σ_k TT(k), (10)
where k_3 is the number of words to be processed after step 2 of S3. Words whose time-distribution maximum information entropy is less than the average H are taken as feature words.
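The entropy-based screening in steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the patented computation: the helper names `entropy` and `keep_below_mean` are hypothetical, the logarithm is taken to base 10, each word's total count is assumed to equal the sum of its per-window counts, and the sample counts are invented for the example.

```python
import math

def entropy(counts):
    """Return -sum(p * log10(p)) with p = count / total, skipping zero counts.

    Assumes at least one count is positive.
    """
    total = sum(counts)
    return -sum((c / total) * math.log10(c / total) for c in counts if c > 0)

def keep_below_mean(scores):
    """Keep the words whose score is strictly below the mean of all scores."""
    mean = sum(scores.values()) / len(scores)
    return {word: s for word, s in scores.items() if s < mean}

# Invented per-window occurrence counts for three words over four windows.
window_counts = {
    "summit": [9, 1, 0, 0],    # concentrated in one window
    "weather": [3, 2, 3, 2],   # spread evenly over all windows
    "visit": [5, 5, 0, 0],     # concentrated in two windows
}
scores = {word: entropy(c) for word, c in window_counts.items()}
feature_words = keep_below_mean(scores)
```

A word that is concentrated in a few windows has low entropy and survives the filter, while a word spread evenly over time has high entropy and is dropped, which matches the intent of selecting words below the average.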
Further, the vectorized form of a microblog is D_i(t_1, W_1; t_2, W_2; …; t_i, W_i), where t_i is a feature item and W_i is its weight, with W_i taking values in [0, 1]. Here tf_ij is the number of times feature item t_i appears in microblog document D_i, N is the total number of microblog texts after step S3, and n_i is the number of texts containing feature item t_i.
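The description later names the classical TF-IDF method (W_ik = TF_ik · IDF_k) with normalization into [0, 1] as the weighting scheme. The sketch below shows one common textbook variant, assumed here for illustration only: the inverse document frequency is taken as log10(N / n + 0.01) and each vector is divided by its Euclidean length. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, features):
    """Return one length-normalized TF-IDF vector per document.

    docs     -- list of word lists (preprocessed posts)
    features -- ordered list of feature words
    """
    n_docs = len(docs)
    # n for each feature: number of documents that contain it.
    doc_freq = {t: sum(1 for d in docs if t in d) for t in features}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [
            tf[t] * math.log10(n_docs / doc_freq[t] + 0.01) if doc_freq[t] else 0.0
            for t in features
        ]
        # Dividing by the vector length keeps every weight within [0, 1].
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors

docs = [["china", "visit", "china"], ["visit", "trade"], ["weather"]]
vectors = tfidf_vectors(docs, ["china", "visit", "trade"])
```

The small constant inside the logarithm keeps the weight of a word that occurs in every document slightly above zero; a post containing no feature word maps to the zero vector.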
Further, in step S5, topic merging using the similarity test based on the distance between vectors proceeds as follows:
(1) Set a microblog post counter with initial value 0. Read the vectorized microblog posts one at a time. If the counter is 0, the post is the first one: directly create a new topic class C_l and increase the counter by 1. Otherwise go to the next step.
(2) After a new post D_i is read, first compute its distance d(D_i, C_l) to every existing topic center and select the minimum, denoted d_max(D_i, C_l). Test whether d_max(D_i, C_l) < ρ holds. If it does, merge the new post D_i into topic class C_l and recompute the topic center point. If it does not, treat microblog text D_i as a new topic class. Here ρ is the distance threshold between microblog posts, μ_1 is a weighting parameter, and k'_1 is the number of microblogs in cluster l. Increase the counter by 1.
(3) When the last microblog post has been processed, the algorithm terminates.
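The single-pass procedure in steps (1) to (3) can be sketched as below. Several details are assumptions, since the distance and center-update formulas are not reproduced in this text: Euclidean distance is used for d(D_i, C_l), the center is updated as a plain running mean instead of the weighted update with μ_1, and ρ is passed in as a fixed number. The names `euclidean` and `incremental_cluster` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def incremental_cluster(vectors, rho):
    """Assign each vector to the nearest topic center or open a new topic.

    Returns (labels, centers); labels[i] is the topic index of vectors[i].
    """
    centers, sizes, labels = [], [], []
    for vec in vectors:
        if centers:
            dists = [euclidean(vec, c) for c in centers]
            nearest = min(range(len(centers)), key=dists.__getitem__)
            if dists[nearest] < rho:
                # Merge: running-mean update of the center (an assumption).
                n = sizes[nearest]
                centers[nearest] = [
                    (c * n + x) / (n + 1) for c, x in zip(centers[nearest], vec)
                ]
                sizes[nearest] = n + 1
                labels.append(nearest)
                continue
        # First post, or no center within rho: start a new topic class.
        centers.append(list(vec))
        sizes.append(1)
        labels.append(len(centers) - 1)
    return labels, centers

labels, centers = incremental_cluster(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], rho=0.5
)
```

Each post is compared only with the current topic centers, so the data is read once and the number of topics does not need to be fixed in advance.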
Further, the ρ value is computed by one of two models: (1) when historical data is available; (2) when no historical data is available. In these models, m is the dimension of the microblog vector space, n is the number of clusters already formed, k is the number of microblog posts contained in cluster center C_l, W_ij is the weight of each word in microblog document D_i, and W_1j is the weight of a word in cluster C_1.
A microblog topic detection system used by the microblog topic detection method based on an incremental clustering algorithm, characterized in that it comprises the following modules:
(1) a microblog information collection module, for collecting in real time all new posts on the microblog platform within each time period;
(2) a microblog preprocessing module, for filtering noise data out of the microblog text;
(3) a feature word extraction module, for extracting feature words according to word occurrence frequency and word distribution;
(4) a microblog text vectorization module, for vectorizing microblog text according to the feature words and their weights, in the form D_i(t_1, W_1; t_2, W_2; …; t_i, W_i), where t_i is a feature item and W_i is its weight;
(5) a microblog topic analysis and merging module, for judging similarity by the distance between vectors and then merging topics that are close in distance.
Further, the microblog preprocessing module comprises a user information processing module, a directed conversational interaction processing module and a word segmentation module. The user information processing module deletes the information of users whose follower count is below the threshold F. The directed conversational interaction processing module ignores directed conversational interaction. The word segmentation module segments the microblog text into words and retains or deletes words of specified parts of speech.
The feature word extraction module comprises a word frequency processing module and a word distribution processing module. The word frequency processing module counts word occurrence frequencies and can screen words by a specified frequency. The word distribution processing module computes the distribution of words across microblog texts or time windows and can screen words by a specified distribution value.
Compared with the prior art, the beneficial effects of the invention are as follows. Preprocessing of microblog accounts and Internet slang reduces the amount of noise in the microblog data; feature words are extracted using the part of speech, occurrence frequency and distribution of microblog words; and microblog topics are detected with an incremental clustering model. Experiments show that the method achieves good results in recall, accuracy and other measures, and runs markedly faster than the k-means method.
Brief description of the drawings
Fig. 1 is a flow diagram of the microblog topic detection method of the invention;
Fig. 2 is a structural diagram of the microblog topic detection system of the invention;
Fig. 3 compares the performance indices of the incremental clustering algorithm of the invention and the k-means clustering method;
Fig. 4 compares the running time of the incremental clustering algorithm of the invention and the k-means clustering method.
Detailed description
The invention is described in further detail below with reference to an embodiment, but the embodiments of the invention are not limited to the scope represented by this embodiment.
Embodiment:
A microblog topic detection method based on an incremental clustering algorithm, as shown in Figs. 1 and 2, comprises the following steps:
S1, for ease of exposition, a set of microblog information is obtained from the network, as shown in Table 1:
Table 1
S2, preprocess the microblog information set:
(1) Delete the information of users whose follower count is below the threshold F. The 7th microblog is excluded from the posts to be processed because its author has fewer than 30 followers.
(2) Ignore directed conversational interaction. In this embodiment, the 7th microblog is excluded from the posts to be processed because it contains "@user".
(3) Segment the microblog text into words and retain verbs, nouns and adjectives. The method uses the 2013 version of ICTCLAS, the Chinese Academy of Sciences Chinese word segmentation system. For example, the segmentation result for "China maintaining development is fully in line with U.S. interests." is "China/ns maintain/v development/v fully/ad conform/v U.S./ns interests/n ./w". The result contains not only the segmented words but also the part of speech of each word. After segmentation, the microblog content is checked for noise such as Internet slang or colloquial vocabulary. Filtering out this noise leaves microblog content that is as genuine as possible for topic detection, which improves accuracy. In this embodiment, the 8th microblog is likewise excluded from the posts to be processed because it contains Internet slang and carries too little information.
After the microblogs have been preprocessed, feature words are extracted (S3):
1. Establish the word occurrence frequency model df_i, formula (1),
where |D| is the total number of microblog texts after preprocessing, |{d : t_i ∈ d}| is the number of texts in which word i appears, d is a document in the document collection, and t_i is the number of times a word appears across the whole document collection.
Determine the average word occurrence frequency E, which is the average of the occurrence frequencies of the words in the preprocessed microblog texts:
E = (1/k_1) Σ_k df_k, (2)
where k_1 is the number of words to be processed after preprocessing and k denotes the word currently being processed.
Words whose occurrence frequency is not less than the average E proceed to the next extraction step.
In this embodiment, |D| = 7; the word "relation" appears in 3 documents, so |{d : t_i ∈ d}| = 3; df_i = 7 and E = 0.1853. Because the occurrence frequency of the word is not less than the average E, it proceeds to the next extraction step.
2. Establish the distribution model GT(i) of words over the microblog text collection, computed with maximum information entropy by formula (3), in which:
H(d|t_i) = -Σ_j p(j|i) log p(j|i), (4)
p(j|i) = tf_ij / gf_i, (5)
H(d) = log N_t, (6)
where tf_ij is the number of times word i appears in microblog text j, gf_i is the total number of times word i appears in the microblog text collection remaining after step 1 of S3, and N_t is the total number of microblog texts remaining after step 1 of S3. Determine the average G of the maximum information entropy:
G = (1/k_2) Σ_k GT(k), (7)
where k_2 is the number of words to be processed after step 1 of S3. Words whose text-distribution maximum information entropy is less than the average G proceed to the next extraction step.
In this embodiment, the word "relation" appears tf_ij = 2 times in the first microblog text and gf_i = 7 times in total in the microblog text collection remaining after step 1 of S3, with N_t = 7; then p(j|i) = 0.2857, H(d|t_i) = 0.4315 and GT(i) = 0.2622. The number of words to be processed after step 1 of S3 is k_2 = 38, giving G = 0.689. Words whose text-distribution maximum information entropy is less than the average G proceed to the next extraction step.
3. Establish the distribution model of words over time windows, computed with maximum information entropy:
TT(i) = -Σ p_i log(p_i), (8)
p_i = tf_i / gf_i, (9)
where tf_i is the frequency of word i in a given time window and gf_i is the total number of times word i appears in the microblog text collection remaining after step 2 of S3. Determine the average H of the maximum information entropy:
H = (1/k_3) Σ_k TT(k), (10)
where k_3 is the number of words to be processed after step 2 of S3. Words whose time-distribution maximum information entropy is less than the average H are taken as feature words.
In this embodiment, the word "China" occurs 5 times in a 4-hour time window and 6 times in total in the microblog text collection remaining after step 2 of S3, so p_i = 0.8333 and TT(i) = 0.0659. The number of words to be processed after step 2 of S3 is 33, giving H = 0.1031. The time-distribution maximum information entropy of "China", "problem" and "interests" is less than the average H, so these words are selected as feature words.
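The figures quoted for the word "China" can be checked numerically. The short script below evaluates the single-window term of formula (8) with the counts from the example. The base of the logarithm is not stated in the text; base 10 is assumed here because it reproduces the quoted TT(i) to within rounding, and the variable names are chosen for the illustration only.

```python
import math

tf_window = 5     # occurrences of the word in the 4-hour window (from the example)
gf_total = 6      # total occurrences after step 2 of S3 (from the example)
mean_h = 0.1031   # average H quoted in the example

p = tf_window / gf_total          # probability of the word in this window
tt = -p * math.log10(p)           # single-window term of formula (8), base 10 assumed

print(f"p = {p:.4f}, TT = {tt:.4f}, below average: {tt < mean_h}")
```

With these inputs p is 5/6 and the entropy term comes out just under 0.066, which is below the quoted average of 0.1031, so the word passes the time-window filter.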
After the feature words are selected, each microblog is represented in a word vector space model over the feature words. The microblog processing counter is set to 0 and the vectorized posts are read one at a time. If a post is the first one, a new topic class is created directly. If not, the distance between the post and each existing topic class is calculated. After the calculation, the minimum distance d_max(D_i, C_l) is selected and the condition d_max(D_i, C_l) < ρ is tested. If it holds, the new post D_i is merged into topic class C_l and the new topic center point is recalculated. If it does not hold, microblog text D_i is treated as a new topic class. The ρ value is computed by one of two models: (1) when historical data is available; (2) when no historical data is available. After clustering is complete, posts 1, 2, 3, 4 and 5 of the above example are grouped into one class, which mainly consists of posts about the visit to China by US Vice President Biden. In this way, recent hot topics can be found among a large number of posts.
Each feature item of a text is given a weight W_k that represents the importance of the feature item in the text. The specific form of a microblog text is D_i(t_1, W_1; t_2, W_2; …; t_n, W_n), where t_i is a feature item, selected using the algorithm of part 2. The weight W_i of feature item t_i is computed with the classical TF-IDF weighting method, i.e. W_ik = TF_ik · IDF_k. Because text length affects feature item weights, the weights are normalized so that each lies in [0, 1], typically using the following formula,
where tf_ij is the number of times feature item t_i appears in microblog document D_j, N is the total number of texts, and n_j is the number of texts containing feature item t_j.
For comparison, the k-means clustering method also builds the word vector space model after the feature words are selected. Because the number of clusters is unknown, clustering generally has to start from 2 classes and be repeated with increasing cluster counts up to an upper limit. For each run: (1) from the |D| data objects, arbitrarily select k objects as the initial cluster centers; (2) according to the mean (center object) of each cluster, calculate the distance between every object and these center objects, and reassign each object according to the minimum distance; (3) recalculate the mean (center object) of each cluster that has changed; (4) calculate the standard measure function; when a certain condition is met, such as convergence of the function, the algorithm terminates; otherwise return to step (2). Because k-means selects its initial centers at random, its results fluctuate considerably. Moreover, it must run k-means clustering many times to obtain a result, whereas incremental clustering does not, so its running time is also longer.
As Fig. 3 shows, the incremental clustering algorithm improves considerably on the k-means algorithm across the indices, with some performance indices improving by more than 20%. This is mainly because the incremental clustering algorithm removes a large number of irrelevant words in the preprocessing stage, takes full account of the characteristics of microblog information in feature selection, and takes full account of historical information in the incremental clustering stage; this enriches the feature information and markedly improves the performance of the detection algorithm. The traditional k-means algorithm does not use the various kinds of structured information in microblogs, and together with the sparsity of microblog features this prevents it from achieving satisfactory performance. Meanwhile, although introducing multiple kinds of feature information into the microblog topic detection algorithm increases its complexity, the improved incremental clustering algorithm reduces the number of clustering passes and so improves running speed.
Fig. 4 mainly compares run time of two different systems in microblog topic detection in the case of various number of documents. The generation of different document quantity passes through the document progress that correlated measure is randomly selected in corpus.Fig. 4 gives different system words Inscribe the comparison the time required to being run under detection algorithm.From operation result as can be seen that incremental clustering algorithm is calculated with respect to k-means Method, run time is less in the case of identical document quantity.This is due to be avoided k- after incremental clustering algorithm is by improvement The multiplicating cluster of different initial points in means algorithms, reduces cluster number of times, thus the time required to topic detection on Significantly decrease.
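For comparison, the four k-means steps described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in the experiments: the function name `kmeans`, the `seed` parameter, and the use of "assignments stopped changing" as the convergence condition are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # (1) arbitrarily select k objects as the initial cluster centres
    centers = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # (2) assign every object to the centre at minimum distance
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # (4) assumed stopping rule: no assignment changed since the last pass
        if new_labels == labels:
            break
        labels = new_labels
        # (3) recompute the mean (centre object) of every cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers
```

Because the initial centres come from a random draw, different seeds can give different partitions on less clearly separated data, which is the fluctuation the description refers to.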

Claims (10)

1. A microblog topic detection method based on an incremental clustering algorithm, characterized in that the method comprises the following steps:
S1, obtaining a microblog information set;
S2, preprocessing the microblog information set;
S3, after the preprocessing operation, extracting feature words according to the word occurrence frequency, the distribution of words over the microblog texts, and the distribution of words over time windows;
S4, assigning weights to the feature words, and vectorizing the microblog texts according to the feature words and their weights;
S5, merging topics using a similarity judgment method based on the distance between vectors.
2. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that: the microblog information in steps S1 and S2 includes user information and microblog text.
3. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that: the preprocessing in S2 is carried out according to the following steps:
(1) deleting the user information whose audience (follower) count is less than a threshold F;
(2) ignoring directed dialogue interaction information;
(3) segmenting the microblog text into words, and retaining verbs, nouns and adjectives.
4. The microblog topic detection method based on an incremental clustering algorithm according to claim 3, characterized in that: the threshold F is set to 30.
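The preprocessing of claims 3 and 4 can be illustrated with a short Python sketch. Several details here are assumptions rather than part of the claims: the `Post` record and its field names are hypothetical, a post is treated as directed dialogue when its text starts with `@`, and the part-of-speech tags `v`, `n`, `a` stand in for whatever tag set the word segmenter actually produces.

```python
from dataclasses import dataclass

F = 30                      # audience threshold given in claim 4
KEPT_POS = {"v", "n", "a"}  # assumed tags for verb, noun, adjective


@dataclass
class Post:
    """Hypothetical record for one microblog post."""
    user_followers: int
    text: str
    tagged_tokens: list  # (word, pos) pairs produced by any segmenter


def preprocess(posts, min_followers=F):
    """Return one list of retained words per surviving post."""
    kept = []
    for post in posts:
        # (1) drop posts from users whose audience count is below F
        if post.user_followers < min_followers:
            continue
        # (2) ignore directed dialogue; assumed here to mean posts starting with '@'
        if post.text.lstrip().startswith("@"):
            continue
        # (3) keep only verbs, nouns and adjectives
        words = [w for w, pos in post.tagged_tokens if pos in KEPT_POS]
        if words:
            kept.append(words)
    return kept
```

Taking already tagged tokens as input keeps the sketch independent of any particular Chinese word segmentation library.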
5. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that: the feature word extraction of step S3 specifically includes:
① establishing a word occurrence frequency model:
$$df_i = \frac{|(d : t_i \in d)|}{|D|}, \tag{1}$$
where $|D|$ denotes the total number of preprocessed microblog texts, $|(d : t_i \in d)|$ denotes the number of texts in which word $i$ occurs, $d$ denotes a document in the document collection, and $t_i$ denotes the number of times a given word occurs in the whole document collection;
determining the average word occurrence frequency $E$, whose value is the average number of occurrences of the words in the preprocessed microblog texts to be processed; $E$ satisfies:
$$E = \frac{\sum_{i=1}^{k} df_i}{k_1}, \tag{2}$$
where $k$ is the word currently being processed and $k_1$ is the number of words to be processed after preprocessing;
selecting the words whose occurrence frequency is not less than the average $E$ and passing them to the next extraction step;
② establishing a model of the distribution of words over the microblog text collection, calculated using maximum information entropy:
$$GT(i) = \frac{H(d \mid t_i)}{H(d)}, \tag{3}$$
in which:
$$H(d \mid t_i) = -\sum p(j \mid i) \log p(j \mid i), \tag{4}$$
$$p(j \mid i) = \frac{tf_{ij}}{gf_i}, \tag{5}$$
$$H(d) = \log N_t, \tag{6}$$
where $tf_{ij}$ denotes the number of times word $i$ occurs in microblog text $j$, $gf_i$ denotes the total number of times word $i$ occurs in the microblog text collection remaining after step ① of S3, and $N_t$ denotes the total number of microblog texts remaining after step ① of S3; meanwhile, the average $G$ of the maximum information entropy is determined and satisfies:
$$G = \frac{\sum_{i=1}^{k} GT(i)}{k_2}, \tag{7}$$
where $k_2$ is the number of words to be processed after step ① of S3; selecting the words whose distribution entropy is less than the average $G$ and passing them to the next extraction step;
③ establishing a model of the distribution of words over time windows, calculated using maximum information entropy:
$$TT(i) = -\sum p_i \log(p_i), \tag{8}$$
In the formula, $tf_i$ denotes the frequency of occurrence of word $i$ in a given time window, and $gf_i$ denotes the total number of times word $i$ occurs in the microblog text collection remaining after step ② of S3; meanwhile, the average $H$ of the maximum information entropy is determined,
$$H = \frac{\sum_{i=1}^{k} TT(i)}{k_3}, \tag{10}$$
where $k_3$ is the number of words to be processed after step ② of S3; the words whose time-distribution entropy is less than the average $H$ are selected as the feature words.
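The three filtering stages of claim 5 can be sketched in Python as below. This is an illustrative reading of equations (1) to (8) and (10), with stated assumptions: the function names are hypothetical, $N_t$ is taken to be the number of documents passed in, and the window probability in stage ③ is assumed to be the word's count in a window divided by its total count, by analogy with equation (5), since the defining equation is not reproduced in the text.

```python
import math
from collections import Counter


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts)


def stage1(docs):
    """Eq. (1)-(2): keep words whose document frequency is >= the average E."""
    n = len(docs)
    df = {w: c / n for w, c in Counter(w for d in docs for w in set(d)).items()}
    e = mean(df.values())
    return {w for w, v in df.items() if v >= e}


def stage2(docs, words):
    """Eq. (3)-(7): keep words whose GT(i) is below the average G.
    Needs at least two documents, otherwise H(d) = log(1) = 0."""
    h_d = math.log(len(docs))  # assumption: N_t is the number of documents given
    gt = {w: entropy([d.count(w) for d in docs]) / h_d for w in words}
    g = mean(gt.values())
    return {w for w, v in gt.items() if v < g}


def stage3(docs, windows, words):
    """Eq. (8), (10): keep words whose time-window entropy is below the average H.
    Assumed p_i: count of the word in a window divided by its total count."""
    tt = {}
    for w in words:
        per_window = Counter()
        for d, win in zip(docs, windows):
            per_window[win] += d.count(w)
        tt[w] = entropy(per_window.values())
    h = mean(tt.values())
    return {w for w, v in tt.items() if v < h}
```

Here `docs` is a list of word lists and `windows` gives the time-window index of each document. Because stages ② and ③ use a strict "less than the average" test, a stage that receives words with identical scores keeps none of them.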
6. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that: the vectorized representation of a microblog is $D_i(t_1, W_1; t_2, W_2; \ldots; t_i, W_i)$, where $t_i$ denotes a feature item and $W_i$ denotes its weight, the value of $W_i$ lying in the range [0, 1],
$$W_i = \frac{tf_{ij} \times \log\left(\frac{N}{n_i} + 0.01\right)}{\sqrt{\sum_{i} \left[ tf_{ij} \times \log\left(\frac{N}{n_i} + 0.01\right) \right]^2}}, \tag{11}$$
where $tf_{ij}$ is the number of times feature item $t_i$ occurs in microblog document $D_i$, $N$ is the total number of microblog texts after step S3, and $n_i$ is the number of texts containing feature item $t_i$.
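Equation (11) is a length-normalized TF-IDF weight, which can be sketched in Python as follows. The function name `tfidf_vector` and its parameters are hypothetical, and the normalizing sum is assumed to run over all feature items of the document.

```python
import math


def tfidf_vector(doc, feature_words, doc_freq, n_docs):
    """Weights W_i of equation (11) for one document.

    doc           -- list of words in the microblog text
    feature_words -- ordered list of feature items t_i
    doc_freq      -- dict: feature item -> number of texts containing it (n_i > 0)
    n_docs        -- total number of texts N
    """
    raw = [
        doc.count(t) * math.log(n_docs / doc_freq[t] + 0.01)
        for t in feature_words
    ]
    # normalize by the Euclidean length so every weight falls in [0, 1]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]
```

Since $N / n_i \ge 1$, the logarithm is at least $\log 1.01 > 0$, so the raw weights are never negative and the normalized weights stay within [0, 1] as the claim requires.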
7. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that: in step S5, topics are merged using the similarity judgment method based on the distance between vectors, and the specific operation steps are:
(1) setting a microblog post counter with an initial value of 0; reading the vectorized microblog posts one by one; if the counter is 0, that is, the post is the first microblog post, directly creating a new topic class $C_l$ and increasing the counter by 1; otherwise going to the next step;
(2) after a newly added microblog post is read, first calculating its distance $d(D_i, C_l)$ to every existing topic centre, selecting the smallest of these distances, denoted $d_{max}(D_i, C_l)$, and judging whether $d_{max}(D_i, C_l) < \rho$ holds; if it holds, merging the new microblog post $D_i$ into topic class $C_l$ and recalculating the new topic centre point by the formula:
$$C_l^{new} = C_l^{old} + \mu_l \left( D_i - C_l^{old} \right), \tag{12}$$
if it does not hold, treating the microblog text $D_i$ as a new topic class; here $C_l$ denotes a topic centre point, $\rho$ denotes the distance threshold between microblog posts, and $\mu_l$ denotes a weighting parameter, where $k'_l$ is the number of microblog posts in cluster $l$; increasing the counter by 1;
(3) when the last microblog post has been reached, the algorithm terminates.
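The single-pass procedure of claim 7 can be sketched in Python as below. The expression for the weighting parameter is not reproduced in the text, so the sketch assumes $\mu_l = 1 / (k'_l + 1)$, which makes equation (12) a running mean of the posts in the cluster; this choice, the function name `incremental_cluster` and the returned `labels` list are assumptions for illustration only.

```python
import math


def incremental_cluster(vectors, rho):
    """Assign each vector to the nearest topic centre if closer than rho,
    otherwise open a new topic. Returns (labels, centers)."""
    centers, sizes, labels = [], [], []
    for v in vectors:
        # find the nearest existing topic centre (Euclidean distance)
        nearest, best = None, math.inf
        for idx, c in enumerate(centers):
            dist = math.dist(v, c)
            if dist < best:
                nearest, best = idx, dist
        if nearest is not None and best < rho:
            # merge the post and move the centre as in equation (12);
            # ASSUMPTION: mu = 1 / (posts already in the cluster + 1)
            sizes[nearest] += 1
            mu = 1.0 / sizes[nearest]
            centers[nearest] = [
                c + mu * (x - c) for c, x in zip(centers[nearest], v)
            ]
            labels.append(nearest)
        else:
            # the first post, or no centre within rho: start a new topic class
            centers.append([float(x) for x in v])
            sizes.append(1)
            labels.append(len(centers) - 1)
    return labels, centers
```

Each post is compared only with the current topic centres, so the data are read once and no restarts from different initial points are needed.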
8. The microblog topic detection method based on an incremental clustering algorithm according to claim 7, characterized in that: the calculation model of the value of $\rho$ is: (1) in the case where historical data exist,
$$\rho = \frac{\sqrt{m}}{n \times k} \sum_{i=1}^{n} \sum_{l=1}^{k} d(D_i, C_l), \tag{13}$$
$$d(D_i, C_l) = \sqrt{\sum_{j=1}^{m} \left( W_{ij} - W_{lj} \right)^2}; \tag{14}$$
(2) in the case where no historical data exist, $\rho$ satisfies:
$$\rho = 0.3 \times \frac{\sqrt{m}}{\sqrt{2}}; \tag{15}$$
where $m$ denotes the dimension of the microblog vector space, $n$ denotes the number of clusters already formed, and $k$ denotes the number of microblog posts contained in cluster centre $C_l$; $W_{ij}$ denotes the weight of each term in microblog document $D_i$, and $W_{lj}$ denotes the weight of the term in cluster $C_l$.
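The two ways of obtaining the threshold in claim 8 can be sketched in Python as follows. The function names are hypothetical, and the historical-data version follows the index ranges written in equation (13), that is, it averages the distance over every document and centre pair supplied; how $n$ and $k$ map onto the historical data is an assumption of this sketch.

```python
import math


def rho_default(m):
    """Equation (15): threshold when no historical data are available."""
    return 0.3 * math.sqrt(m) / math.sqrt(2)


def rho_from_history(docs, centers):
    """Equation (13): sqrt(m) times the mean Euclidean distance, equation (14),
    between historical document vectors and historical topic centres.
    ASSUMPTION: n = len(docs) and k = len(centers), as in the summation limits."""
    m = len(centers[0])
    n, k = len(docs), len(centers)
    total = sum(math.dist(d, c) for d in docs for c in centers)
    return math.sqrt(m) / (n * k) * total
```

For a two-dimensional vector space, `rho_default(2)` evaluates to approximately 0.3.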
9. A microblog topic detection system used by the microblog topic detection method based on an incremental clustering algorithm according to any one of claims 1 to 8, characterized in that it comprises the following modules:
(1) a microblog information collection module, for collecting in real time all new microblogs posted on the microblog platform within a time period;
(2) a microblog preprocessing module, for filtering noise data out of the microblog text;
(3) a feature word extraction module, for extracting feature words according to the word occurrence frequency and the word distribution;
(4) a microblog text vectorization module, for vectorizing the microblog text according to the feature words and their weights, in the form $D_i(t_1, W_1; t_2, W_2; \ldots; t_i, W_i)$, where $t_i$ denotes a feature item and $W_i$ denotes its weight;
(5) a microblog topic analysis and merging module, for judging similarity by the distance between vectors and then merging topics that are close to one another.
10. The microblog topic detection system according to claim 9, characterized in that: the microblog preprocessing module includes a user information processing module, a directed dialogue interaction information processing module and a word segmentation module; the user information processing module is used to delete the user information whose audience (follower) count is less than the threshold F; the directed dialogue interaction information processing module is used to ignore directed dialogue interaction information; the word segmentation module is used to segment the microblog text into words and to retain or delete words of specified parts of speech;
the feature word extraction module includes a word frequency processing module and a word distribution processing module; the word frequency processing module is used to count the occurrence frequency of words and can screen words by a specified frequency; the word distribution processing module is used to compute the distribution of words over the microblog texts or over time windows and can screen words by a specified distribution value.
CN201710473108.6A 2017-06-21 2017-06-21 A kind of microblog topic detecting method and system based on incremental clustering algorithm Pending CN107291886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710473108.6A CN107291886A (en) 2017-06-21 2017-06-21 A kind of microblog topic detecting method and system based on incremental clustering algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710473108.6A CN107291886A (en) 2017-06-21 2017-06-21 A kind of microblog topic detecting method and system based on incremental clustering algorithm

Publications (1)

Publication Number Publication Date
CN107291886A true CN107291886A (en) 2017-10-24

Family

ID=60097503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710473108.6A Pending CN107291886A (en) 2017-06-21 2017-06-21 A kind of microblog topic detecting method and system based on incremental clustering algorithm

Country Status (1)

Country Link
CN (1) CN107291886A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832467A (en) * 2017-11-29 2018-03-23 北京工业大学 A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN107832444A (en) * 2017-11-21 2018-03-23 北京百度网讯科技有限公司 Event based on search daily record finds method and device
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN109325117A (en) * 2018-08-24 2019-02-12 北京信息科技大学 Social security events detection method in a kind of microblogging of multiple features fusion
CN109857869A (en) * 2019-01-26 2019-06-07 北京工业大学 A kind of hot topic prediction technique based on Ap increment cluster and network primitive
CN110069703A (en) * 2019-03-19 2019-07-30 南京大学 A kind of microblog topic detecting method based on feature enhancing
CN110232158A (en) * 2019-05-06 2019-09-13 重庆大学 Burst occurred events of public safety detection method based on multi-modal data
CN110309260A (en) * 2018-03-20 2019-10-08 株式会社斯库林集团 Text mining method, text mining storage medium and text mining device
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium
CN112200197A (en) * 2020-11-10 2021-01-08 天津大学 Rumor detection method based on deep learning and multi-mode
CN112328795A (en) * 2020-11-13 2021-02-05 首都师范大学 Topic detection method and system based on key word element and computer storage medium
CN112527969A (en) * 2020-12-22 2021-03-19 上海浦东发展银行股份有限公司 Incremental intention clustering method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810280A (en) * 2014-02-19 2014-05-21 广西科技大学 Method for detecting microblog topics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
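Zhou Xianlin: "Microblog topic mining based on a dynamic Labeled_LDA model", China Master's Theses Full-text Database (Information Science and Technology) *
Zhang Xiaoming et al.: "Research on automatic topic detection based on incremental clustering", Journal of Software *
Wei Jingxuan: "Research on microblog bursty topic detection based on KL distance", China Master's Theses Full-text Database (Information Science and Technology) *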

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832444A (en) * 2017-11-21 2018-03-23 北京百度网讯科技有限公司 Event based on search daily record finds method and device
CN107832444B (en) * 2017-11-21 2021-08-13 北京百度网讯科技有限公司 Event discovery method and device based on search log
CN107832467A (en) * 2017-11-29 2018-03-23 北京工业大学 A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN108460019A (en) * 2018-02-28 2018-08-28 福州大学 A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN110309260A (en) * 2018-03-20 2019-10-08 株式会社斯库林集团 Text mining method, text mining storage medium and text mining device
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium
CN109325117A (en) * 2018-08-24 2019-02-12 北京信息科技大学 Social security events detection method in a kind of microblogging of multiple features fusion
CN109325117B (en) * 2018-08-24 2022-10-11 北京信息科技大学 Multi-feature fusion social security event detection method in microblog
CN109857869A (en) * 2019-01-26 2019-06-07 北京工业大学 A kind of hot topic prediction technique based on Ap increment cluster and network primitive
CN109857869B (en) * 2019-01-26 2021-07-30 北京工业大学 Ap incremental clustering and network element-based hot topic prediction method
CN110069703A (en) * 2019-03-19 2019-07-30 南京大学 A kind of microblog topic detecting method based on feature enhancing
CN110232158A (en) * 2019-05-06 2019-09-13 重庆大学 Burst occurred events of public safety detection method based on multi-modal data
CN112200197A (en) * 2020-11-10 2021-01-08 天津大学 Rumor detection method based on deep learning and multi-mode
CN112328795A (en) * 2020-11-13 2021-02-05 首都师范大学 Topic detection method and system based on key word element and computer storage medium
CN112527969A (en) * 2020-12-22 2021-03-19 上海浦东发展银行股份有限公司 Incremental intention clustering method, device, equipment and storage medium
CN112527969B (en) * 2020-12-22 2022-11-15 上海浦东发展银行股份有限公司 Incremental intention clustering method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107291886A (en) A kind of microblog topic detecting method and system based on incremental clustering algorithm
CN106980692B (en) Influence calculation method based on microblog specific events
CN106598944B (en) A kind of civil aviation's security public sentiment sentiment analysis method
CN103678670B (en) Micro-blog hot word and hot topic mining system and method
CN103559233B (en) Network neologisms abstracting method and microblog emotional analysis method and system in microblogging
CN103745000B (en) Hot topic detection method of Chinese micro-blogs
CN103324665B (en) Hot spot information extraction method and device based on micro-blog
CN104615608B (en) A kind of data mining processing system and method
CN106940732A (en) A kind of doubtful waterborne troops towards microblogging finds method
CN103500175B (en) A kind of method based on sentiment analysis on-line checking microblog hot event
CN106354845A (en) Microblog rumor recognizing method and system based on propagation structures
CN105653518A (en) Specific group discovery and expansion method based on microblog data
CN105488092A (en) Time-sensitive self-adaptive on-line subtopic detecting method and system
CN103116605A (en) Method and system of microblog hot events real-time detection based on detection subnet
CN107239512B (en) A kind of microblogging comment spam recognition methods of combination comment relational network figure
CN107609103A (en) It is a kind of based on push away spy event detecting method
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN109949174B (en) Heterogeneous social network user entity anchor link identification method
CN107305545A (en) A kind of recognition methods of the network opinion leader based on text tendency analysis
CN111949848B (en) Cross-platform propagation situation assessment and grading method based on specific events
CN109885675A (en) Method is found based on the text sub-topic for improving LDA
Yan et al. An improved single-pass algorithm for chinese microblog topic detection and tracking
CN109636682A (en) A kind of teaching resource auto-collection system
Yuan et al. A hybrid method for multi-class sentiment analysis of micro-blogs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171024