CN104715014B - A kind of online topic detecting method of news - Google Patents

A kind of online topic detecting method of news

Info

Publication number
CN104715014B
Authority
CN
China
Prior art keywords
topic
news
barycenter
cluster
initial classes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510039493.4A
Other languages
Chinese (zh)
Other versions
CN104715014A (en
Inventor
常会友
路永和
韦婷婷
胡勇军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201510039493.4A
Publication of CN104715014A
Application granted
Publication of CN104715014B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses an online topic detection method for news, belonging to the field of computer science and technology. Aimed at Internet text that requires topic detection, it proposes a more effective topic detection method. A clustering buffer is built so that the texts arriving within a given quantity or period are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classical Single-Pass algorithm on every evaluation index and identifies the topics to be detected more accurately.

Description

A kind of online topic detecting method of news
Technical field
The present invention relates to the field of computer science and technology, and more particularly to an online topic detection method for Internet news.
Background art
Topic detection (TD) is one of the five basic research tasks of Topic Detection and Tracking (TDT); its main purpose is to detect and organize topics that are not known to the system in advance. The TDT project was funded by the U.S. Defense Advanced Research Projects Agency (DARPA), with the University of Massachusetts, Carnegie Mellon University and Dragon Systems participating jointly. The project mainly performs automated analysis of a continuous stream of news media information, detects the topics present in it, and tracks the topics that have been detected. Research on topic detection grew out of the TDT project. For the topic detection task, the Single-Pass algorithm is widely used. Single-Pass is an incremental clustering algorithm that mainly performs cluster analysis on a text stream. The algorithm computes, in turn, the similarity between each arriving text and the centroids of the existing clusters. If the maximum similarity is greater than or equal to a given threshold, the text is merged into the most similar cluster and that cluster's centroid is recalculated; if the maximum similarity is below the threshold, a new cluster is created and the text is placed in it.
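The Single-Pass procedure described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the sparse dict representation, the use of cosine similarity, the running-mean centroid and the names cosine and single_pass are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold):
    """Cluster docs (a list of dict vectors) strictly in arrival order."""
    clusters = []
    for idx, doc in enumerate(docs):
        # Find the existing cluster whose centroid is most similar.
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(doc, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            n = len(best["members"])
            # Assumption: the centroid is the running mean of its members.
            terms = set(best["centroid"]) | set(doc)
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * (n - 1) + doc.get(t, 0.0)) / n
                for t in terms
            }
        else:
            # No cluster is similar enough: start a new one.
            clusters.append({"members": [idx], "centroid": dict(doc)})
    return clusters
```

Because every text is compared only with the centroids built from earlier texts, feeding the same documents in a different order can produce different clusters.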
Hong Yu et al. (Hong Yu, Zhang Yu, Liu Ting. Evaluation and research review of topic detection and tracking [J]. Journal of Chinese Information Processing, 2007, (06): 71-87.) reviewed the evaluation of and research on topic detection and tracking, describing its main tasks and key technologies as well as its main corpora and evaluation methods. Jia Ziyan et al. (Jia Ziyan, He Qing, Zhang Junhai, et al. An event detection and tracking algorithm based on a dynamic evolution model [J]. Journal of Computer Research and Development, 2004, 41(7): 1273-1280.) recognized named entities such as person names and place names in the text, gave them different weights according to their categories, and designed an event detection and tracking algorithm with a dynamic evolution model by drawing on the Single-Pass clustering idea. Zhang Kuo et al. (Zhang K, Zi J, Wu LG. New event detection based on indexing-tree and named entity [C]. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007: 215-22.) used statistics based on the χ² distribution to measure the association between each entity type in the corpus and each topic category, and assigned each feature a different weight according to that association. Zhao Hua et al. (Zhao Hua, Zhao Tiejun, et al. Research on topic detection oriented to dynamic evolution [J]. High Technology Letters, 2006, 12(16): 1230-1235.) took the time factor into account and identified the boundaries of topic evolution. Jin Zhu et al. (Jin Zhu, Lin Hongfei, Zhao Jing. Research on topic tracking and tendency classification based on HowNet [J]. Journal of the China Society for Scientific and Technical Information, 2005, 5(24): 555-561.) built a structured topic model with HowNet and described topics from different aspects.
In summary, existing work mainly finds topics by clustering with the Single-Pass algorithm. Single-Pass is a classical incremental clustering algorithm: it reads one news item at a time, in order of arrival, and performs incremental cluster analysis. This brings a problem: in the dynamic clustering stage no other text is available as a reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read in, which harms the clustering result. Moreover, when merging texts into topics, Single-Pass assigns each text to a topic according to a single predefined threshold, which easily leads to topic drift.
Summary of the invention
The present invention aims to propose a more effective online topic detection method for news. A clustering buffer is introduced so that the texts arriving within a given quantity or period are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classical Single-Pass algorithm on every evaluation index and identifies news topics related to a given field more accurately.
To achieve the above object, the technical solution of the present invention is as follows:
An online topic detection method for news, used to detect news topics online, specifically comprising:
Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.
Stage one, initial static clustering:
S1. Preprocessing: in order of publication time, read the news published within a unit time interval, or a unit quantity of news, preprocess these news texts, and vectorize them;
S2. Perform initial static clustering on the newly read news texts with the X-means algorithm;
S3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
S4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity with each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic;
S5. If the maximum similarity is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to the topic centroid list CentroidList_Topic as the centroid of the new topic. If the maximum similarity is greater than or equal to lowTX, merge the initial class into the most similar topic. If the maximum similarity is also greater than or equal to the topic centroid update threshold highTX, update the centroid of the merged topic;
S6. Delete the initial class and its centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster respectively;
S7. Return to step S4 until the initial class centroid list CentroidList_Cluster is empty, which completes one round of cluster analysis;
S8. Wait for the news texts of the next unit time interval or unit quantity, then return to step S1.
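Steps S4 to S7 can be sketched as follows. This is an illustrative reading of the dual-threshold rule, not the patented implementation: the dict-based vectors, the size-weighted mean used to update a topic centroid, and the names cosine and dynamic_stage are assumptions, since the text does not say how the updated centroid is computed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_stage(initial_classes, topics, low_tx, high_tx):
    """Merge initial classes into topics under two thresholds.

    initial_classes: list of (member_ids, centroid) pairs from stage one.
    topics: list of dicts with 'members' and 'centroid'; updated in place.
    """
    while initial_classes:                            # S7: until the list is empty
        members, centroid = initial_classes.pop(0)    # S4 and S6: take and remove
        best, best_sim = None, -1.0
        for topic in topics:
            sim = cosine(centroid, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is None or best_sim < low_tx:         # S5: below lowTX, new topic
            topics.append({"members": list(members), "centroid": dict(centroid)})
            continue
        old_n = len(best["members"])
        best["members"].extend(members)               # S5: at least lowTX, merge
        if best_sim >= high_tx:                       # S5: at least highTX, update
            new_n = len(best["members"])
            terms = set(best["centroid"]) | set(centroid)
            # Assumption: size-weighted mean of the old and incoming centroids.
            best["centroid"] = {
                t: (best["centroid"].get(t, 0.0) * old_n
                    + centroid.get(t, 0.0) * len(members)) / new_n
                for t in terms
            }
    return topics
```

An initial class whose similarity falls between lowTX and highTX joins the topic without moving its centroid, which is how the scheme limits topic drift.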
Preferably, the preprocessing in step S1 is as follows: after word segmentation and part-of-speech tagging of the news published within the unit time interval (or of the unit quantity of news), news elements and domain named entities are extracted and assigned different weights W1i according to the positions in which they occur, where i denotes the i-th item in the set of news elements and domain named entities. The weight W1i is multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, giving the weight of each news element or domain named entity in the text, and the news text is vectorized. Here a domain named entity is a term specific to the domain in which topics are to be detected, and news elements are generally the source, time, place, persons and events of the news.
Preferably, the preprocessing in step S1 further comprises identifying the news elements or domain named entities that occur in both the headline and the body and assigning them a further weight W2i. The weights W1i and W2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, giving the weight of each news element or domain named entity in the text, and the news text is vectorized, where W1i<W2i.
Preferably, the preprocessing in step S1 further comprises feature selection: all weights W1i and W2i are sorted by magnitude, the T features with the largest weights are selected, and the weights of these T features are multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, giving the weight of each news element or domain named entity in the text, and the news text is vectorized.
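The weighting and feature selection described in the three preceding paragraphs can be sketched as follows. The sketch simplifies the scheme: it uses one scalar w1 and one scalar w2 instead of per-feature weights W1i and W2i, it assumes that recognized features are already available as lists of strings, and the function name vectorize and the alphabetical tie-break are hypothetical.

```python
from collections import Counter

def vectorize(title_feats, body_feats, w1, w2, top_t):
    """Build a weighted feature vector for one news item.

    title_feats / body_feats: recognized news elements and domain named
    entities found in the headline and in the body (assumed inputs).
    w1: base weight; w2: weight for features in both headline and body.
    """
    counts = Counter(title_feats) + Counter(body_feats)
    in_both = set(title_feats) & set(body_feats)
    # Features seen in both headline and body get the larger weight w2.
    weights = {f: (w2 if f in in_both else w1) for f in counts}
    # Keep the top_t features by weight (ties broken alphabetically).
    kept = sorted(weights, key=lambda f: (-weights[f], f))[:top_t]
    # Final weight = position weight x number of occurrences.
    return {f: weights[f] * counts[f] for f in kept}
```

For a headline containing "gutter oil" and a body containing "gutter oil" twice plus "Beijing" and "Zhang", calling vectorize with w1=1.0, w2=2.0 and top_t=2 keeps "gutter oil" with weight 6.0 and "Beijing" with weight 1.0.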
Preferably, in step S4 the similarity between an initial class centroid and a topic centroid is computed as the cosine of the angle between the two centroid vectors; the larger the cosine, the greater the similarity.
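Written out, this is the usual cosine measure. The symbols below are introduced only for illustration and do not appear in the source: c is the initial class centroid, t is the topic centroid, and m indexes the vector components.

```latex
\operatorname{sim}(c, t) = \cos\theta
  = \frac{\sum_{m} c_m t_m}{\sqrt{\sum_{m} c_m^{2}}\;\sqrt{\sum_{m} t_m^{2}}}
```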
Compared with the prior art, the technical solution of the present invention has the following beneficial effects:
(1) In traditional topic detection methods, text preprocessing is merely generic: texts of a specific field or form are not treated differently, so their particular properties are not exploited and the semantic representation of the text is incomplete. The present invention preprocesses news texts by extracting the essential elements of news (time, place, event, person) and the terms specific to the domain (i.e. domain named entities), and assigns each feature in the news (news element or domain named entity) a weight according to its importance, which improves the accuracy of feature extraction and effectively improves the quality of text preprocessing.
(2) For topic and event detection, traditional clustering methods are prone to topic drift and poor clustering results. The traditional Single-Pass algorithm is a classical incremental clustering algorithm that reads one news item at a time, in order of arrival, and performs incremental cluster analysis. This brings a problem: in the dynamic clustering stage no other text is available as a reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read in, which harms the clustering result. Moreover, when merging texts into topics, Single-Pass assigns each text to a topic according to a single predefined threshold, which easily leads to topic drift. The present invention proposes a more effective topic detection method: a clustering buffer is introduced so that the texts arriving within a given quantity or period are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classical Single-Pass algorithm on every evaluation index and identifies food safety related topics more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of online topic detection based on XMSP two-stage clustering according to the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the accompanying drawing, taking topic detection of food safety news as an example.
As shown in Fig. 1, the online topic detection method for news begins with the preprocessing of the news texts: after word segmentation and part-of-speech tagging, the news elements and the domain named entities related to the detection domain are recognized and assigned corresponding weights; after feature selection and weighting, each text is represented as a vector and passed to the clustering stage. In the clustering stage, initial static clustering is performed first, and the proposed method is then used for dynamic secondary clustering. The final output is a set of news topics, each consisting of one or more news items.
The specific steps of the XMSP two-stage clustering algorithm, an improvement on the Single-Pass algorithm, are as follows:
Initialization: the X-means parameters maxNumClusters and minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.
Input: the set of news texts within a unit time interval or of a unit quantity
Output: the topic set (TopicSet)
Stage one, initial static clustering:
Step1. In order of publication time, read the news published within a unit time interval (one day), or a unit quantity of news, preprocess these news texts, and convert them into vectors in the vector space;
Step2. Perform initial static clustering on this batch of newly read news texts with the X-means algorithm;
Step3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
Step4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity with each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic.
Step5. If the maximum similarity is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to CentroidList_Topic as the centroid of the new topic.
Step6. If the maximum similarity is greater than or equal to the topic aggregation threshold lowTX, merge the initial class into the most similar topic.
Step7. If the maximum similarity is greater than or equal to the topic centroid update threshold highTX, update the centroid of the merged topic.
Step8. Delete the initial class and its centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster respectively.
Step9. Return to Step4 until the initial class centroid list CentroidList_Cluster is empty.
Step10. One round of cluster analysis is complete; wait for the news texts of the next unit time interval (or unit quantity) to arrive, then return to Step1.
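Step3 can be sketched as follows, assuming that an X-means implementation has already assigned a cluster label to each vector in the batch (the clustering call itself is omitted here). The function name build_initial_classes, the dict-based vectors and the use of the arithmetic mean as the centroid are assumptions for illustration.

```python
def build_initial_classes(batch_vectors, labels):
    """Group one buffered batch by cluster label and compute centroids.

    batch_vectors: list of sparse dict vectors for the buffered news.
    labels: one cluster label per vector, e.g. from an X-means run.
    Returns a list of (member_indices, centroid) pairs.
    """
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    classes = []
    for label in sorted(groups):
        members = groups[label]
        # Every term that occurs in any member of this initial class.
        terms = set().union(*(batch_vectors[i] for i in members))
        # Assumption: the centroid is the mean of the member vectors.
        centroid = {
            t: sum(batch_vectors[i].get(t, 0.0) for i in members) / len(members)
            for t in terms
        }
        classes.append((members, centroid))
    return classes
```

Each returned pair corresponds to one entry of ClusterSet together with its entry in CentroidList_Cluster.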
In this embodiment, the topic detection method of the present invention is applied to food safety topic detection, as follows:
(1) News elements and named entities related to food safety are extracted and weighted differently according to where they occur, which increases the weight of features related to the subject of the text, reduces the weight of unrelated features, and improves the accuracy of the text representation. The specific steps are:
1) The Chinese Academy of Sciences word segmentation system (ICTCLAS) is used together with a food safety named entity library (containing common food names and the names of harmful substances that threaten food safety) to recognize the person names, place names, organization names and food safety named entities appearing in the body, and each is given a weight W1i.
2) Named entities that occur both in the headline and in the body are recognized and given a further weight W2i.
3) W1i and W2i are multiplied by the term frequency (the number of times the word occurs in the text) to compute the weight of each word in the text.
(2) The present invention makes the Single-Pass algorithm more flexible and proposes the XMSP two-stage clustering algorithm. By introducing a clustering buffer based on the X-means algorithm, the algorithm relaxes the Single-Pass practice of processing one text at a time, and it uses two thresholds to control topic aggregation and topic centroid updating separately, which makes threshold setting more flexible. The XMSP two-stage clustering algorithm comprises two cluster analysis stages: the initial static clustering stage and the dynamic clustering stage.
1) The initial static clustering stage performs cluster analysis on a static set of news texts with the X-means algorithm. A news buffer is set up so that the news of a unit time interval or unit quantity first undergoes one round of initial static clustering, instead of being clustered one item at a time, which effectively reduces the influence of arrival order on the cluster analysis.
2) The dynamic clustering stage still borrows the idea of the Single-Pass algorithm but differs in its centroid adjustment strategy: the XMSP dynamic clustering stage introduces two similarity thresholds, the topic aggregation threshold lowTX and the topic centroid update threshold highTX (where lowTX<highTX). Initial static clustering yields a number of initial classes, whose centroids are then compared with the centroids of the existing topics. When the maximum similarity between an initial class and the topic centroids is greater than the topic aggregation threshold lowTX, the initial class is merged into the most similar topic. This does not imply that the topic centroid is updated: the centroid of the merged topic is updated only when the similarity is greater than the topic centroid update threshold highTX.
Embodiment 1
(1) Case study corpus
News reports on food safety were collected from major Internet news media between November 2012 and January 2013. The corpus consists of 1034 news items on food safety covering 11 topics: gutter oil, golden rice, Jiugui liquor, fast-growing chicken, clenbuterol, Mead Johnson milk powder, toxic bean sprouts, Bright Dairy milk, gelatin shark fin, carcinogenic edible oil, and tap water safety. The number of news items and the time distribution of each topic are shown in Table 1.
Table 1. Food safety topic corpus
(2) Evaluation method
System performance is evaluated with a supervised measure, i.e. the degree to which cluster labels agree with topic labels. A cluster label is the label the system assigns to a news item through cluster analysis; a topic label is the label assigned manually according to the true category of the news item. The correspondence between clusters and topics is established by counting the articles of each topic in each cluster: for example, if topic j is the topic with the most articles in cluster i, cluster i is labelled as topic j.
Here nij is the number of news items of topic j that the system placed in cluster i, ni is the number of articles in cluster i, nj is the number of articles in topic j, and n is the total number of articles. Through this correspondence the miss rate PMiss(i, j) and false alarm rate PFa(i, j) of each topic are obtained, and averaging over all topics gives the final miss rate PMiss and false alarm rate PFa of the system. The average miss rate PMiss and average false alarm rate PFa are calculated as follows:
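The formulas themselves are not present in this text. A reconstruction consistent with the symbol definitions above, using the definitions commonly applied in topic detection evaluation, would be as follows; the per-topic forms, the topic count k and the notation i_j for the cluster matched to topic j are assumptions, not taken from the source.

```latex
P_{Miss}(i,j) = \frac{n_j - n_{ij}}{n_j}, \qquad
P_{Fa}(i,j) = \frac{n_i - n_{ij}}{n - n_j}

P_{Miss} = \frac{1}{k}\sum_{j=1}^{k} P_{Miss}(i_j, j), \qquad
P_{Fa} = \frac{1}{k}\sum_{j=1}^{k} P_{Fa}(i_j, j)
```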
Distance from origin: this index is a combined assessment of the miss rate and the false alarm rate; the lower its value, the better the system performs. It is calculated as follows:
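The formula is not present in this text. The natural reading of the name, assumed here, is the Euclidean distance of the point (PMiss, PFa) from the origin; the symbol D is introduced only for illustration.

```latex
D = \sqrt{P_{Miss}^{2} + P_{Fa}^{2}}
```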
Besides the miss rate and false alarm rate used in topic detection and tracking, traditional evaluation indices for clustering algorithms include precision Precision(i, j), recall Recall(i, j) and the F-measure F(i, j). These three indices are calculated as follows:
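The formulas are not present in this text. The standard definitions, which match the symbols nij, ni and nj used in this section, are assumed to be the ones intended:

```latex
\mathrm{Precision}(i,j) = \frac{n_{ij}}{n_i}, \qquad
\mathrm{Recall}(i,j) = \frac{n_{ij}}{n_j}

F(i,j) = \frac{2\,\mathrm{Precision}(i,j)\,\mathrm{Recall}(i,j)}
              {\mathrm{Precision}(i,j) + \mathrm{Recall}(i,j)}
```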
Here nij is the number of articles in cluster i that belong to class j, ni is the number of articles in cluster i, and nj is the number of articles in class j.
As with the miss rate and false alarm rate, the correspondence between clusters and classes is established, the precision, recall and F-measure of each class are computed, and the per-class values are averaged to give the three evaluation indices of the system, calculated as follows [31]:
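The formulas are not present in this text. Assuming a plain unweighted mean over the k classes (a mean weighted by class size nj/n is another common convention and cannot be ruled out), they would read:

```latex
P = \frac{1}{k}\sum_{j=1}^{k} P_j, \qquad
R = \frac{1}{k}\sum_{j=1}^{k} R_j, \qquad
F = \frac{1}{k}\sum_{j=1}^{k} F_j
```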
Here k is the number of topics, Rj is the recall for class j, Pj is the precision for class j, and Fj is the F-measure for class j.
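The evaluation indices of this section can be computed from a cluster-by-topic count table as sketched below. The sketch rests on several assumptions that the text does not confirm: it uses the common definitions of miss rate, false alarm rate, precision, recall and F-measure, it matches each topic to the cluster that holds most of its articles, it averages without weighting, and the function name evaluate is hypothetical.

```python
import math

def evaluate(n):
    """Average detection metrics from a count table.

    n[i][j] is the number of articles of topic j placed in cluster i.
    """
    k = len(n[0])                                    # number of topics
    total = sum(sum(row) for row in n)               # n: all articles
    cluster_sizes = [sum(row) for row in n]          # n_i
    topic_sizes = [sum(row[j] for row in n) for j in range(k)]  # n_j
    sums = {"miss": 0.0, "fa": 0.0, "precision": 0.0, "recall": 0.0, "f": 0.0}
    for j in range(k):
        # Assumption: topic j is matched to the cluster holding most of it.
        i = max(range(len(n)), key=lambda c: n[c][j])
        nij, ni, nj = n[i][j], cluster_sizes[i], topic_sizes[j]
        p = nij / ni if ni else 0.0
        r = nij / nj if nj else 0.0
        sums["miss"] += 1.0 - r                      # (n_j - n_ij) / n_j
        sums["fa"] += (ni - nij) / (total - nj) if total > nj else 0.0
        sums["precision"] += p
        sums["recall"] += r
        sums["f"] += 2 * p * r / (p + r) if p + r else 0.0
    result = {name: value / k for name, value in sums.items()}
    # Distance from origin combines miss rate and false alarm rate.
    result["distance"] = math.hypot(result["miss"], result["fa"])
    return result
```

For the table [[8, 1], [2, 9]] (two clusters, two topics of ten articles each) this gives an average miss rate of 0.15 and an average false alarm rate of 0.15.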
(3) Analysis of results
1) Experimental comparison of text preprocessing methods for topic detection: a generic preprocessing strategy (word segmentation, stop-word removal, tf-idf weighting) is compared with preprocessing that adds recognition and weighting of news elements and food safety named entities. With the XMSP clustering method described here, the topic detection results are as follows:
Table 2. XMSP results before and after adding recognition of news elements and named entities
2) Effect of introducing the clustering buffer: to control for the influence of the topic centroid update threshold (highTX) on this experiment, highTX and the topic aggregation threshold (lowTX) in the XMSP algorithm are set to the same value. In the topic aggregation stage the Single-Pass algorithm merges one text at a time, so its granularity is small, and its threshold TX ranges over 0.02~0.24. At TX=0.24 cluster analysis already yields 146 clusters; larger values of TX would only produce more clusters, so they are not reported. The XMSP algorithm differs from Single-Pass in that it merges one initial class at a time with the existing topics, so its granularity is larger, and it achieves its best results when lowTX (with lowTX=highTX) is 0.33~0.35. Table 3 compares the indices of the two algorithms at their best settings. This confirms the effectiveness of adding the clustering buffer to the XMSP algorithm.
Table 3. Best results of the XMSP algorithm with the clustering buffer and of the Single-Pass algorithm
3) Effect of introducing the topic centroid update threshold:
In this experiment the topic aggregation threshold lowTX in the XMSP algorithm is first fixed at a given value, and the topic centroid update threshold highTX is then varied to observe how introducing this threshold affects the XMSP algorithm. Tables 4 and 5 show the XMSP results for different highTX values at lowTX=0.33 and lowTX=0.42 respectively.
Table 4. XMSP algorithm results (lowTX=0.33)
Table 5. XMSP algorithm results (lowTX=0.42)
Tables 4 and 5 show that introducing the topic centroid update threshold highTX and tuning its value lets the XMSP algorithm achieve better results. With lowTX=0.33, XMSP performs best when highTX is 0.45~0.47; with lowTX=0.42, it performs best when highTX is 0.46~0.52. This also demonstrates the effectiveness of introducing the topic centroid update threshold.
This embodiment preprocesses Internet news texts related to food safety by extracting the essential elements of news (time, place, event, person) and assigning weights according to importance, which improves the accuracy of feature extraction. The most prominent characteristic of food safety news is that its content centres on food safety: the keywords of a food safety news item consist mainly of terms specific to the food safety field (food safety named entities), and these named entities play a very important role in expressing the information the news contains. Recognizing the food safety named entities in the news and assigning them corresponding weights therefore improves the quality of text preprocessing more effectively.
(2) For topic and event detection, traditional clustering methods are prone to topic drift and poor clustering results. The traditional Single-Pass algorithm is a classical incremental clustering algorithm that reads one news item at a time, in order of arrival, and performs incremental cluster analysis. This brings a problem: in the dynamic clustering stage no other text is available as a reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read in, which harms the clustering result. Moreover, when merging texts into topics, Single-Pass assigns each text to a topic according to a single predefined threshold, which easily leads to topic drift. The topic detection method of this embodiment introduces a clustering buffer so that the texts arriving within a given quantity or period are first clustered with the X-means algorithm, and introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold), which effectively controls topic drift and improves clustering quality. The method outperforms the classical Single-Pass algorithm on every evaluation index and identifies food safety related topics more accurately.

Claims (5)

1. An online news topic detection method, characterized in that it specifically comprises:
Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic;
Stage one, initial static clustering:
S1. Preprocessing: in the chronological order of news publication, read the news published within a unit time or a unit quantity of news, preprocess these news texts, and vectorize them;
S2. Apply the X-means algorithm to the newly read news texts to perform initial static clustering;
S3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
S4. Take one initial class centroid out of the initial class centroid list CentroidList_Cluster, compute its similarity to each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity value and the corresponding topic;
S5. When the maximum similarity value is less than the topic aggregation threshold lowTX, create a new topic, aggregate the initial class into the new topic, and add the centroid of the initial class to the topic centroid list CentroidList_Topic as the centroid of the new topic; when the similarity value is greater than or equal to the topic aggregation threshold lowTX, aggregate the initial class into the topic with the maximum similarity; when the similarity value is greater than or equal to the topic centroid update threshold highTX, update the centroid of the topic after aggregation;
S6. Delete the initial class and its corresponding class centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster, respectively;
S7. Go to step S4 until the initial class centroid list CentroidList_Cluster is empty, which completes one round of clustering;
S8. Wait for the news texts of the next unit time or unit quantity, then go to step S1.
2. The online news topic detection method according to claim 1, characterized in that the preprocessing of step S1 specifically comprises: performing word segmentation and part-of-speech tagging on the news published within a unit time or on the unit quantity of news; extracting news elements and domain named entities; assigning different weights W1i according to the positions at which the news elements and domain named entities occur, where i denotes the i-th item in the set of news elements and domain named entities; multiplying the weight W1i by the number of times the corresponding news element or domain named entity occurs in the online news text, to obtain the weight of each news element or domain named entity in the text; and vectorizing the news text; wherein a domain named entity is a term specific to the domain in which topics are to be detected.
3. The online news topic detection method according to claim 2, characterized in that the preprocessing of step S1 further comprises: identifying the news elements or domain named entities that occur in both the news headline and the body text, and assigning them a further weight W2i; the weights W1i and W2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the online news text, to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized, where W1i < W2i.
4. The online news topic detection method according to claim 3, characterized in that the preprocessing of step S1 further comprises feature selection: sorting all weights W1i and W2i by magnitude, selecting the T features with the largest weights, and multiplying the weights of these T features by the number of times the corresponding news element or domain named entity occurs in the online news text, to obtain the weight of each news element or domain named entity in the text, and vectorizing the news text.
5. The online news topic detection method according to claim 4, characterized in that in step S4 the similarity between an initial class centroid and a topic centroid is computed as the cosine of the angle between the two vectors of the initial class centroid and the topic centroid; the larger the cosine value, the greater the similarity.
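
The weighting, feature selection and cosine similarity of claims 2 to 5 can be sketched as follows. This is an illustrative reading only, not the patented implementation: the function names `vectorize` and `cosine`, the dictionaries `w1` and `w2` (term to weight), the alphabetical tie-breaking rule, and the sparse dictionary representation of a vector are all assumptions, and word segmentation and entity extraction are assumed to have been done beforehand.

```python
import math
from collections import Counter

def vectorize(title_terms, body_terms, w1, w2, top_t):
    """Build a sparse vector {term: weight} for one news text.

    w1[term] is the position-based weight W1i; w2[term] is the larger weight
    W2i used when the term occurs in both the headline and the body.
    Both mappings are hypothetical inputs supplied by the caller.
    """
    counts = Counter(title_terms) + Counter(body_terms)
    in_both = set(title_terms) & set(body_terms)
    # Pick W2i for terms seen in headline and body, W1i otherwise.
    term_weight = {t: (w2[t] if t in in_both else w1[t]) for t in counts}
    # Feature selection: keep the T terms with the largest weights
    # (ties broken alphabetically so the result is deterministic).
    selected = sorted(term_weight, key=lambda t: (-term_weight[t], t))[:top_t]
    # Final weight = selected weight times the number of occurrences.
    return {t: term_weight[t] * counts[t] for t in selected}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; larger means more similar."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A sparse dictionary is used here because each news text contains only a few of the possible features; any dense representation with a fixed feature order would give the same cosine values.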
CN201510039493.4A 2015-01-26 2015-01-26 A kind of online topic detecting method of news Expired - Fee Related CN104715014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510039493.4A CN104715014B (en) 2015-01-26 2015-01-26 A kind of online topic detecting method of news

Publications (2)

Publication Number Publication Date
CN104715014A CN104715014A (en) 2015-06-17
CN104715014B CN104715014B (en) 2017-10-10

Family

ID=53414341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510039493.4A Expired - Fee Related CN104715014B (en) 2015-01-26 2015-01-26 A kind of online topic detecting method of news

Country Status (1)

Country Link
CN (1) CN104715014B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488092B (en) * 2015-07-13 2018-05-22 中国科学院信息工程研究所 A kind of time-sensitive and adaptive sub-topic online test method and system
CN105468669B (en) * 2015-10-13 2019-05-21 中国科学院信息工程研究所 A kind of adaptive microblog topic method for tracing merging customer relationship
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN108021579B (en) * 2016-10-28 2021-10-15 上海优扬新媒信息技术有限公司 Information output method and device
CN108062319A (en) * 2016-11-08 2018-05-22 北京国双科技有限公司 A kind of real-time detection method and device of new theme
CN108170671A (en) * 2017-12-19 2018-06-15 中山大学 A kind of method for extracting media event time of origin
CN108334628A (en) * 2018-02-23 2018-07-27 北京东润环能科技股份有限公司 A kind of method, apparatus, equipment and the storage medium of media event cluster
CN109657164B (en) * 2018-12-25 2020-07-10 广州华多网络科技有限公司 Method, device and storage medium for publishing message
CN109902230A (en) * 2019-02-13 2019-06-18 北京航空航天大学 A kind of processing method and processing device of news data
CN110866555A (en) * 2019-11-11 2020-03-06 广州国音智能科技有限公司 Incremental data clustering method, device and equipment and readable storage medium
CN111324725B (en) * 2020-02-17 2023-05-16 昆明理工大学 Topic acquisition method, terminal and computer readable storage medium
CN113806528A (en) * 2021-07-07 2021-12-17 哈尔滨工业大学(威海) Topic detection method and device based on BERT model and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488150A (en) * 2009-03-04 2009-07-22 哈尔滨工程大学 Real-time multi-view network focus event analysis apparatus and analysis method
CN101571853A (en) * 2009-05-22 2009-11-04 哈尔滨工程大学 Evolution analysis device and method for contents of network topics
EP2397985A1 (en) * 2010-06-15 2011-12-21 Honeywell International Inc. System for multi-modal data mining and organization via elements, clustering and refinement
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103745000A (en) * 2014-01-24 2014-04-23 福州大学 Hot topic detection method of Chinese micro-blogs

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260586A1 (en) * 2006-05-03 2007-11-08 Antonio Savona Systems and methods for selecting and organizing information using temporal clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
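An incremental text clustering algorithm for network topic discovery (一种面向网络话题发现的增量文本聚类算法); Yin Fengjing (殷风景) et al.; Application Research of Computers (《计算机应用研究》); 2011-01-31; Vol. 28, No. 1; pp. 54-57 *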

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171010
