CN104715014B

CN104715014B - An online topic detection method for news

Info

Publication number: CN104715014B
Application number: CN201510039493.4A
Authority: CN
Inventors: 常会友; 路永和; 韦婷婷; 胡勇军
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2015-01-26
Filing date: 2015-01-26
Publication date: 2017-10-10
Anticipated expiration: 2035-01-26
Also published as: CN104715014A

Abstract

The present invention discloses an online topic detection method for news, belonging to the field of computer science and technology. Aimed at Internet text that requires topic detection, it proposes a more effective topic detection method. The method builds a clustering buffer so that the texts arriving in a given quantity or within a given period first undergo an initial clustering with the X-means algorithm, and it introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold), which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation index and identifies the topics to be detected more accurately.

Description

An online topic detection method for news

Technical field

The present invention relates to the field of computer science and technology, and more particularly to an online topic detection method for Internet news.

Background art

Topic detection (TD) is one of the five basic research tasks of Topic Detection and Tracking (TDT); its main purpose is to detect and organize topics that are not known to the system in advance. The TDT project was funded by the U.S. Defense Advanced Research Projects Agency (DARPA), with the University of Massachusetts, Carnegie Mellon University and Dragon Systems participating jointly. The project mainly performs automated analysis of continuous news media streams, detects the topics present in them, and tracks the topics already detected. Research on topic detection developed against the background of the TDT project. For the topic detection task, the Single-Pass algorithm is widely used. Single-Pass is an incremental clustering algorithm that mainly performs cluster analysis on a text stream. The algorithm computes, in order of arrival, the similarity between each incoming text and the centroids of the existing clusters. If the maximum similarity is greater than or equal to a given threshold, the text is merged into the most similar cluster and that cluster's centroid is recalculated; if the maximum similarity is below the threshold, a new cluster is created and the text is placed in it.
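The Single-Pass procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the dense-list vector format, the use of cosine similarity, and the choice of the running mean as the cluster centroid are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def single_pass(vectors, threshold):
    """Cluster vectors one at a time, in arrival order.

    Each cluster is a dict holding member indices and a centroid
    (assumed here to be the mean of the member vectors).
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            n = len(best["members"])
            # Incremental mean: new = old + (x - old) / n
            best["centroid"] = [c + (x - c) / n
                                for c, x in zip(best["centroid"], vec)]
        else:
            # No sufficiently similar cluster: open a new one.
            clusters.append({"members": [idx], "centroid": list(vec)})
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = single_pass(docs, threshold=0.8)
print([c["members"] for c in result])
```

Because every text is compared only with the centroids that exist at the moment it arrives, feeding the same texts in a different order can yield different centroids.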

Hong Yu et al. (Hong Yu, Zhang Yu, Liu Ting. Evaluation and research review of topic detection and tracking [J]. Journal of Chinese Information Processing, 2007, (06): 71-87.) reviewed the evaluation of and research on topic detection and tracking, describing its main tasks and key techniques as well as its main corpora and evaluation methods. Jia Ziyan et al. (Jia Ziyan, He Qing, Zhang Junhai, et al. An event detection and tracking algorithm based on a dynamic evolution model [J]. Journal of Computer Research and Development, 2004, 41(7): 1273-1280.) recognized named entities such as person names and place names appearing in the text, assigned them different weights according to their categories, and finally designed an event detection and tracking algorithm with a dynamic evolution model that draws on the Single-Pass clustering idea. Zhang Kuo (Zhang K, Zi J, Wu LG. New event detection based on indexing-tree and named entity [C]. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007: 215-22.) used the χ² statistic to measure the association between each entity category and each topic category in the corpus, and assigned each feature a different weight according to that association. Zhao Hua (Zhao Hua, Zhao Tiejun, et al. Research on topic detection oriented to dynamic evolution [J]. High Technology Letters, 2006, 12(16): 1230-1235.) took the temporal factor into account and identified the boundaries of topic evolution. Jin Zhu (Jin Zhu, Lin Hongfei, Zhao Jing. Research on topic tracking and tendency classification based on HowNet [J]. Journal of the China Society for Scientific and Technical Information, 2005, 5(24): 555-561.) used HowNet to build a structured topic model that describes a topic from different aspects.

In summary, existing research mainly uses the Single-Pass algorithm to cluster texts and thereby find topics. Single-Pass is a classic incremental clustering algorithm that follows the time sequence in which news arrives and reads one news item at a time for incremental cluster analysis. This processing, however, raises a problem: in the dynamic clustering stage there are no other texts to serve as reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read, which harms clustering quality. Moreover, when merging texts into topics, Single-Pass assigns a text to a topic according to a single, predefined threshold, which easily causes topic drift.

Summary of the invention

The present invention aims to provide a more effective online topic detection method for news. It introduces a clustering buffer so that the texts arriving in a given quantity or within a given period first undergo an initial clustering with the X-means algorithm, and it introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold), which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation index and can identify news topics in a given field more accurately.

To achieve the above object, the technical solution of the present invention is as follows:

An online topic detection method for news, used to detect news topics online, specifically comprising:

Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.

Stage one, initial static clustering:

S1. Pre-processing: in the chronological order of news publication, read the news published within a unit time interval, or a unit quantity of news, pre-process these news texts, and vectorize them;

S2. Apply the X-means algorithm to the newly read news texts to perform the initial static clustering;

S3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;

Stage two, dynamic clustering:

S4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity to each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic;

S5. When the maximum similarity obtained is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to the topic centroid list CentroidList_Topic as the centroid of the new topic; when the similarity obtained is greater than or equal to the topic aggregation threshold lowTX, merge the initial class into the most similar topic; when the similarity obtained is greater than or equal to the topic centroid update threshold highTX, update the centroid of the merged topic;

S6. Delete the initial class and its class centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster respectively;

S7. Return to step S4 until the initial class centroid list CentroidList_Cluster is empty, which completes one round of cluster analysis;

S8. Wait for the news texts of the next unit time interval or unit quantity, then return to step S1.
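The dynamic stage (S4 to S7) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `dynamic_stage`, the dictionary layout of a topic, and in particular the size-weighted centroid update are assumptions, since the text does not state how the topic centroid is recomputed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dynamic_stage(initial_classes, topics, low_tx, high_tx):
    """Merge one batch of initial classes into the running topic list.

    initial_classes: list of (centroid, doc_ids) pairs from the static stage.
    topics: list of dicts {"centroid", "size", "docs"}, modified in place;
            "size" counts the documents that contributed to the centroid.
    """
    pending = list(initial_classes)          # role of CentroidList_Cluster
    while pending:                           # S7: repeat until the list is empty
        centroid, doc_ids = pending.pop(0)   # S4 and S6: take and remove one class
        sims = [cosine(centroid, t["centroid"]) for t in topics]
        best_sim = max(sims, default=-1.0)
        if best_sim < low_tx:                # S5: open a new topic
            topics.append({"centroid": list(centroid),
                           "size": len(doc_ids), "docs": list(doc_ids)})
            continue
        topic = topics[sims.index(best_sim)]  # S5: merge into the closest topic
        topic["docs"].extend(doc_ids)
        if best_sim >= high_tx:              # S5: update only at or above highTX
            total = topic["size"] + len(doc_ids)
            # Assumed update rule: size-weighted mean of the two centroids.
            topic["centroid"] = [
                (tc * topic["size"] + cc * len(doc_ids)) / total
                for tc, cc in zip(topic["centroid"], centroid)
            ]
            topic["size"] = total
    return topics

topics = []
batch = [([1.0, 0.0], ["d1", "d2"]), ([0.8, 0.6], ["d3"]), ([0.0, 1.0], ["d4"])]
dynamic_stage(batch, topics, low_tx=0.5, high_tx=0.9)
```

In this toy batch the second class has similarity of about 0.8 to the first topic, which lies between the two thresholds, so it joins that topic while the topic centroid stays where it was.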

Preferably, the pre-processing of step S1 proceeds as follows: after word segmentation and part-of-speech tagging of the news published within the unit time interval or of the unit quantity of news, news elements and domain named entities are extracted, and different weights W_1i are assigned according to the positions in which they occur, where i denotes the i-th individual in the set of news elements and domain named entities; the weight W_1i is multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized. Here a domain named entity is a term specific to the domain in which topic detection is required, and news elements generally refer to the source, time, place, people and events of the news.

Preferably, the pre-processing of step S1 further includes identifying the news elements or domain named entities that occur in both the headline and the body of the news and assigning them a further weight W_2i; the weights W_1i and W_2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized, where W_1i < W_2i.
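A minimal Python sketch of this position-dependent weighting is given here for illustration. The numeric values of `W1` and `W2`, the function name `weight_features`, the token lists, and the rule that a term seen in only one position receives `W1` are all assumptions; the text only requires W_1i < W_2i and a multiplication by the occurrence count.

```python
from collections import Counter

W1 = 1.5  # hypothetical value: term recognized in one position only
W2 = 3.0  # hypothetical value: term found in both headline and body (W1 < W2)

def weight_features(title_tokens, body_tokens, entity_vocab):
    """Return {term: weight} for recognized news elements and domain entities."""
    title_terms, body_terms = set(title_tokens), set(body_tokens)
    # Number of occurrences of each token in the whole article.
    counts = Counter(title_tokens) + Counter(body_tokens)
    weights = {}
    for term, count in counts.items():
        if term not in entity_vocab:
            continue  # only recognized elements and entities become features
        in_both = term in title_terms and term in body_terms
        # Position weight multiplied by the occurrence count.
        weights[term] = (W2 if in_both else W1) * count
    return weights

title = ["gutter_oil", "scandal"]
body = ["gutter_oil", "Guangzhou", "gutter_oil", "restaurant"]
vocab = {"gutter_oil", "Guangzhou"}
features = weight_features(title, body, vocab)
print(features)
```

Here `gutter_oil` occurs three times and appears in both headline and body, so it ends up with a much larger weight than `Guangzhou`, which occurs once in the body.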

Preferably, the pre-processing of step S1 further includes feature selection: all weights W_1i and W_2i are sorted by magnitude, the T features with the largest weights are selected, and the weights of these T features are each multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized.
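The feature selection step can be illustrated with the short Python sketch below. It is one possible reading of the paragraph above: the function name `select_top_features`, the input dictionaries, and the alphabetical tie-break between equal weights are assumptions introduced for the example.

```python
def select_top_features(position_weights, counts, T):
    """Keep the T terms with the largest position weights,
    then scale each kept weight by the term's occurrence count."""
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    ranked = sorted(position_weights, key=lambda t: (-position_weights[t], t))
    kept = ranked[:T]
    return {t: position_weights[t] * counts.get(t, 0) for t in kept}

position_weights = {"gutter_oil": 3.0, "Guangzhou": 1.5,
                    "restaurant": 1.0, "clenbuterol": 3.0}
counts = {"gutter_oil": 3, "Guangzhou": 1, "restaurant": 2, "clenbuterol": 1}
selected = select_top_features(position_weights, counts, T=2)
print(selected)
```

Limiting the vector to T features keeps the representation compact, at the cost of discarding low-weight terms entirely.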

Preferably, in step S4 the similarity between an initial class centroid and a topic centroid is computed as the cosine of the angle between the two centroid vectors; the larger the cosine value, the greater the similarity.
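For reference, the cosine of the angle between two weight vectors can be computed as in the Python sketch below. The sparse `{term: weight}` representation and the function name `cosine_sparse` are assumptions chosen to match a bag-of-terms vectorization; returning 0.0 for an all-zero vector is a convention adopted here, not something the text specifies.

```python
import math

def cosine_sparse(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty vectors
    return dot / (norm_u * norm_v)

class_centroid = {"gutter_oil": 3.0, "restaurant": 4.0}
same_direction = {"gutter_oil": 3.0, "restaurant": 4.0}
unrelated = {"milk": 1.0}
sim_same = cosine_sparse(class_centroid, same_direction)
sim_unrelated = cosine_sparse(class_centroid, unrelated)
```

With non-negative term weights the value lies between 0 (no shared terms) and 1 (same direction), which is the range in which the thresholds lowTX and highTX are chosen.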

Compared with the prior art, the technical solution of the present invention has the following beneficial effects:

(1) In traditional topic detection methods, text pre-processing is merely a generic text-processing step; it does not distinguish texts of a specific domain or form so as to exploit their special properties, which leaves the semantic representation of the text incomplete. The present invention pre-processes news texts by mining the basic elements of news (time, place, event, people) and the domain-specific terms (i.e. domain named entities), and assigns each feature in the news (news elements and domain named entities) a weight according to its importance, which improves the accuracy of feature extraction and effectively improves the quality of text pre-processing.

(2) Traditional clustering methods applied to topic and event detection are prone to topic drift and poor clustering quality. The traditional Single-Pass algorithm is a classic incremental clustering algorithm that follows the time sequence in which news arrives and reads one news item at a time for incremental cluster analysis. This processing, however, raises a problem: in the dynamic clustering stage there are no other texts to serve as reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read, which harms clustering quality. Moreover, when merging texts into topics, Single-Pass assigns a text to a topic according to a single, predefined threshold, which easily causes topic drift. The present invention aims to provide a more effective topic detection method: it introduces a clustering buffer so that the texts arriving in a given quantity or within a given period first undergo an initial clustering with the X-means algorithm, and it introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold), which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation index and identifies food safety related topics more accurately.

Brief description of the drawings

Fig. 1 is a flow chart of the online topic detection of the present invention based on XMSP two-stage clustering.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing, taking the detection of food safety news topics as the example in this embodiment.

As shown in Fig. 1, the online topic detection method for news first pre-processes the news text: after word segmentation and part-of-speech tagging, the news elements and domain named entities relevant to the topic detection field are recognized and assigned corresponding weights; after feature selection and weighting, the text is represented as a vector and finally passed to the clustering stage. In the clustering stage, an initial static clustering is performed first, and the proposed method then performs the second, dynamic clustering. The final output is a set of news topics, each composed of one or more news items.

The specific steps of the XMSP two-stage clustering algorithm, an improvement on the Single-Pass algorithm, are as follows:

Initialization: the X-means parameters maxNumClusters and minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.

Input: the set of news texts within a unit time interval or of a unit quantity

Output: the topic set (TopicSet)

Stage one, initial static clustering:

Step1. In the chronological order of news publication, read the news published within a unit time interval (one day) or a unit quantity of news, pre-process these news texts, and convert the texts into the vector space;

Step2. Apply the X-means algorithm to the newly read batch of news texts to perform the initial static clustering;

Step3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;

Stage two, dynamic clustering:

Step4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity to each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic.

Step5. If the maximum similarity obtained is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to CentroidList_Topic as the centroid of the new topic.

Step6. If the similarity obtained is greater than or equal to the topic aggregation threshold lowTX, merge the initial class into the most similar topic.

Step7. If the similarity obtained is greater than or equal to the topic centroid update threshold highTX, update the centroid of the merged topic.

Step8. Delete the initial class and its class centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster respectively.

Step9. Return to Step4 until the initial class centroid list CentroidList_Cluster is empty.

Step10. One round of cluster analysis is complete; wait for the news texts of the next unit time interval (or unit quantity) to arrive, then return to Step1.

In this embodiment, the topic detection method of the present invention is applied to the detection of food safety topics, specifically as follows:

(1) Extract the news elements and named entities related to food safety and assign them different weights according to where they occur, increasing the weights of features related to the subject of the text and reducing the weights of unrelated features, thereby improving the accuracy of the text representation. The specific steps are as follows:

1) Use the Chinese Academy of Sciences word segmentation system (ICTCLAS) together with a food safety named entity library (containing common food names and the names of harmful substances that threaten food safety) to recognize the person names, place names, organization names and food safety named entities appearing in the news body, and assign them the weight W_1i.

2) Recognize the named entities that appear both in the headline and in the body of the news, and assign them the further weight W_2i.

3) Multiply W_1i and W_2i by the term frequency (the number of times the word occurs in the text) to compute the weight of each word in the text.

(2) present invention carries out softening processing to Single-Pass algorithms, it is proposed that XMSP second order clustering algorithms.The algorithm By introducing the cluster buffering area based on X-means algorithms come the cluster to Single-Pass algorithms one text of single treatment Mode carries out softening, is controlled beneficial to dual thresholds come polymerization respectively to topic and the renewal of topic barycenter, so that flexible Change the problem of processing threshold value is set.XMSP second orders clustering algorithm includes two clustering stages, i.e. initial static clustering phase With the dynamic clustering stage.

1) initial static clustering phase is to carry out cluster point by the X-means algorithms newsletter archive collection static to one Analysis.By setting up a news buffering area, in unit interval or Board Lot news in advance carry out an initial static Cluster, and no longer it is the news progress clustering of one one, so as to effectively alleviate news order of arrival to cluster point Influence caused by analysis.
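The news buffer can be pictured as a batching step placed in front of the static clustering. The Python sketch below shows the two batching modes mentioned in the text, by unit time (one day) and by unit quantity; the function names and the `published` field are hypothetical, and the X-means step that would consume each batch is not shown.

```python
from datetime import date
from itertools import groupby, islice

def batches_by_day(news):
    """Yield lists of articles sharing a publication date.

    The input must already be sorted by publication time.
    """
    for _, group in groupby(news, key=lambda item: item["published"]):
        yield list(group)

def batches_by_count(news, size):
    """Yield consecutive lists of at most `size` articles."""
    it = iter(news)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

stream = [
    {"id": "n1", "published": date(2012, 11, 1)},
    {"id": "n2", "published": date(2012, 11, 1)},
    {"id": "n3", "published": date(2012, 11, 2)},
]
daily = [[a["id"] for a in b] for b in batches_by_day(stream)]
fixed = [[a["id"] for a in b] for b in batches_by_count(stream, 2)]
```

Each batch would then be clustered as a whole, so the order of articles inside a batch no longer affects the initial classes in the way it affects a one-at-a-time pass.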

2) The dynamic clustering stage still draws on the idea of the Single-Pass algorithm, but its centroid adjustment strategy differs: the XMSP dynamic clustering stage introduces two similarity thresholds, the topic aggregation threshold lowTX and the topic centroid update threshold highTX (where lowTX < highTX). The initial static clustering yields a number of initial classes, and the similarity between the centroid of each initial class and the centroid of each existing topic is then computed. When the maximum similarity between an initial class and the topic centroids is greater than the topic aggregation threshold lowTX, the initial class is merged into the most similar topic. This does not mean, however, that the topic centroid is also updated: the centroid of the merged topic is updated only when the similarity is greater than the topic centroid update threshold highTX.
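Writing s for the maximum similarity between an initial class centroid and the existing topic centroids (a symbol introduced here for illustration), the dual-threshold rule can be summarized as the following case distinction. The boundary cases use the greater-than-or-equal comparisons of Step5 to Step7.

```latex
\text{action}(s) =
\begin{cases}
\text{create a new topic from the initial class}, & s < lowTX \\
\text{merge into the closest topic, centroid unchanged}, & lowTX \le s < highTX \\
\text{merge into the closest topic and update its centroid}, & s \ge highTX
\end{cases}
\qquad \text{with } lowTX < highTX
```

The middle band is what limits drift: a loosely related class can join a topic without pulling that topic's centroid toward itself.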

Embodiment 1

(1) Corpus for the case study

The corpus consists of news reports on food safety collected from major Internet news media between November 2012 and January 2013. It contains 1034 news items on food safety covering 11 topics: gutter oil, golden rice, Jiugui liquor, fast-grown chicken, clenbuterol, Mead Johnson milk powder, toxic bean sprouts, Bright Dairy milk, gelatin shark fin, carcinogenic edible oil, and tap water safety. The number of news items and the time distribution of each topic are shown in Table 1 below.

Table 1. Food safety topic corpus

(2) Evaluation method

A supervised measure is used to evaluate system performance, that is, to measure how well the cluster labels correspond to the topic labels. A cluster label is the label the system assigns to a news item through cluster analysis; a topic label is the label assigned manually according to the true category of the news item. The correspondence between clusters and topics is established by counting the number of articles of each topic in each cluster: if cluster i contains more articles of topic j than of any other topic, cluster i is labeled as topic j.

Here n_ij is the number of news items of topic j that the system assigned to cluster i, n_i is the number of articles in cluster i, n_j is the number of articles in topic j, and n is the total number of articles. Through the correspondence, the miss rate P_Miss(i, j) and the false alarm rate P_Fa(i, j) of each topic are obtained, and averaging the results over all topics gives the final miss rate P_Miss and false alarm rate P_Fa of the system. The average miss rate P_Miss and the average false alarm rate P_Fa are calculated as follows:
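Under the conventional TDT definitions, which are assumed here because they match the symbols just defined, the per-topic rates and their averages can be written as follows. The notation i_j for the cluster paired with topic j and the symbol k for the number of topics are introduced for illustration.

```latex
P_{Miss}(i,j) = 1 - \frac{n_{ij}}{n_j}, \qquad
P_{Fa}(i,j) = \frac{n_i - n_{ij}}{n - n_j}

P_{Miss} = \frac{1}{k}\sum_{j=1}^{k} P_{Miss}(i_j, j), \qquad
P_{Fa} = \frac{1}{k}\sum_{j=1}^{k} P_{Fa}(i_j, j)
```

For example, if topic j has 4 articles and its cluster holds 3 of them and nothing else, then P_Miss(i, j) = 1 - 3/4 = 0.25 and P_Fa(i, j) = 0.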

Distance from origin: this index is a combined assessment of the miss rate and the false alarm rate; the lower its value, the better the system performs. It is calculated as follows:
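Assuming, as the name suggests, that the index is the Euclidean distance of the point (P_Miss, P_Fa) from the origin, it can be written as below; the symbol D is introduced here for illustration.

```latex
D = \sqrt{P_{Miss}^{2} + P_{Fa}^{2}}
```

A system with no misses and no false alarms would sit at the origin, giving D = 0.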

Besides the miss rate and false alarm rate used in topic detection and tracking, the traditional evaluation indices for clustering algorithms include precision Precision(i, j), recall Recall(i, j) and the F-measure F(i, j). These three indices are calculated as follows:
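In their standard form, and assuming the balanced F-measure is meant, the three indices for a cluster i paired with class j read:

```latex
Precision(i,j) = \frac{n_{ij}}{n_i}, \qquad
Recall(i,j) = \frac{n_{ij}}{n_j}, \qquad
F(i,j) = \frac{2 \cdot Precision(i,j) \cdot Recall(i,j)}{Precision(i,j) + Recall(i,j)}
```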

Here n_ij is the number of articles in cluster i that belong to class j, n_i is the number of articles in cluster i, and n_j is the number of articles in class j.

As with the miss rate and false alarm rate, the correspondence between clusters and classes is established, the precision, recall and F-measure of each class are computed, and the per-class values are averaged to obtain the three evaluation indices of the system. The formulas are as follows [31]:
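Read as plain averages over the k classes, which is an assumption consistent with the description of averaging the per-class values, the system-level indices are:

```latex
Precision = \frac{1}{k}\sum_{j=1}^{k} P_j, \qquad
Recall = \frac{1}{k}\sum_{j=1}^{k} R_j, \qquad
F = \frac{1}{k}\sum_{j=1}^{k} F_j
```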

Here k is the number of topics, R_j is the recall for class j, P_j is the precision for class j, and F_j is the F-measure for class j.
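The evaluation indices can be computed from two parallel label lists, as in the Python sketch below. It is an illustration under stated assumptions: the conventional TDT and clustering definitions are used for the per-topic quantities, each topic j is paired with the cluster that holds most of its articles (the text labels clusters by their dominant topic, and the two pairings agree when clusters are reasonably pure), and the function name `evaluate` is hypothetical.

```python
import math
from collections import Counter

def evaluate(cluster_labels, topic_labels):
    """Macro-averaged miss rate, false alarm rate, distance from origin,
    precision, recall and F-measure over all topics."""
    n = len(topic_labels)
    n_i = Counter(cluster_labels)                      # articles per cluster
    n_j = Counter(topic_labels)                        # articles per topic
    n_ij = Counter(zip(cluster_labels, topic_labels))  # overlap counts
    rows = []
    for j in sorted(n_j):
        # Assumption: pair topic j with the cluster holding most of its articles.
        i = max(sorted(n_i), key=lambda c: n_ij[(c, j)])
        hit = n_ij[(i, j)]
        miss = 1 - hit / n_j[j]
        fa = (n_i[i] - hit) / (n - n_j[j]) if n > n_j[j] else 0.0
        p = hit / n_i[i]
        r = hit / n_j[j]
        f = 2 * p * r / (p + r) if p + r else 0.0
        rows.append((miss, fa, p, r, f))
    k = len(rows)
    p_miss, p_fa, prec, rec, f1 = (sum(col) / k for col in zip(*rows))
    return {"P_Miss": p_miss, "P_Fa": p_fa,
            "distance": math.sqrt(p_miss ** 2 + p_fa ** 2),
            "precision": prec, "recall": rec, "F": f1}

cluster_labels = [0, 0, 0, 1, 1, 1]
topic_labels = ["a", "a", "b", "b", "b", "b"]
scores = evaluate(cluster_labels, topic_labels)
```

In this toy case one article of topic "b" falls into the cluster dominated by topic "a", which produces a non-zero false alarm rate for "a" and a non-zero miss rate for "b".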

(3) Analysis of results

1) Experimental comparison of text pre-processing methods on topic detection: the generic pre-processing strategy (word segmentation, stop-word removal, tf-idf weighting) is compared with the strategy that adds recognition and weighting of news elements and food safety named entities. Using the XMSP clustering method described here, the topic detection results are as follows:

Table 2. XMSP results before and after adding recognition of news elements and named entities

2) Effect of introducing the clustering buffer: to control for the influence of the topic centroid update threshold (highTX) in the XMSP algorithm, this experiment sets the topic centroid update threshold (highTX) and the topic aggregation threshold (lowTX) of the XMSP algorithm to the same value. In the topic aggregation stage, the Single-Pass algorithm merges one text at a time, so its granularity is fine, and its threshold TX is taken in the range 0.02 to 0.24. When TX is 0.24 the cluster analysis already yields 146 clusters; a larger TX would only produce more clusters, so such values are not tallied. The XMSP algorithm differs from Single-Pass in that it merges an initial class with the existing topics, so its granularity is coarser; the XMSP algorithm therefore achieves its best results when the threshold lowTX (with lowTX = highTX) is between 0.33 and 0.35. The comparison of each index when the two algorithms reach their best results is shown in Table 3. This also verifies the effectiveness of adding the clustering buffer to the XMSP algorithm.

Table 3. Best results of the XMSP algorithm with the clustering buffer compared with the Single-Pass algorithm

3) Effect of introducing the topic centroid update threshold:

In this experiment the topic aggregation threshold lowTX of the XMSP algorithm is first fixed at a given value, and the topic centroid update threshold highTX is then adjusted to observe the influence of this threshold on the performance of the XMSP algorithm. Tables 4 and 5 show the results of the XMSP algorithm for different highTX values with lowTX = 0.33 and lowTX = 0.42 respectively.

Table 4. Results of the XMSP algorithm (lowTX = 0.33)

Table 5. Results of the XMSP algorithm (lowTX = 0.42)

Tables 4 and 5 show that introducing the topic centroid update threshold highTX and adjusting its value allows the XMSP algorithm to achieve better results. With lowTX = 0.33, the XMSP algorithm achieves its best results when highTX is between 0.45 and 0.47; with lowTX = 0.42, it achieves its best results when highTX is between 0.46 and 0.52. This also demonstrates the effectiveness of introducing the topic centroid update threshold.

(1) This embodiment pre-processes Internet news texts related to food safety, mines the basic elements of the news (time, place, event, people), and assigns them weights according to their importance, which improves the accuracy of feature extraction. The most prominent characteristic of food safety news is that its content centers on food safety; that is, the keywords of a food safety news item consist mainly of terms specific to the food safety field (food safety named entities), and these named entities play a very important role in expressing the information contained in the news. Recognizing the food safety named entities that appear in the news and assigning them corresponding weights can therefore improve the quality of text pre-processing more effectively.

(2) Traditional clustering methods applied to topic and event detection are prone to topic drift and poor clustering quality. The traditional Single-Pass algorithm is a classic incremental clustering algorithm that follows the time sequence in which news arrives and reads one news item at a time for incremental cluster analysis. This processing, however, raises a problem: in the dynamic clustering stage there are no other texts to serve as reference during feature extraction, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which texts are read, which harms clustering quality. Moreover, when merging texts into topics, Single-Pass assigns a text to a topic according to a single, predefined threshold, which easily causes topic drift. The topic detection method of this embodiment introduces a clustering buffer so that the texts arriving in a given quantity or within a given period first undergo an initial clustering with the X-means algorithm, and it introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold), which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation index and identifies food safety related topics more accurately.

Claims

1. An online topic detection method for news, characterized in that it specifically comprises:

Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic;

Stage one, initial static clustering:

S1. Pre-processing: in the chronological order of news publication, read the news published within a unit time interval, or a unit quantity of news, pre-process these news texts, and vectorize them;

S3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;

Stage two, dynamic clustering:

S4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity to each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic;

S5. When the maximum similarity obtained is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to the topic centroid list CentroidList_Topic as the centroid of the new topic; when the similarity obtained is greater than or equal to the topic aggregation threshold lowTX, merge the initial class into the most similar topic; when the similarity obtained is greater than or equal to the topic centroid update threshold highTX, update the centroid of the merged topic;

S6. Delete the initial class and its class centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster respectively;

S7. Return to step S4 until the initial class centroid list CentroidList_Cluster is empty, which completes one round of cluster analysis.

2. The online topic detection method for news according to claim 1, characterized in that the pre-processing of step S1 proceeds as follows: after word segmentation and part-of-speech tagging of the news published within the unit time interval or of the unit quantity of news, news elements and domain named entities are extracted, and different weights W_1i are assigned according to the positions in which they occur, where i denotes the i-th individual in the set of news elements and domain named entities; the weight W_1i is multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized; wherein a domain named entity is a term specific to the domain in which topic detection is required.

3. The online topic detection method for news according to claim 2, characterized in that the pre-processing of step S1 further includes identifying the news elements or domain named entities that occur in both the headline and the body of the news and assigning them a further weight W_2i; the weights W_1i and W_2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized, where W_1i < W_2i.

4. The online topic detection method for news according to claim 3, characterized in that the pre-processing of step S1 further includes feature selection: all weights W_1i and W_2i are sorted by magnitude, the T features with the largest weights are selected, and the weights of these T features are each multiplied by the number of times the corresponding news element or domain named entity occurs in the Internet news item to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized.

5. The online topic detection method for news according to claim 4, characterized in that in step S4 the similarity between an initial class centroid and a topic centroid is computed as the cosine of the angle between the two centroid vectors; the larger the cosine value, the greater the similarity.