CN104715014B - Online topic detection method for news - Google Patents
- Publication number
- CN104715014B (application CN201510039493.4A)
- Authority
- CN
- China
- Prior art keywords
- topic
- news
- centroid
- cluster
- initial cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses an online news topic detection method in the field of computer science and technology. It targets web text on the internet that requires topic detection and proposes a more effective detection method. A clustering buffer is built so that the texts arriving within a given count or time window are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation metric and identifies the topics of interest more accurately.
Description
Technical field
The present invention relates to the field of computer science and technology, and more particularly to an online topic detection method for internet news.
Background
Topic detection (TD) is one of the five basic research tasks of Topic Detection and Tracking (TDT); its goal is to detect and organize topics that are not known to the system in advance. The TDT project was funded by the U.S. Defense Advanced Research Projects Agency (DARPA), with the University of Massachusetts, Carnegie Mellon University and Dragon Systems participating jointly. The project automatically analyzes a continuous stream of news media information, detects the topics present in it, and tracks the topics already detected. Research on topic detection grew out of the TDT project. For the topic detection task, the Single-Pass algorithm is widely used. Single-Pass is an incremental clustering algorithm that clusters a text stream. It computes the similarity between each arriving text and the centroids of the existing clusters in turn. If the maximum similarity is greater than or equal to a threshold, the text is merged into the most similar cluster and that cluster's centroid is recomputed; if the maximum similarity is below the threshold, a new cluster is created and the text is placed in it.
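The Single-Pass procedure described above can be sketched as follows. This is a minimal illustration, not code from the patent: texts are assumed to be sparse term-weight dictionaries, similarity is assumed to be cosine, and the centroid is assumed to be the mean of the member vectors. The names cosine, mean_vector and single_pass are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Centroid taken as the component-wise mean (an assumption)."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}


def single_pass(docs, threshold):
    """Cluster docs one at a time, in arrival order."""
    clusters = []  # each cluster: {"members": [...], "centroid": {...}}
    for doc in docs:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(doc, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Join the most similar cluster and recompute its centroid.
            best["members"].append(doc)
            best["centroid"] = mean_vector(best["members"])
        else:
            # Below the threshold: open a new cluster.
            clusters.append({"members": [doc], "centroid": dict(doc)})
    return clusters


docs = [{"oil": 1.0}, {"oil": 1.0, "gutter": 1.0}, {"milk": 1.0}]
clusters = single_pass(docs, threshold=0.5)
```

Because every centroid is recomputed after each single text, the result depends on the order of docs, which is the weakness the following paragraphs discuss.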
Hong Yu et al. (Hong Yu, Zhang Yu, Liu Ting. Evaluation and research review of topic detection and tracking [J]. Journal of Chinese Information Processing, 2007, (06): 71-87.) reviewed the evaluation of and research on topic detection and tracking, describing its main tasks and key technologies as well as its main corpora and evaluation methods. Jia Ziyan et al. (Jia Ziyan, He Qing, Zhang Junhai, et al. An event detection and tracking algorithm based on a dynamic evolution model [J]. Journal of Computer Research and Development, 2004, 41(7): 1273-1280.) recognized named entities such as person names and place names in the text, assigned them different weights according to their category, and designed an event detection and tracking algorithm with a dynamic evolution model that borrows the Single-Pass clustering approach. Zhang Kuo et al. (Zhang K, Zi J, Wu LG. New event detection based on indexing-tree and named entity [C]. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007: 215-22.) used χ² statistics to measure the association between each entity class in the corpus and each topic category, and assigned a different weight to each feature according to that association. Zhao Hua et al. (Zhao Hua, Zhao Tiejun, et al. Research on topic detection oriented to dynamic evolution [J]. High Technology Letters, 2006, 12(16): 1230-1235.) took the temporal factor into account and identified the boundaries of topic evolution. Jin Zhu et al. (Jin Zhu, Lin Hongfei, Zhao Jing. Research on topic tracking and tendency classification based on HowNet [J]. Journal of the China Society for Scientific and Technical Information, 2005, 5(24): 555-561.) built a structured topic model with HowNet and described topics from different facets.
In summary, existing work mainly finds topics by clustering with the Single-Pass algorithm. Single-Pass is a classic incremental clustering algorithm: it reads one news item at a time, in order of arrival, and clusters incrementally. This brings a problem. In the dynamic clustering stage, features are extracted with no other text as reference, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which the texts are read, which hurts clustering quality. In addition, Single-Pass assigns a text to a topic using a single predefined threshold, which easily causes topic drift.
Summary of the invention
The present invention aims to provide a more effective online news topic detection method. A clustering buffer is introduced so that the texts arriving within a given count or time window are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation metric and identifies the news topics of a given domain more accurately.
To achieve the above object, the technical solution of the present invention is as follows:
An online news topic detection method, used to detect news topics online, comprising:
Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial cluster set ClusterSet, the initial cluster centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.
Stage one, initial static clustering:
S1. Preprocessing: in order of publication time, read the news published within one unit time window, or one unit quantity of news, preprocess these news texts, and vectorize them;
S2. Apply the X-means algorithm to the newly read news texts for initial static clustering;
S3. Store the initial clusters obtained by static clustering in the initial cluster set ClusterSet, compute the centroid of each initial cluster, and add it to the initial cluster centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
S4. Take one initial cluster centroid from CentroidList_Cluster, compute its similarity to every topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic;
S5. If the maximum similarity is less than the topic aggregation threshold lowTX, create a new topic, merge the initial cluster into it, and add the centroid of the initial cluster to CentroidList_Topic as the centroid of the new topic; if the similarity is greater than or equal to lowTX, merge the initial cluster into the most similar topic; if the similarity is greater than or equal to the topic centroid update threshold highTX, update the centroid of the topic after merging;
S6. Delete the initial cluster and its centroid from ClusterSet and CentroidList_Cluster respectively;
S7. Return to step S4 until CentroidList_Cluster is empty, which completes one round of clustering;
S8. Wait for the news texts of the next unit time window or unit quantity, then return to step S1.
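Steps S4 to S7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: initial clusters and topics are plain dictionaries holding member vectors and a centroid, similarity is cosine, and a centroid update is assumed to mean recomputing the mean of all member vectors (the text does not say how the update is done). The function names are hypothetical; low_tx and high_tx stand for lowTX and highTX.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean; the update rule is an assumption."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}


def merge_initial_clusters(initial, topics, low_tx, high_tx):
    """Stage two: fold each initial cluster into the topic list."""
    while initial:                       # S7: repeat until the list is empty
        cluster = initial.pop(0)         # S4 and S6: take and remove one cluster
        best, best_sim = None, -1.0
        for topic in topics:
            sim = cosine(cluster["centroid"], topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is None or best_sim < low_tx:
            # S5, first branch: open a new topic with this centroid.
            topics.append({"docs": list(cluster["docs"]),
                           "centroid": dict(cluster["centroid"])})
            continue
        # S5, second branch: merge into the most similar topic.
        best["docs"].extend(cluster["docs"])
        if best_sim >= high_tx:
            # S5, third branch: only a close match moves the centroid.
            best["centroid"] = mean_vector(best["docs"])
    return topics


initial = [
    {"docs": [{"oil": 1.0}], "centroid": {"oil": 1.0}},
    {"docs": [{"oil": 1.0, "gutter": 1.0}], "centroid": {"oil": 1.0, "gutter": 1.0}},
    {"docs": [{"milk": 1.0}], "centroid": {"milk": 1.0}},
]
topics = merge_initial_clusters(initial, [], low_tx=0.5, high_tx=0.9)
```

In this example the second cluster is similar enough to be merged (about 0.71) but not similar enough to move the topic centroid, which is the behaviour the two thresholds are meant to separate.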
Preferably, the preprocessing in step S1 is as follows: after word segmentation and part-of-speech tagging of the news published within the unit time window (or of the unit quantity of news), news elements and domain named entities are extracted and assigned different weights W1i according to the positions in which they appear, where i denotes the i-th item in the set of news elements and domain named entities. Each weight W1i is multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, which gives the weight of each news element or domain named entity in the text, and the news text is vectorized. Here, domain named entities are the specialized terms of the domain in which topics are to be detected, and news elements are, in general, the source, time, place, event and persons of the news.
Preferably, the preprocessing in step S1 further includes identifying the news elements or domain named entities that occur in both the headline and the body and assigning them a further weight W2i. The weights W1i and W2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, which gives the weight of each news element or domain named entity in the text, and the news text is vectorized, where W1i<W2i.
Preferably, the preprocessing in step S1 further includes feature selection: all weights W1i and W2i are sorted by magnitude, the T features with the largest weights are selected, and the weights of these T features are each multiplied by the number of times the corresponding news element or domain named entity occurs in the news item, which gives the weight of each news element or domain named entity in the text, and the news text is vectorized.
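The weighting and feature selection above can be illustrated with a small sketch. It simplifies the text in two ways that should be read as assumptions: a single constant w1 stands in for the position-dependent weights W1i and a single constant w2 for W2i, and ties during selection are broken alphabetically. The function name vectorize and its parameters are hypothetical, and the input is assumed to be already segmented into tokens.

```python
from collections import Counter


def vectorize(title_tokens, body_tokens, entities, w1=1.0, w2=2.0, top_t=None):
    """Weight recognized terms by position, then multiply by term count."""
    title_set, body_set = set(title_tokens), set(body_tokens)
    # Count occurrences of recognized news elements / domain entities.
    counts = Counter(
        tok for tok in list(title_tokens) + list(body_tokens) if tok in entities
    )
    # Terms found in both headline and body get the larger weight (w1 < w2).
    assigned = {
        term: (w2 if term in title_set and term in body_set else w1)
        for term in counts
    }
    if top_t is not None:
        # Keep the T terms with the largest assigned weight.
        ranked = sorted(assigned, key=lambda term: (-assigned[term], term))
        assigned = {term: assigned[term] for term in ranked[:top_t]}
    # Final weight = assigned weight x number of occurrences.
    return {term: weight * counts[term] for term, weight in assigned.items()}


title = ["gutter_oil", "found"]
body = ["gutter_oil", "melamine", "melamine", "restaurant"]
entities = {"gutter_oil", "melamine"}
vec = vectorize(title, body, entities)
top = vectorize(title, body, entities, top_t=1)
```

Tokens that are not recognized entities ("found", "restaurant") are dropped, so the resulting vector contains only the features the method considers informative.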
Preferably, in step S4 the similarity between an initial cluster centroid and a topic centroid is computed as the cosine of the angle between the two centroid vectors; the larger the cosine, the greater the similarity.
Compared with the prior art, the technical solution of the present invention has the following beneficial effects:
(1) In conventional topic detection methods, text preprocessing is only a generic text-processing step; texts from a specific domain or of a specific form are not treated differently to exploit their special properties, so part of the text's semantics is lost. The present invention preprocesses news texts by mining the essential elements of news (time, place, event, persons) and the specialized terms of the domain (i.e. domain named entities), and assigns each feature in the news (news elements and domain named entities) a weight according to its importance, which improves the accuracy of feature extraction and the quality of text preprocessing.
(2) Conventional clustering methods applied to topic and event detection tend to suffer from topic drift and poor clustering quality. The conventional Single-Pass algorithm is a classic incremental clustering algorithm that reads one news item at a time, in order of arrival, and clusters incrementally. This brings a problem: in the dynamic clustering stage, features are extracted with no other text as reference, so each text is processed in isolation, and the centroid of each topic can differ greatly depending on the order in which the texts are read, which hurts clustering quality. In addition, Single-Pass assigns a text to a topic using a single predefined threshold, which easily causes topic drift. The present invention aims to provide a more effective topic detection method: a clustering buffer is introduced so that the texts arriving within a given count or time window are first clustered with the X-means algorithm, and a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold) is introduced, which effectively controls topic drift and improves clustering quality. The method outperforms the classic Single-Pass algorithm on every evaluation metric and identifies food-safety-related topics more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of online topic detection based on XMSP two-stage clustering according to the present invention.
Detailed description
The present invention is further described below with reference to the drawing, taking news topic detection for food safety as the example in this embodiment.
As shown in Fig. 1, the online news topic detection method first preprocesses the news texts: after word segmentation and part-of-speech tagging, the news elements and the domain named entities relevant to the topic detection domain are identified and assigned weights; after feature selection and weighting, each text is represented as a vector and passed to the clustering stage. In the clustering stage, initial static clustering is performed first, and the proposed method is then used for the dynamic second-stage clustering. The final output is a set of news topics, each consisting of one or more news items.
The XMSP two-stage clustering algorithm, an improvement on the Single-Pass algorithm, proceeds as follows:
Initialization: the X-means parameters maxNumClusters and minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial cluster set ClusterSet, the initial cluster centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic.
Input: the set of news texts of one unit time window or one unit quantity
Output: the topic set (TopicSet)
Stage one, initial static clustering:
Step1. In order of publication time, read the news published within one unit time window (one day), or one unit quantity of news, preprocess these news texts, and map them into the vector space;
Step2. Apply the X-means algorithm to this batch of newly read news texts for initial static clustering;
Step3. Store the initial clusters obtained by static clustering in the initial cluster set ClusterSet, compute the centroid of each initial cluster, and add it to the initial cluster centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
Step4. Take one initial cluster centroid from CentroidList_Cluster, compute its similarity to every topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity and the corresponding topic.
Step5. If the maximum similarity is less than the topic aggregation threshold lowTX, create a new topic, merge the initial cluster into it, and add the centroid of the initial cluster to CentroidList_Topic as the centroid of the new topic.
Step6. If the similarity is greater than or equal to the topic aggregation threshold lowTX, merge the initial cluster into the most similar topic.
Step7. If the similarity is greater than or equal to the topic centroid update threshold highTX, update the centroid of the topic after merging.
Step8. Delete the initial cluster and its centroid from ClusterSet and CentroidList_Cluster respectively.
Step9. Return to Step4 until CentroidList_Cluster is empty.
Step10. One round of clustering is complete; wait for the news texts of the next unit time window (or unit quantity) to arrive, then return to Step1.
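The buffering in Step1 and Step10, where news is released to the clustering stages one unit at a time, can be sketched as follows. This is an illustration only: news is assumed to be a list of (publication date, text) pairs, and the function names are hypothetical. The X-means step itself is not shown, since the text does not name an implementation.

```python
from datetime import date
from itertools import groupby


def batches_by_day(news):
    """Yield (day, texts) with one batch per publication day, oldest first."""
    ordered = sorted(news, key=lambda item: item[0])
    for day, group in groupby(ordered, key=lambda item: item[0]):
        yield day, [text for _, text in group]


def batches_by_count(news, size):
    """Yield batches of at most `size` texts in publication order."""
    ordered = sorted(news, key=lambda item: item[0])
    for start in range(0, len(ordered), size):
        yield [text for _, text in ordered[start:start + size]]


news = [
    (date(2012, 11, 2), "b"),
    (date(2012, 11, 1), "a"),
    (date(2012, 11, 2), "c"),
]
daily = list(batches_by_day(news))
fixed = list(batches_by_count(news, 2))
```

Each batch would then go through stage one (static clustering) and stage two (dynamic clustering) before the next batch is read.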
In this embodiment, the topic detection method of the present invention is applied to food safety, as follows:
(1) News elements and named entities related to food safety are extracted and weighted differently according to where they appear, which raises the weight of features related to the text's subject, lowers the weight of unrelated features, and improves the accuracy of the text representation. The steps are:
1) The Chinese Academy of Sciences word segmentation system (ICTCLAS) is used together with a food-safety named-entity lexicon (containing common food names and the names of harmful substances that threaten food safety) to identify the person names, place names, organization names and food-safety named entities that appear in the body of the news, and each is given a weight W1i.
2) Named entities that appear both in the headline and in the body are identified and given a further weight W2i.
3) W1i and W2i are multiplied by the term frequency (the number of times the term occurs in the text) to compute the weight of each term in the text.
(2) The present invention makes the Single-Pass algorithm more flexible and proposes the XMSP two-stage clustering algorithm. By introducing a clustering buffer based on the X-means algorithm, the algorithm relaxes Single-Pass's practice of processing one text at a time, and it uses two thresholds to control topic aggregation and topic centroid updating separately, which makes threshold setting more flexible. The XMSP two-stage clustering algorithm consists of two clustering stages: the initial static clustering stage and the dynamic clustering stage.
1) The initial static clustering stage clusters a static set of news texts with the X-means algorithm. A news buffer is set up so that the news within one unit time window, or one unit quantity of news, is given an initial static clustering in advance instead of being clustered one item at a time, which effectively reduces the influence of news arrival order on the clustering.
2) The dynamic clustering stage still borrows the idea of the Single-Pass algorithm, but unlike Single-Pass's centroid adjustment strategy, the XMSP dynamic clustering stage introduces two similarity thresholds: the topic aggregation threshold lowTX and the topic centroid update threshold highTX (where lowTX<highTX). The initial static clustering yields a number of initial clusters, and the centroid of each is compared with the centroid of every existing topic. When the maximum similarity between an initial cluster and the topic centroids is greater than the topic aggregation threshold lowTX, the initial cluster is merged into the most similar topic. This does not by itself mean that the topic centroid is updated: only when the similarity is greater than the topic centroid update threshold highTX is the centroid of the merged topic updated.
Embodiment 1
(1) Case study corpus
The corpus consists of news reports about food safety collected from major internet news media; the collection period is November 2012 to January 2013. It contains 1034 news items about food safety covering 11 topics: gutter oil, golden rice, Jiugui liquor, fast-growing chicken, clenbuterol, Mead Johnson milk powder, toxic bean sprouts, Bright Dairy milk, gelatin shark fin, carcinogens in edible oil, and tap water safety. The number of news items and the time distribution of each topic are shown in Table 1 below.
Table 1. Food safety topic corpus
(2) Evaluation method
System performance is evaluated in a supervised manner, that is, by measuring how well the cluster labels agree with the topic labels. A cluster label is the label the system assigns to a news item through clustering; a topic label is the label assigned manually according to the true category of the news item. Clusters and topics are matched by counting the number of articles of each topic contained in each cluster: if cluster i contains the most articles of topic j, cluster i is labeled as topic j.
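The per-topic formulas are not reproduced in this text. A reconstruction that assumes the usual TDT definitions, using the counts n_ij, n_i, n_j and n defined immediately below, would be:

```latex
P_{Miss}(i,j) = \frac{n_j - n_{ij}}{n_j}, \qquad
P_{Fa}(i,j) = \frac{n_i - n_{ij}}{n - n_j}
```

Under this reading, the miss rate is the share of topic j's articles that the matched cluster i fails to contain, and the false alarm rate is the share of all off-topic articles that cluster i wrongly contains.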
Here, n_ij is the number of news items of topic j that the system placed in cluster i, n_i is the number of articles in cluster i, n_j is the number of articles in topic j, and n is the total number of articles. Once the correspondence is established, the miss rate P_Miss(i, j) and false alarm rate P_Fa(i, j) of each topic are obtained, and the results for all topics are averaged to give the system's final miss rate P_Miss and false alarm rate P_Fa. The average miss rate P_Miss and average false alarm rate P_Fa are computed as follows:
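The averaged formulas are missing from this text. Assuming a plain average over the k topics, where k is the number of topics and i_j (a symbol introduced here for illustration) denotes the cluster matched to topic j, they would read:

```latex
P_{Miss} = \frac{1}{k}\sum_{j=1}^{k} P_{Miss}(i_j, j), \qquad
P_{Fa} = \frac{1}{k}\sum_{j=1}^{k} P_{Fa}(i_j, j)
```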
Distance from origin: this metric is a combined assessment of the miss rate and the false alarm rate; the lower its value, the better the system performs. It is computed as follows:
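The formula is missing from this text. Assuming the Euclidean form that the name of the metric suggests, with D a symbol introduced here for illustration, it would be:

```latex
D = \sqrt{P_{Miss}^{2} + P_{Fa}^{2}}
```

A system with no misses and no false alarms sits at the origin, so D = 0 would be a perfect score under this reading.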
In addition to the miss rate and false alarm rate used in topic detection and tracking, the conventional evaluation metrics for clustering algorithms include precision Precision(i, j), recall Recall(i, j) and the F-measure F(i, j). These three metrics are computed as follows:
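The formulas are missing from this text. The standard definitions, which are consistent with the counts defined immediately below, are assumed to be the ones intended:

```latex
Precision(i,j) = \frac{n_{ij}}{n_i}, \qquad
Recall(i,j) = \frac{n_{ij}}{n_j}, \qquad
F(i,j) = \frac{2 \cdot Precision(i,j) \cdot Recall(i,j)}{Precision(i,j) + Recall(i,j)}
```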
Here, n_ij is the number of articles in cluster i that belong to class j, n_i is the number of articles in cluster i, and n_j is the number of articles in class j.
As with the miss rate and false alarm rate, a correspondence between clusters and classes is established, the precision, recall and F-measure of each class are computed, and the per-class values are averaged to give the three system-level metrics. They are computed as follows [31]:
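The averaged formulas are missing from this text. An unweighted average over the classes is assumed here; a size-weighted average is another common convention, and the text does not say which one was used:

```latex
R = \frac{1}{k}\sum_{j=1}^{k} R_j, \qquad
P = \frac{1}{k}\sum_{j=1}^{k} P_j, \qquad
F = \frac{1}{k}\sum_{j=1}^{k} F_j
```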
Here, k is the number of topics, R_j is the recall for class j, P_j is the precision for class j, and F_j is the F-measure for class j.
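The evaluation procedure can be sketched in code. Everything below is an assumption layered on the description: each topic is matched to the cluster holding most of its articles, the per-topic rates use the usual TDT and clustering definitions, system-level values are unweighted averages over topics, and the distance from origin is taken as Euclidean. The function name evaluate is hypothetical.

```python
import math
from collections import Counter


def evaluate(cluster_labels, topic_labels):
    """Average miss, false alarm, precision, recall and F over topics."""
    n = len(topic_labels)
    n_ij = Counter(zip(cluster_labels, topic_labels))
    n_i = Counter(cluster_labels)
    n_j = Counter(topic_labels)
    totals = dict.fromkeys(("miss", "fa", "precision", "recall", "f"), 0.0)
    for topic, size in n_j.items():
        # Match the topic to the cluster that contains most of its articles.
        cluster = max(n_i, key=lambda c: n_ij[(c, topic)])
        hit = n_ij[(cluster, topic)]
        precision = hit / n_i[cluster]
        recall = hit / size
        off_topic = n - size
        totals["miss"] += (size - hit) / size
        totals["fa"] += (n_i[cluster] - hit) / off_topic if off_topic else 0.0
        totals["precision"] += precision
        totals["recall"] += recall
        if precision + recall:
            totals["f"] += 2 * precision * recall / (precision + recall)
    scores = {name: value / len(n_j) for name, value in totals.items()}
    scores["distance"] = math.hypot(scores["miss"], scores["fa"])
    return scores


perfect = evaluate([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = evaluate([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"])
```

In the second call one article of topic "a" lands in the cluster matched to topic "b", which lowers recall for "a" and precision for "b" while leaving the other two per-topic values at 1.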
(3) Analysis of results
1) Effect of the text preprocessing method on topic detection: the general preprocessing strategy (word segmentation, stop-word removal, tf-idf weighting) is compared with the same strategy after adding identification and weighting of news elements and food-safety named entities. Using the XMSP clustering method described here, the topic detection results are as follows:
Table 2. Comparison of XMSP results after adding identification of news elements and named entities
2) Effect of introducing the clustering buffer: to control for the influence of the topic centroid update threshold (highTX) on this experiment, highTX and the topic aggregation threshold (lowTX) in the XMSP algorithm are set to the same value. In the topic aggregation stage, Single-Pass merges one text at a time, so its granularity is fine, and its threshold TX is varied between 0.02 and 0.24. When TX is 0.24, clustering already yields 146 clusters, and larger values of TX would only produce more, so larger values are not reported. XMSP differs from Single-Pass in that it merges a whole initial cluster with an existing topic, so its granularity is coarser, and it achieves its best results when lowTX (with lowTX=highTX) is between 0.33 and 0.35. Table 3 compares the metrics of the two algorithms at their best settings. This verifies the effectiveness of adding the clustering buffer to the XMSP algorithm.
Table 3. Best results of the XMSP algorithm with the clustering buffer compared with the Single-Pass algorithm
3) Effect of introducing the topic centroid update threshold:
In this experiment the topic aggregation threshold lowTX of the XMSP algorithm is first fixed at some value, and the topic centroid update threshold highTX is then varied to observe how introducing this threshold affects the XMSP algorithm. Tables 4 and 5 show the XMSP results for different highTX values with lowTX=0.33 and lowTX=0.42 respectively.
Table 4. XMSP algorithm results (lowTX=0.33)
Table 5. XMSP algorithm results (lowTX=0.42)
Tables 4 and 5 show that introducing the topic centroid update threshold highTX and tuning its value lets the XMSP algorithm achieve better results. With lowTX=0.33, XMSP performs best when highTX is between 0.45 and 0.47; with lowTX=0.42, it performs best when highTX is between 0.46 and 0.52. This confirms the effectiveness of introducing the topic centroid update threshold.
This embodiment preprocesses online news texts related to food safety: it extracts the essential elements of the news (time, place, event, person) and assigns them weights according to their importance, which improves the accuracy of feature extraction. The most distinctive characteristic of food safety news is that its content centers on food safety; that is, the keywords of a food safety news item consist mainly of specialized terms from the food safety field (food safety domain named entities), and these named entities play a very important role in expressing the information the news contains. Identifying the food safety named entities that appear in the news and assigning them corresponding weights therefore improves the quality of text preprocessing more effectively.
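The weighting idea can be illustrated with a minimal Python sketch. It assumes the news elements and domain named entities have already been extracted, and it simplifies the per-term weights W1i and W2i of the claims to two scalar constants; the function name `vectorize`, the constant values, and the sample terms are all hypothetical.

```python
from collections import Counter

# Hypothetical scalar weights. W2 applies to a term that occurs in both the
# headline and the body; W1 applies otherwise. The claims require W1 < W2.
W1 = 1.0
W2 = 2.0


def vectorize(title_terms, body_terms, vocabulary):
    """Return a weighted occurrence-count vector over `vocabulary`.

    title_terms and body_terms are lists of already extracted news elements
    and domain named entities (extraction itself is outside this sketch).
    """
    counts = Counter(title_terms) + Counter(body_terms)
    in_title = set(title_terms)
    in_body = set(body_terms)
    vector = []
    for term in vocabulary:
        # Terms found in both headline and body receive the larger weight.
        weight = W2 if (term in in_title and term in in_body) else W1
        # Weight times number of occurrences in the news item.
        vector.append(weight * counts[term])
    return vector


vocab = ["melamine", "Beijing", "recall"]
print(vectorize(["melamine"], ["melamine", "melamine", "Beijing"], vocab))
```

With these sample inputs the call prints `[6.0, 1.0, 0.0]`: "melamine" occurs three times and appears in both headline and body, "Beijing" occurs once in the body only, and "recall" does not occur.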
(2) When applied to topic event detection, traditional clustering methods are prone to topic drift and poor clustering results. The traditional Single-Pass algorithm is a classic incremental clustering algorithm: it reads one news item at a time, in order of arrival, and clusters incrementally. This processing causes a problem. In the dynamic clustering stage each text is handled in isolation, with no other text available as a reference during feature extraction, so the centroid of each topic can vary greatly depending on the order in which texts are read, which degrades the clustering results. In addition, when merging texts with topics, Single-Pass assigns each text to a topic using a single predefined threshold, which easily leads to topic drift. The topic detection method of this embodiment introduces a cluster buffer, so that the texts arriving in a given quantity or within a given period are first clustered with the X-means algorithm, and introduces a dual-threshold scheme (a topic aggregation threshold and a topic centroid update threshold). Together these effectively control topic drift and improve the clustering results. The method outperforms the classic Single-Pass algorithm on every evaluation metric and identifies food-safety-related topics more accurately.
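The order sensitivity of Single-Pass can be seen in a small Python sketch of the baseline. This is an illustration under assumptions, not the implementation evaluated in the experiments: the running-mean centroid update, the function names, and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, tx):
    """Assign each vector, in arrival order, to its most similar topic,
    or open a new topic when the best similarity is below the threshold tx."""
    topics = []  # each topic: {"centroid": [...], "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is None or best_sim < tx:
            topics.append({"centroid": list(vec), "members": [idx]})
        else:
            best["members"].append(idx)
            n = len(best["members"])
            # Assumed update rule: running mean, so the centroid moves on
            # every merge and depends on the order of arrival.
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], vec)]
    return topics


a, b, c = [1, 0], [1, 1], [0, 1]
print(len(single_pass([a, b, c], 0.4)))  # order a, b, c
print(len(single_pass([a, c, b], 0.4)))  # order a, c, b
```

With these toy vectors and a threshold of 0.4, reading the texts in the order a, b, c yields one topic, while the order a, c, b yields two, even though the set of texts is the same.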
Claims (5)
1. An online news topic detection method, characterized in that it comprises:
Initialization: preset the maximum number of clusters maxNumClusters and the minimum number of clusters minNumClusters, the topic aggregation threshold lowTX, the topic centroid update threshold highTX, the initial class set ClusterSet, the initial class centroid list CentroidList_Cluster, the topic set TopicSet, and the topic centroid list CentroidList_Topic;
Stage one, initial static clustering:
S1. Preprocessing: in the chronological order of news publication, read the news published within a unit of time, or a unit quantity of news, preprocess these news texts, and vectorize them;
S2. Perform initial static clustering on the newly read news texts using the X-means algorithm;
S3. Store the initial classes obtained by static clustering in the initial class set ClusterSet, compute the centroid of each initial class, and add it to the initial class centroid list CentroidList_Cluster;
Stage two, dynamic clustering:
S4. Take one initial class centroid from the initial class centroid list CentroidList_Cluster, compute its similarity to each topic centroid in the topic centroid list CentroidList_Topic, and record the maximum similarity value and the corresponding topic;
S5. If the maximum similarity value is less than the topic aggregation threshold lowTX, create a new topic, merge the initial class into the new topic, and add the centroid of the initial class to the topic centroid list CentroidList_Topic as the centroid of the new topic; if the similarity value is greater than or equal to the topic aggregation threshold lowTX, merge the initial class into the topic with the maximum similarity; if the similarity value is greater than or equal to the topic centroid update threshold highTX, update the centroid of that topic after merging;
S6. Delete the initial class and its centroid from the initial class set ClusterSet and the initial class centroid list CentroidList_Cluster, respectively;
S7. Return to step S4 until the initial class centroid list CentroidList_Cluster is empty, which completes one round of clustering;
S8. Wait for the news texts of the next unit of time or unit quantity, then return to step S1.
2. The online news topic detection method according to claim 1, characterized in that the preprocessing of step S1 is specifically: perform word segmentation and part-of-speech tagging on the news published within the unit of time or on the unit quantity of news; extract the news elements and domain named entities; assign different weights W1i according to the positions at which the news elements and domain named entities occur, where i denotes the i-th member of the set of news elements and domain named entities; multiply the weight W1i by the number of times the corresponding news element or domain named entity occurs in the online news to obtain the weight of each news element or domain named entity in the text; and vectorize the news text; wherein a domain named entity is a specialized term of the domain in which topics are to be detected.
3. The online news topic detection method according to claim 2, characterized in that the preprocessing of step S1 further comprises: identifying the news elements or domain named entities that occur in both the news headline and the body, and assigning them a further weight W2i; the weights W1i and W2i are each multiplied by the number of times the corresponding news element or domain named entity occurs in the online news to obtain the weight of each news element or domain named entity in the text, and the news text is vectorized, where W1i < W2i.
4. The online news topic detection method according to claim 3, characterized in that the preprocessing of step S1 further comprises feature selection: sort all weights W1i and W2i by magnitude, select the T features with the largest weights, multiply the weights of these T features by the number of times the corresponding news element or domain named entity occurs in the online news to obtain the weight of each news element or domain named entity in the text, and vectorize the news text.
5. The online news topic detection method according to claim 4, characterized in that, in step S4, the similarity between an initial class centroid and a topic centroid is computed as the cosine of the angle between the two vectors of the initial class centroid and the topic centroid; the larger the cosine, the greater the similarity.
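The dynamic clustering stage of claim 1 (steps S4 to S7), using the cosine similarity of claim 5, can be sketched in Python as follows. The claims do not say how a topic centroid is recomputed, so this sketch assumes the mean of all initial-class centroids merged into the topic; the function names and sample vectors are hypothetical, and the threshold values are taken from the experiments (lowTX = 0.33, highTX = 0.45).

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (claim 5)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dynamic_stage(centroid_list_cluster, centroid_list_topic, topic_set,
                  low_tx, high_tx):
    """Merge every initial-class centroid into the topics, modifying the
    three lists in place. topic_set[i] holds the centroids merged into topic i."""
    while centroid_list_cluster:                  # S7: repeat until empty
        c = centroid_list_cluster.pop(0)          # S4 and S6: take and remove
        best, best_sim = -1, -1.0
        for i, t in enumerate(centroid_list_topic):
            sim = cosine(c, t)
            if sim > best_sim:
                best, best_sim = i, sim
        if best < 0 or best_sim < low_tx:         # S5: below lowTX, new topic
            topic_set.append([c])
            centroid_list_topic.append(list(c))
        else:                                     # S5: at least lowTX, merge
            topic_set[best].append(c)
            if best_sim >= high_tx:               # S5: at least highTX, update
                members = topic_set[best]
                # Assumed update rule: mean of the merged centroids.
                centroid_list_topic[best] = [sum(col) / len(members)
                                             for col in zip(*members)]
    return topic_set, centroid_list_topic


clusters = [[1, 0], [1, 0.2], [1, 3], [0, 1]]
topics, centroids = dynamic_stage(clusters, [], [], 0.33, 0.45)
print(len(topics), centroids)
```

In this run the second centroid is similar enough to update the first topic's centroid, the third falls between the two thresholds and is merged without moving the centroid, and the fourth falls below lowTX and opens a second topic. Keeping the centroid fixed for borderline merges is what the dual threshold uses to limit topic drift.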
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510039493.4A CN104715014B (en) | 2015-01-26 | 2015-01-26 | A kind of online topic detecting method of news |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104715014A CN104715014A (en) | 2015-06-17 |
CN104715014B true CN104715014B (en) | 2017-10-10 |
Family
ID=53414341
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510039493.4A Expired - Fee Related CN104715014B (en) | 2015-01-26 | 2015-01-26 | A kind of online topic detecting method of news |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104715014B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105488092B (en) * | 2015-07-13 | 2018-05-22 | 中国科学院信息工程研究所 | A kind of time-sensitive and adaptive sub-topic online test method and system |
CN105468669B (en) * | 2015-10-13 | 2019-05-21 | 中国科学院信息工程研究所 | A kind of adaptive microblog topic method for tracing merging customer relationship |
CN105320646A (en) * | 2015-11-17 | 2016-02-10 | 天津大学 | Incremental clustering based news topic mining method and apparatus thereof |
CN108021579B (en) * | 2016-10-28 | 2021-10-15 | 上海优扬新媒信息技术有限公司 | Information output method and device |
CN108062319A (en) * | 2016-11-08 | 2018-05-22 | 北京国双科技有限公司 | A kind of real-time detection method and device of new theme |
CN108170671A (en) * | 2017-12-19 | 2018-06-15 | 中山大学 | A kind of method for extracting media event time of origin |
CN108334628A (en) * | 2018-02-23 | 2018-07-27 | 北京东润环能科技股份有限公司 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
CN109657164B (en) * | 2018-12-25 | 2020-07-10 | 广州华多网络科技有限公司 | Method, device and storage medium for publishing message |
CN109902230A (en) * | 2019-02-13 | 2019-06-18 | 北京航空航天大学 | A kind of processing method and processing device of news data |
CN110866555A (en) * | 2019-11-11 | 2020-03-06 | 广州国音智能科技有限公司 | Incremental data clustering method, device and equipment and readable storage medium |
CN111324725B (en) * | 2020-02-17 | 2023-05-16 | 昆明理工大学 | Topic acquisition method, terminal and computer readable storage medium |
CN113806528A (en) * | 2021-07-07 | 2021-12-17 | 哈尔滨工业大学(威海) | Topic detection method and device based on BERT model and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101488150A (en) * | 2009-03-04 | 2009-07-22 | 哈尔滨工程大学 | Real-time multi-view network focus event analysis apparatus and analysis method |
CN101571853A (en) * | 2009-05-22 | 2009-11-04 | 哈尔滨工程大学 | Evolution analysis device and method for contents of network topics |
EP2397985A1 (en) * | 2010-06-15 | 2011-12-21 | Honeywell International Inc. | System for multi-modal data mining and organization via elements, clustering and refinement |
CN102831193A (en) * | 2012-08-03 | 2012-12-19 | 人民搜索网络股份公司 | Topic detecting device and topic detecting method based on distributed multistage cluster |
CN103745000A (en) * | 2014-01-24 | 2014-04-23 | 福州大学 | Hot topic detection method of Chinese micro-blogs |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070260586A1 (en) * | 2006-05-03 | 2007-11-08 | Antonio Savona | Systems and methods for selecting and organizing information using temporal clustering |
Non-Patent Citations (1)
Title |
---|
An incremental text clustering algorithm for web topic discovery; Yin Fengjing et al.; Application Research of Computers; 2011-01-31; Vol. 28 (No. 1); pp. 54-57 * |
Also Published As
Publication number | Publication date |
---|---|
CN104715014A (en) | 2015-06-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171010 |