
CN105320646A - Incremental clustering based news topic mining method and apparatus thereof

Info

Publication number: CN105320646A
Application number: CN201510788690.6A
Authority: CN
Inventors: 于瑞国; 喻梅; 谢晓东; 杨龙; 赵满坤; 徐天一
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2016-02-10

Abstract

The present invention discloses an incremental clustering based news topic mining method and an apparatus thereof. The mining method comprises: performing preprocessing on an input text; performing feature extraction on the preprocessed text, and creating text representation models; calculating a similarity degree of the text representation models, and according to the similarity degree, performing topic clustering; ranking clustering results to obtain a Chinese ranking result of hot topics; combining a machine translation from English to Chinese, and acquiring an English ranking result of the hot topics; and performing weighting on the Chinese ranking result and the English ranking result, and acquiring a final ranking of the hot topics. The mining apparatus comprises: an establishing module, a clustering module, a first acquiring module, a second acquiring module and a third acquiring module. The incremental clustering based news topic mining method and the apparatus thereof provided by the present invention can be used to help a network news user solve a problem of information overloading, provide an information basis for security decision-making of an Internet supervision department, and are beneficial to promote considerable development and progress of society.

Description

Incremental clustering based news topic mining method and apparatus thereof

Technical field

The present invention relates to the fields of data mining, natural language processing, and machine learning, and in particular to an incremental clustering based news topic mining method and an apparatus thereof.

Background art

With the rapid development of Internet technology, the volume of online information is growing exponentially, and the Internet has become the public's main source of information. The problem is no longer a lack of information; on the contrary, information overload has become a serious problem. Research on topic detection and tracking (TDT) started earlier abroad than in China. It was first initiated in the United States in 1997, with participation from well-known scholars at leading universities such as Carnegie Mellon University (CMU); this early work produced initial results and valuable experience. Domestic TDT research started later, and the TDT system evaluation conference introduced Chinese starting in 1999. In 2003, Li Baoli and Yu Shiwen of Peking University surveyed the state of research and the main tasks in topic detection and tracking, and described the tasks and principal technical approaches of TDT.

At present, domestic research on topic discovery systems concentrates mainly on microblogs and the major BBS forums, and mostly analyzes the sentiment of microblog or forum users; research on hot-topic discovery systems for Web news is comparatively scarce. Moreover, although many clustering methods exist, in the field of hot-topic discovery for Web news there is still no clustering algorithm that balances efficiency and accuracy.

That is, most existing clustering algorithms perform online clustering of news reports and focus on algorithmic efficiency. Although they improve the time complexity of clustering, their accuracy is unsatisfactory.

Summary of the invention

The present invention provides an incremental clustering based news topic mining method and an apparatus thereof, which improve the accuracy of news topic mining, as described below:

An incremental clustering based news topic mining method comprises the following steps:

preprocessing an input text; performing feature extraction on the preprocessed text, and establishing a text representation model;

calculating the similarity between text representation models, and performing topic clustering according to the similarity;

ranking the clustering results to obtain a Chinese ranking result of hot topics;

combining machine translation from English to Chinese to obtain an English ranking result of the hot topics; and

weighting the Chinese ranking result and the English ranking result to obtain a final ranking of the hot topics.

The step of performing feature extraction on the preprocessed text and establishing a text representation model is specifically:

representing the preprocessed text in a form that a computer can process and that reflects the features of the document; and

establishing the text representation model of the preprocessed text using the vector space model method.

The step of calculating the similarity between text representation models and performing topic clustering according to the similarity is specifically:

taking the document as the unit, calculating the angle between the document vector and each topic vector, and the corresponding similarity; if the topic set is not empty, calculating the angles between the report and all topics in the topic set, and denoting the minimum angle as Smax;

if Smax is less than threshold T2, adding the report to that topic and updating the topic's feature words and weights with the report; or,

if Smax is not less than threshold T2 and is greater than threshold T1, creating a new topic in the topic set; or,

if Smax is between thresholds T2 and T1, adding the report to the topic corresponding to Smax.

An incremental clustering based news topic mining apparatus comprises:

an establishing module, configured to preprocess an input text, perform feature extraction on the preprocessed text, and establish a text representation model;

a clustering module, configured to calculate the similarity between text representation models and perform topic clustering according to the similarity;

a first acquiring module, configured to rank the clustering results and obtain a Chinese ranking result of hot topics;

a second acquiring module, configured to combine machine translation from English to Chinese and obtain an English ranking result of the hot topics; and

a third acquiring module, configured to weight the Chinese ranking result and the English ranking result and obtain a final ranking of the hot topics.

The beneficial effects of the technical solution provided by the present invention are as follows. The invention applies topic detection and tracking technology, which is widely used in areas such as online public opinion monitoring, Internet financial analysis, online forum monitoring, and network information security. From the large and bewildering volume of information collected from various sources, data mining analysis can produce a clear picture of the hot topics that attract public attention and can identify emergencies. This helps online news users cope with information overload, provides an information basis for security decisions by Internet regulators, and benefits the development and progress of society.

Brief description of the drawings

Fig. 1 is a flowchart of an incremental clustering based news topic mining method;

Fig. 2 is a bar chart comparing the accuracy rate of this method with that of the conventional incremental clustering algorithm;

Fig. 3 is a bar chart comparing the recall rate of this method with that of the conventional incremental clustering algorithm;

Fig. 4 is a bar chart comparing the F value of this method with that of the conventional incremental clustering algorithm;

Fig. 5 is a schematic structural diagram of an incremental clustering based news topic mining apparatus.

In the drawings, the components denoted by the reference numerals are as follows:

1: establishing module; 2: clustering module;

3: first acquiring module; 4: second acquiring module;

5: third acquiring module.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in further detail below.

The main task of hot-topic detection is to obtain hot news from text news feeds and to group news reports by the hot event to which they belong. The techniques applied are mainly those of text information processing: Chinese word segmentation, text feature extraction, text similarity calculation, clustering algorithms, and so on. The embodiments of the present invention therefore start from the text processing techniques directly involved. The main work covers the following aspects:

First, Chinese word segmentation. Statistical segmentation methods offer high segmentation accuracy and few ambiguous results, but the result depends to some extent on the size of the training corpus. An emerging Internet word can be recognized only if it appears in the training corpus; to improve recognition of such words, the training corpus must be enlarged at greater cost, which in turn reduces segmentation efficiency to some extent. Dictionary-based mechanical segmentation is efficient and easy to implement, but because it ignores the relationship between a word and its context, its accuracy is lower than that of statistical methods. A simple combination of the two achieves satisfactory accuracy and efficiency, and makes the improved segmentation result better suited to the subsequent hot-topic detection task.

Second, improvement of the clustering algorithm. To meet the needs of hot-topic detection and to overcome the weaknesses of the classical single-pass method (Single-Pass) in processing online text, namely sensitivity to input order and low accuracy, the embodiments of the present invention propose an improved method based on the classical Single-Pass clustering algorithm. It strikes a balance between efficiency and accuracy: compared with the traditional Single-Pass algorithm, it sacrifices some efficiency in exchange for a higher accuracy rate and recall rate.

Finally, the topic clustering results are analyzed to obtain the popularity ranking of news over a period of time. Using machine translation, reports about China from foreign news portals are translated into Chinese text; after preprocessing, clustering yields a ranking of foreign news, which is then merged into the domestic news ranking by a weighting method. The differences between domestic and foreign attention to news events are analyzed, along with, to some extent, the underlying causes of those differences.

The weighting method is as follows:

Q_{j} = \alpha \frac{Top_{j}}{\sum_{i} Top_{i}} + \beta \frac{Topic_{j}}{\sum_{i} Topic_{i}}

where Q_j is the hot value of topic j; Top_i is the number of reports on Chinese topic i; Topic_i is the number of reports on topic i in English media; Top_j is the number of reports on Chinese topic j; and Topic_j is the number of reports on topic j in English media. In the embodiments of the present invention, α and β both take the value 1/2.
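The weighting formula can be illustrated with a minimal Python sketch. The function names `hot_values` and `rank_topics` and the dictionary-based inputs are hypothetical and do not come from the patent; treating a topic that is missing from one language as having zero reports there is also an assumption of this sketch.

```python
def hot_values(chinese_counts, english_counts, alpha=0.5, beta=0.5):
    """Combine per-topic report counts from Chinese and English media.

    chinese_counts / english_counts map a topic label to its report count
    (Top_j and Topic_j in the formula). A topic missing from one side is
    treated as having zero reports there (an assumption of this sketch).
    """
    total_cn = sum(chinese_counts.values())
    total_en = sum(english_counts.values())
    scores = {}
    for topic in set(chinese_counts) | set(english_counts):
        cn_share = chinese_counts.get(topic, 0) / total_cn if total_cn else 0.0
        en_share = english_counts.get(topic, 0) / total_en if total_en else 0.0
        scores[topic] = alpha * cn_share + beta * en_share
    return scores


def rank_topics(scores):
    # Highest hot value first; ties are broken by topic label for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


if __name__ == "__main__":
    scores = hot_values({"a": 30, "b": 10}, {"a": 10, "b": 10})
    print(rank_topics(scores), scores)
```

With α = β = 1/2, each language contributes half of the final hot value, so the hot values of all topics sum to 1 whenever both sides have at least one report.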

Embodiment 1

An embodiment of the present invention provides an incremental clustering based news topic mining method. Referring to Fig. 1, the method comprises the following steps:

101: preprocess the input text;

The preprocessing step performs Chinese word segmentation on the input text. Specifically, conditional random field (CRF) segmentation is combined with segmentation based on a dictionary of emerging words. CRF segmentation labels Chinese characters, that is, it builds words from characters, taking into account both the frequency of characters and words and the surrounding context. Then, using a dictionary of emerging words, proper nouns that were split apart in the CRF segmentation result but appear in the dictionary are joined back together.

102: perform feature extraction on the preprocessed text, and establish a text representation model of the preprocessed text;

Establishing the text representation model means representing the preprocessed text in a form that a computer can process and that reflects the features of the document, so that the similarity between data objects can be calculated.

Specifically, this step uses the vector space model method to establish the representation model of the preprocessed text:

V(d) = (t_{1}, w_{1}(d); t_{2}, w_{2}(d); \ldots; t_{m}, w_{m}(d))

where V(d) is the vector space model representation of document d; t_i (i = 1, 2, 3, ..., m) are the feature words in document d; and w_i(d) (i = 1, 2, 3, ..., m) is the weight of feature word t_i. The preprocessed text is represented by a weight vector composed of feature words, so every document is expressed as a vector. There are many ways to determine feature word weights, for example term frequency (TF) and TF-IDF.

103: calculate the similarity between the text representation models of the preprocessed texts. The conventional approach uses the classical single-pass method (Single-Pass) for topic clustering, dividing a set into different clusters such that objects within the same cluster are highly similar and objects in different clusters have very low similarity;

The embodiment of the present invention instead improves the incremental clustering algorithm by adding an update selector, giving the ICCQ algorithm. This mitigates the "butterfly effect" of errors in single-pass clustering: only results that are fully confirmed from the algorithm's point of view are extracted, and only those data are used to update a topic's feature words and weights. A waiting queue is also added. Together these changes markedly improve accuracy.

104: rank the clustering results to obtain the Chinese ranking result of hot topics;

105: address the single-perspective problem of news coverage by means of machine translation.

The previous step yields only the Chinese ranking of hot topics. To address the problem that the selected reports reflect a single perspective, existing machine translation software is used to translate foreign-language news automatically, and the translated news is clustered into hot topics. Keywords are extracted and matched against the keywords of the Chinese news topics, and the results are combined into a hot-topic ranking that reflects multiple perspectives.

In summary, through steps 101 to 105 above, the embodiment of the present invention improves the accuracy of news topic mining, helps online news users cope with information overload, provides an information basis for security decisions by Internet regulators, and benefits the development and progress of society.

Embodiment 2

The scheme of Embodiment 1 is described in detail below with reference to specific formulas and examples:

201: perform word segmentation on the input text;

In practice, the Chinese segmentation results of CRF segmentation show that CRF tends to split emerging proper nouns apart. Using a dictionary of emerging words, the embodiment of the present invention joins back together proper nouns that were split in the segmentation result but appear in the dictionary, which considerably improves the accuracy of Chinese word segmentation.

202: delete stop words from the segmentation result;

Documents usually contain many words such as the pronouns "you" and "he", modal particles, and prepositions. These occur very frequently as common auxiliary words in sentences but usually carry very little information; such words are called stop words. In practice, stop words are removed by choosing a suitable stop-word dictionary and filtering out pronouns, function words, and uninformative prepositions. Because these words represent a document poorly yet occur frequently, filtering them out both improves overall running efficiency and improves the subsequent text feature extraction.

Steps 201 and 202 above implement the preprocessing of the input text. Other preprocessing methods may also be used; the embodiment of the present invention places no limitation on this.
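Steps 201 and 202 can be sketched in Python as follows. This is only an illustration under stated assumptions: the token list stands in for the output of a CRF segmenter (which is not implemented here), the greedy longest-match merge strategy and the `max_span` parameter are hypothetical choices, and the example dictionary entries are invented for the demonstration.

```python
def merge_new_words(tokens, new_word_dict, max_span=4):
    """Rejoin adjacent tokens whose concatenation is in the new-word dictionary.

    Greedy longest match from left to right; max_span (hypothetical) limits
    how many consecutive tokens may be merged into one word.
    """
    merged = []
    i = 0
    while i < len(tokens):
        match_len = 1
        # Try the longest candidate first so longer dictionary words win.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            if "".join(tokens[i:i + span]) in new_word_dict:
                match_len = span
                break
        merged.append("".join(tokens[i:i + match_len]))
        i += match_len
    return merged


def remove_stop_words(tokens, stop_words):
    # Keep only tokens that are not in the stop-word dictionary.
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    crf_tokens = ["打", "车", "软件", "的", "补贴"]  # stand-in for CRF output
    tokens = merge_new_words(crf_tokens, {"打车软件"})
    print(remove_stop_words(tokens, {"的"}))
```

Merging before stop-word removal matters here: removing stop words first could delete a character that belongs to a dictionary word that the segmenter had split apart.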

203: perform feature extraction on the preprocessed text, and establish a text representation model;

That is, the vector space model (VSM) is used to represent text features, expressing the preprocessed text in a form that a computer can process and that appropriately reflects the features of the document. The vector space model uses non-binary values to represent feature word weights. In the vector space model, a text is represented by a weight vector composed of feature words, and every document is expressed as a vector.

Term frequency-inverse document frequency (TF-IDF) is used to calculate the weights of feature words, and the documents are then vectorized using TF-IDF. TF-IDF is the product TF*IDF, where TF is the term frequency and IDF is the inverse document frequency. TF is the frequency with which a word occurs in document d. The main idea of IDF is that the fewer documents contain word t, that is, the smaller n is, the larger the IDF, which indicates that t discriminates well between classes.

Suppose the number of documents in some class C that contain word t is m, and the number of documents in other classes that contain t is w. The total number of documents containing t is then n = m + w. When m is large, n is also large, and the IDF obtained from the IDF formula is small, suggesting that t discriminates poorly between classes. In fact, however, if a word occurs frequently in the documents of one class, it represents the text features of that class well; such a word should be given a higher weight and chosen as a feature word of that class to distinguish it from documents of other classes. This is the shortcoming of IDF.

Within a given document, term frequency is the frequency with which a given word occurs in that document. It is the word count (term count) normalized so as not to favor long documents. The importance of a word in a particular document can be expressed by formula (1):

TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)

where n_{i,j} is the number of times word i occurs in text j, and the denominator is the total number of occurrences of all words in text j; TF_{i,j} is the frequency of word i in text j; and k ranges over the words of text j.

Inverse document frequency (IDF) measures the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient, as in formula (2):

IDF_{i} = \log \frac{|D|}{|\{ j : t_{i} \in d_{j} \}|} \qquad (2)

where IDF_i is the inverse document frequency; |D| is the total number of texts in the corpus; |{j : t_i ∈ d_j}| is the number of texts containing word i; t_i is the feature word; and d_j is a text containing the feature word. Finally, the weight (TF-IDF value) of a word can be expressed by formula (3):

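TFIDF_{i} = TF_{i} \times IDF_{i} \qquad (3)

where TFIDF_i is the term frequency-inverse document frequency weight of word i, TF_i is the term frequency of the feature word, and IDF_i is its inverse document frequency.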

A high term frequency within a particular document, together with a low document frequency of the word across the whole document collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
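Formulas (1) to (3) can be made concrete with a short Python sketch. The function name `tf_idf_vectors` is hypothetical, the natural logarithm is an assumption (the text does not state the base), and the `top_k` truncation mirrors the top-100 feature word selection that the description uses as an example for the clustering step.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, top_k=100):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # |{j : t_i in d_j}|: number of documents containing each word.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights = {}
        for word, n in counts.items():
            tf = n / total                           # formula (1)
            idf = math.log(n_docs / doc_freq[word])  # formula (2), natural log assumed
            weights[word] = tf * idf                 # formula (3)
        # Keep the top_k highest-weighted feature words (ties broken by word).
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        vectors.append(dict(top))
    return vectors


if __name__ == "__main__":
    for vec in tf_idf_vectors([["a", "b", "a"], ["b", "c"]]):
        print(vec)
```

In the two-document example, the word "b" appears in every document, so its IDF is log(1) = 0 and its weight vanishes, which is the filtering of common words described above.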

204: calculate similarity. The similarity measure between text feature vectors is an important part of clustering, and the choice of similarity calculation method directly affects the accuracy of the clustering result;

Suppose there are two vectors d_1 = (a_1, a_2, ..., a_n) and d_2 = (b_1, b_2, ..., b_n), where n is the dimension of both vectors and a_i and b_i are the components of d_1 and d_2 respectively. Three similarity calculation methods are listed below. The embodiment of the present invention uses the second, cosine similarity, but in a specific implementation the embodiment places no limitation on this:

1) inner product similarity:

sim(d_{1}, d_{2}) = d_{1} \cdot d_{2} = \sum_{i=1}^{n} (a_{i} \times b_{i}) \qquad (4)

2) cosine similarity:

sim(d_{1}, d_{2}) = \frac{d_{1} \cdot d_{2}}{|d_{1}| \times |d_{2}|} = \frac{\sum_{i=1}^{n} (a_{i} \times b_{i})}{\sqrt{\sum_{i=1}^{n} a_{i}^{2} \times \sum_{i=1}^{n} b_{i}^{2}}} \qquad (5)

3) Jaccard similarity:

sim(d_{1}, d_{2}) = \frac{d_{1} \cdot d_{2}}{|d_{1}|^{2} + |d_{2}|^{2} - d_{1} \cdot d_{2}} = \frac{\sum_{i=1}^{n} (a_{i} \times b_{i})}{\sum_{i=1}^{n} a_{i}^{2} + \sum_{i=1}^{n} b_{i}^{2} - \sum_{i=1}^{n} (a_{i} \times b_{i})} \qquad (6)

where a_i and b_i are the i-th components of the corresponding vectors.
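The three similarity measures can be sketched in Python over plain lists of equal length. The function names are hypothetical, the Jaccard function follows the summed-denominator (extended Jaccard) form of formula (6), and returning 0.0 when a denominator is zero is an assumption of this sketch rather than something the text specifies.

```python
import math


def inner_product(d1, d2):
    # Formula (4): sum of element-wise products.
    return sum(a * b for a, b in zip(d1, d2))


def cosine_similarity(d1, d2):
    # Formula (5); returns 0.0 for a zero vector to avoid division by zero.
    norm = math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    return inner_product(d1, d2) / norm if norm else 0.0


def jaccard_similarity(d1, d2):
    # Formula (6): dot / (|d1|^2 + |d2|^2 - dot).
    dot = inner_product(d1, d2)
    denom = inner_product(d1, d1) + inner_product(d2, d2) - dot
    return dot / denom if denom else 0.0


if __name__ == "__main__":
    d1, d2 = [1, 0, 1], [1, 1, 0]
    print(inner_product(d1, d2), cosine_similarity(d1, d2), jaccard_similarity(d1, d2))
```

Unlike the inner product, the cosine measure is insensitive to vector length, which is one plausible reason to prefer it when news reports differ widely in size.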

205: the incremental clustering algorithm with an update selector and a circular queue, that is, the ICCQ algorithm;

The embodiment of the present invention uses threshold-based decisions: two threshold angles T1 and T2 (T1 > T2) are given, with corresponding threshold cosine values ClusterS and ClusterT. These two thresholds define a selection interval, so the angle between a report and the topic set falls into one of three cases: greater than T1, within the interval between T2 and T1, or less than T2.

1) Calculate the TF*IDF of the feature words in each news report, and take the 100 top-ranked feature words together with their TF*IDF values. Using these feature words and weights, treat each report document as a vector.

The top 100 is used here only as an example; in a specific implementation, the embodiment of the present invention places no limitation on this.

2) Taking the document as the unit, calculate the angle between the document vector and each topic vector, and the corresponding similarity. If the topic set is empty, treat the first report as a topic. If the topic set is not empty, calculate the angles between the report and all topics in the topic set and take the minimum; this minimum angle corresponds to the maximum cosine similarity between the report and the topics, which is denoted Smax.

3) Compare the minimum angle obtained in step 2) with the given thresholds:

If the minimum angle is less than threshold T2, that is, Smax is greater than threshold ClusterT, add the report to that topic and update the topic's feature words and weights with the report.

If the minimum angle is not less than T2, compare it with threshold T1. If the minimum angle is greater than T1, that is, Smax is less than threshold ClusterS, the document differs substantially from every existing topic in the topic set; the report is treated as a new topic, and a new topic is created in the topic set.

If the minimum angle is between thresholds T2 and T1, that is, Smax is greater than ClusterS and less than ClusterT, add the report to the topic with the minimum angle, but do not update that topic's feature words and weights.

4) Repeat the above steps until no unclustered documents remain.

The concept of the circular queue is as follows: reports that meet a certain condition are temporarily placed in a queue to wait for the clustering results of other documents. Because other documents update a topic's feature words and weights when they join it, the similarity between the same document and the same topic differs when calculated again. Introducing the circular queue both reduces, to some extent, the influence of document input order on the incremental clustering result and improves the accuracy rate and recall rate of clustering to some degree.
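The threshold logic and the queue can be sketched in Python as follows. This is an illustration under several assumptions, not the patented implementation: the text does not state which documents enter the queue, so the sketch assumes it is those whose Smax falls between ClusterS and ClusterT; the retry limit `max_rounds` and the rule of updating a topic by summing weights are likewise hypothetical. Documents are sparse `{feature_word: weight}` dictionaries, and thresholds are given as cosines with `cluster_s < cluster_t`.

```python
import math
from collections import deque


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def iccq_cluster(docs, cluster_s, cluster_t, max_rounds=2):
    """Returns a list of topics, each {"vector": {...}, "members": [doc indexes]}."""
    topics = []
    # Each queue entry is (doc index, number of times it has been deferred).
    queue = deque((i, 0) for i in range(len(docs)))
    while queue:
        i, rounds = queue.popleft()
        doc = docs[i]
        if not topics:
            topics.append({"vector": dict(doc), "members": [i]})
            continue
        smax, best = max(((cosine(doc, t["vector"]), t) for t in topics),
                         key=lambda pair: pair[0])
        if smax > cluster_t:
            # Confident match: join the topic and update its words and weights
            # (summing weights is an assumption of this sketch).
            best["members"].append(i)
            for word, w in doc.items():
                best["vector"][word] = best["vector"].get(word, 0.0) + w
        elif smax < cluster_s:
            # Far from every existing topic: start a new one.
            topics.append({"vector": dict(doc), "members": [i]})
        elif rounds < max_rounds and queue:
            # ASSUMPTION: uncertain documents wait at the back of the queue
            # while other documents may update the topics.
            queue.append((i, rounds + 1))
        else:
            # Join the nearest topic without updating its words and weights.
            best["members"].append(i)
    return topics


if __name__ == "__main__":
    docs = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}]
    for topic in iccq_cluster(docs, cluster_s=0.5, cluster_t=0.9):
        print(topic["members"], topic["vector"])
```

In the example, the second document is deferred once and is compared again after the third document has updated the first topic, which shows how the queue lets a borderline report be judged against a more complete topic vector.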

206: phrase-based statistical machine translation model.

(1) Preprocessing of experimental data

The experimental data required for machine translation are a training set, a development set, and a test set. Machine translation also requires a language model training file; the embodiment of the present invention uses the Sogou whole-web news corpus plus all of the monolingual Chinese data in the training set.

Data preprocessing is the first step of a statistical machine translation system; the training set, development set, test set, and language model file all undergo preprocessing. Preprocessing mainly comprises filtering of garbled characters; generalization of numbers, time words, and date words; and translation of numbers, time words, and date words.

(2) Word segmentation

English tokenization uses the open-source software Lucene. Chinese word segmentation uses the improved CRF Chinese word segmenter.

(3) Word alignment

Word alignment of the training set uses the open-source tool GIZA++ 1.0.7.

(4) Language model training

The embodiment of the present invention uses the official model training module of NiuTrans.

(5) Decoding

Decoding is the process of searching, given the model parameters and the sentence to be translated, for the translation result with the maximum probability (or minimum cost). As with many sequence labeling problems such as Chinese word segmentation, the decoding search can use branch-and-bound or heuristic depth-first search (A*). In general, the search algorithm first constructs a search network, that is, it fuses the sentence to be translated and its possible translation results into a weighted finite-state transducer (Weighted Finite State Transducer), and then searches for the optimal path over that network.

The decoder used in the experiments is the translation decoder provided by NiuTrans. It is used to decode and translate the reports about China captured from English news websites.

Steps (1) to (5) above are well known to those skilled in the art, and other phrase-based machine translation models may also be used; they are not described further here.

(6) Topic detection

After the translations of the foreign-language news are obtained, the method clusters the news reports with the improved incremental clustering algorithm to obtain the set of topics with high report counts.

In summary, the embodiment of the present invention builds on existing mining methods and improves them, making the clustering results for hot topics more accurate. It then applies machine translation to the captured reports from foreign-language news websites about domestic hot news events and combines the two clustering results, obtaining a more objective topic popularity ranking that meets practical needs.

Embodiment 3

Compared with the traditional incremental clustering algorithm, this method (the improved algorithm) makes two improvements. First, texts joining a topic are divided into two cases: those that update the topic's feature words and weights, and those that are added directly without updating them. Second, a document to be classified is no longer assigned after a single calculation of its angle to the topic set; documents that meet the condition enter a queue and are calculated again in a later step.

Accuracy rate (that is, precision), recall rate, and F value are common metrics in information retrieval and statistical classification, widely used to evaluate the quality of results. In general, the accuracy rate is the proportion of retrieved items (for example documents or web pages) that are correct, and the recall rate is the proportion of all correct items that are retrieved.

Accuracy rate, recall rate, and F value are important evaluation metrics for selecting targets from a mixture of relevant and irrelevant items. They are defined as follows:

1. Accuracy rate = number of correct items extracted / number of items extracted

2. Recall rate = number of correct items extracted / number of correct items in the sample

Both values lie between 0 and 1; the closer the value is to 1, the higher the accuracy rate or recall rate.

3. F value = accuracy rate * recall rate * 2 / (accuracy rate + recall rate). The F value is the harmonic mean of the accuracy rate and the recall rate; the larger it is, the better the experimental result.
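The three definitions translate directly into a few lines of Python. The function name `precision_recall_f` and the set-based inputs are hypothetical; returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall_f(extracted, relevant):
    """extracted: set of items assigned to a topic; relevant: manually labeled set."""
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0  # accuracy rate
    recall = correct / len(relevant) if relevant else 0.0       # recall rate
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f_value


if __name__ == "__main__":
    print(precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```

Here three of the four extracted items are correct and three of the five relevant items are found, so the accuracy rate is 0.75, the recall rate is 0.6, and the F value lies between them at 2/3.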

For Chinese reports, the documents are first preprocessed. The key step is Chinese word segmentation; to make the final result better suited to topic detection for online news, the experiments use the improved CRF-based statistical Chinese word segmentation. The news documents are then vectorized: each news report is treated as a vector, and the angle between vectors determines the distance between reports. The improved incremental clustering algorithm clusters the news reports, the report counts of the top-ranked topics are tallied, and the popularity ranking of the news is obtained.

For English reports, apart from using a different open-source tokenizer in preprocessing, the English news reports are also machine translated to obtain the corresponding Chinese news. The same topic detection method is then applied to the translated Chinese news to obtain the popularity ranking.

The two sets of topic results are then matched by keyword and weighted to obtain the final topic popularity ranking.

In the experiment, 6000 news reports were sorted by publication time from earliest to latest. The original incremental clustering method and the improved incremental clustering algorithm were each used to cluster the news report documents, and the accuracy rate, recall rate, and F value of the detection results were calculated against manually annotated data.

Comparing accuracy rate, recall rate, and F value leads to the conclusion that, in settings where algorithmic efficiency is not critical, this method improves all three metrics to some degree. Although the algorithm's efficiency decreased in the experiment, the research focuses on offline topic detection with the day as the time unit. At that time granularity, the incremental clustering algorithm with the document selector and the waiting queue fully meets the efficiency requirement, so the method's improvement of accuracy rate, recall rate, and F value is worthwhile.

Table 1 shows the clustering results obtained with the original incremental clustering algorithm, and Table 2 shows the results obtained with the improved incremental clustering algorithm.

Table 1 Experimental results of the conventional incremental clustering algorithm

Table 2 Experimental results of this method

It can be seen that, with the improvement to the clustering algorithm, after this method adds the update selector and the pending queue, the precision and recall of the five topics improve to some extent. The clustering precision for the topic on the situation in Ukraine rises from 0.8876 to 0.9484, and the clustering precision for the topic on the Winter Olympics rises from 0.4966 to 0.8688. The other three selected topics (CBA, haze, and taxi-hailing software) also improve to varying degrees in both precision and recall.

The embodiment of the present invention then calculates the mean precision and mean recall over the five topics: the mean precision rises from 0.6148 to 0.8217, and the mean recall rises from 0.8123 to 0.8524. Comparing F values, the F values of the five topic sets all improve to varying degrees.

Embodiment 4

A news topic mining apparatus based on incremental clustering (see Fig. 5). The mining apparatus comprises:

An establishing module 1, for preprocessing the input text, performing feature extraction on the preprocessed text, and establishing text representation models;

A clustering module 2, for calculating the similarity between text representation models and performing topic clustering according to the similarity;

A first acquiring module 3, for ranking the clustering results to obtain the Chinese ranking result of hot topics;

A second acquiring module 4, for combining machine translation from English to Chinese to obtain the English ranking result of hot topics;

A third acquiring module 5, for weighting the Chinese ranking result and the English ranking result to obtain the final ranking of hot topics.

The clustering module 2 takes the document as the unit and calculates the angle between the document vector and the topic vector, and their similarity. If the topic set is not empty, it calculates the angle between the current report and every topic in the topic set, and the minimum angle is denoted Smax.
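The per-document step of the clustering module can be sketched as below. The smallest angle corresponds to the largest cosine similarity, which is why it is denoted Smax. The sketch covers only a plain single-pass assignment: the threshold `max_angle`, the rule of opening a new topic when the threshold is exceeded, and the topic vector being the sum of its member vectors are all assumptions, and the update selector and pending queue of the improved algorithm are not modelled.

```python
import math
from collections import Counter


def angle(vec_a, vec_b):
    """Angle in radians between two sparse term-count vectors."""
    dot = sum(w * vec_b.get(t, 0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return math.pi / 2
    return math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))


def assign_report(doc_vec, topics, max_angle=0.9):
    """Assign one report to the nearest topic, or open a new topic.

    topics is a list of Counter topic vectors, modified in place.
    max_angle is a hypothetical threshold in radians.
    Returns the index of the topic that received the report.
    """
    if topics:
        angles = [angle(doc_vec, topic_vec) for topic_vec in topics]
        s_max = min(angles)  # minimum angle, denoted Smax in the text
        nearest = angles.index(s_max)
        if s_max <= max_angle:
            topics[nearest].update(doc_vec)  # Counter.update adds the counts
            return nearest
    topics.append(Counter(doc_vec))
    return len(topics) - 1
```

Feeding reports in publication order through `assign_report` yields the incremental behaviour: early reports seed topics, and later reports either join the nearest topic or start a new one.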

In summary, the embodiment of the present invention builds on existing mining apparatus and improves on it, making the clustering results of hot topics more accurate. The embodiment of the present invention also machine-translates the reports on domestic hot news events captured from foreign-language news websites and combines the two clustering results, obtaining a relatively objective topic popularity ranking that meets the needs of practical applications.

Unless otherwise specified, the embodiment of the present invention places no limit on the model of each device, as long as the device can perform the functions described above.

Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A news topic mining method based on incremental clustering, characterized in that the mining method comprises the following steps:

2. The news topic mining method based on incremental clustering according to claim 1, characterized in that the step of performing feature extraction on the preprocessed text and establishing text representation models is specifically:

Establishing the representation models of the preprocessed text using the vector space model method.

3. The news topic mining method based on incremental clustering according to claim 1, characterized in that the step of calculating the similarity between text representation models and performing topic clustering according to the similarity is specifically:

4. A news topic mining apparatus based on incremental clustering, characterized in that the mining apparatus comprises: