CN109710728A

CN109710728A - Automatic news topic discovery method

Info

Publication number: CN109710728A
Application number: CN201811417992.2A
Authority: CN
Inventors: 崔莹; 代翔; 刘宇; 王侃; 彭易锦; 黄细凤; 丁洪丽; 杨露
Original assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Current assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2019-05-03
Anticipated expiration: 2038-11-26
Also published as: CN109710728B

Abstract

The invention discloses an automatic news topic discovery method that aims to improve the accuracy of news topic discovery. The technical scheme is as follows. Incremental clustering parameters and incremental clustering trigger parameters are set first, and incremental data is clustered batch by batch. Input text is preprocessed: the text encoding of the articles is unified, text features are calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is constructed. Within each batch, topic clustering is performed first, followed by hierarchical clustering within each topic. Then, for each single-point topic, the similarity to all clusters is calculated, i.e. the distance from each single point to each cluster centre, and the single point is merged into the most similar cluster. Clusters from different batches are agglomerated with one another, completing hierarchical clustering between topics. News topics are generated and new-cluster fusion is carried out: the centroids of new clusters are compared with the centroids of existing historical clusters, and the clustering result of the newly added data is then fused across batches with the existing clustering result.

Description

Automatic news topic discovery method

Technical field

The present invention relates to the field of text mining, and more particularly to the mining of hot news topics: a news topic discovery method based on multi-step hybrid incremental clustering.

Background art

Internet news is one of the most important types of information in Internet media and an important channel through which people obtain news. The Internet touches every aspect of life, web content is growing explosively, and news data has reached massive scale. Faced with such rapidly growing volumes of web content, ordinary users find it increasingly difficult to locate the information they need, while for government departments, discovering in a timely and accurate manner the hot topics that ordinary users care about, and managing online public opinion, has become an important discipline. Because the surge of news on the network is complex and highly redundant, searching for news topics manually involves a huge workload, and obtaining the required information from massive volumes of news poses a great challenge. Although some large domestic portals publish special news features on the hot topics and focus events that most ordinary users follow within a given period, at this stage such sites still rely mainly on manual screening and editing to place related news into a special feature. This approach has serious drawbacks: it cannot meet user needs effectively or in time, it often carries the personal views of the web editors, and it cannot reflect the truth of an event in a completely objective and neutral way. How to quickly and accurately obtain, from tens of billions of web pages, the popular information and hot topics that ordinary users follow, especially major events that have occurred recently, has therefore become a real need of ordinary users.

Several terms are used here. An event (Event) is a specific occurrence that happens at a particular time and place and involves particular people. A topic (Topic) consists of a sudden event together with the related events it triggers, so a topic can be regarded as a set of events. A report (Story) is a news report related to some event; reports mainly include Internet news articles, television news segments and radio news broadcasts. Segmenting news reports means cutting a stream of news reporting so that reports on different topics are divided into independent news report segments. The main task of topic detection (TD) is to find and identify newly occurring hot events and hot topics in a stream of news reports generated in real time. Classical topic detection and tracking systems are all built on converting the text of news reports into the vector space model (VSM), but the VSM can introduce noise when representing text, which brings errors into the detection of new events. Topic detection systems in actual use generally need to cluster news report documents incrementally in real time, and must be able to find newly occurring events and sudden major events quickly and accurately.

Compared with topic detection and tracking, the mining and automatic discovery of hot news topics has greater application value. Traditional topic discovery mainly targets long text and news data sets; large-scale short text is sparse, unstructured and noisy, so conventional methods have difficulty discovering topics in it effectively. Existing hot news topic discovery systems all have problems: the accuracy of focus event detection is low, misclassification is very common, and most systems simply list the news reports for a hot event without extracting feature words or summary information for it, which makes browsing and reading news inconvenient for users. The focus event topics on many portals are mostly edited and organised manually; this approach can only select the hot topics of a month or a year, cannot publish the latest unexpected events in time, and has poor real-time performance.

Topic discovery technology can help people quickly obtain information of interest from the abundant information resources on the Internet, and can help governments, enterprises and public institutions grasp the latest developments in public opinion in time. However, traditional text classification methods cannot be applied directly in a topic detection system. The news report documents handled by topic detection and tracking grow dynamically, so the number of news documents increases steadily over time. If text classification or clustering were applied directly to these dynamically growing news documents, the number of documents would become excessive and the time and space complexity too high. Existing text classification and clustering algorithms therefore need to be improved to suit the characteristics of topic detection and tracking.

Topic discovery automatically finds topics in massive news data, organises the data with topics as the index, and aggregates the content related to each topic. In essence, topic discovery uses incremental clustering to gather texts with similar content in massive text data under the same topic, so that texts within a class are highly similar in topic and texts in different classes have low topic similarity. Traditional incremental text clustering methods fall into two broad classes:

Methods of the first class re-cluster all data iteratively each time: at regular intervals, all the data is clustered again in one pass. The advantage is high precision; the disadvantages are that earlier clustering results cannot be reused, resources are wasted, and consistency between successive clustering results cannot be guaranteed. Methods of the second class reuse earlier clustering results: newly added data is assigned to the nearest existing cluster, and that cluster's centroid is recalculated. The advantage is that not all data has to be re-clustered each time; the disadvantages are that as clusters keep growing their centroids tend to drift, so topics lose continuity, and, because new data is only compared for similarity with existing clusters, new clusters cannot be generated, so the accuracy of the resulting topics is low. In addition, existing incremental clustering methods usually complete the clustering task with a single clustering method, which causes the following problems. Clustering is an unsupervised learning method and requires an initial number of clusters to be specified; conventional methods do not account for the similarity that may exist between clusters as a result of randomly selecting cluster centroids during initialisation, so data that clearly belongs to the same topic is divided among different topics. Furthermore, the single-point clusters generated during one-pass clustering are not fused any further, so many topics containing only a single text remain.

The present invention studies the application of a hybrid incremental clustering method to news topic discovery and analysis. It aims to overcome the above drawbacks of incremental clustering methods in the topic discovery process, and proposes a corresponding incremental clustering workflow for topic discovery and analysis.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of incremental clustering methods in the topic discovery process, and to provide an automatic news topic discovery method, based on hybrid incremental clustering, that organises and discovers topics in massive news data, preserves the continuity of existing topics, and improves the accuracy of topic discovery.

The above object of the invention can be achieved by the following measures. An automatic news topic discovery method is characterised by comprising the following steps. First, incremental clustering parameters and incremental clustering trigger parameters are set, and incremental data is clustered batch by batch. Within a batch, text preprocessing is applied to the input text: a batch of N articles is obtained, their text encoding is unified, Chinese word segmentation is performed, and special characters and stop words are removed; text features are then calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is constructed. Within the batch, topic clustering is performed first, followed by hierarchical clustering within each topic to subdivide the topics: a top-down divisive hierarchical clustering is applied to all the articles in each topic, gradually subdividing the articles in the topic into smaller and smaller clusters. Hierarchical clustering between topics then merges topics: a bottom-up agglomerative hierarchical clustering is applied to all the clusters obtained previously. The small clusters obtained above, both single points and non-single points, are then merged: for each single point, the more similar articles are selected from among the m articles of the batch. Single-point topics are next merged across batches: for each single-point topic, the similarity between its article and all clusters is calculated, i.e. the distance from each single point to each cluster centre, and the single point is merged into the most similar cluster whose similarity exceeds the threshold. Clusters are then merged across batches: a further bottom-up agglomerative hierarchical clustering is applied to the group of clusters produced by the independently processed batches, so that clusters from different batches are agglomerated with one another, completing hierarchical clustering between topics. The keywords of each clustering result are then sorted by weight, and the group of highest-weighted words is extracted to represent a news topic, generating the news topics. Finally, new-cluster fusion is carried out: the centroid of each new cluster is compared with the centroids of existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, and otherwise it is treated as a new cluster. The clustering result of the newly added data is thereby fused across batches with the existing clustering result.

The beneficial effects of the present invention are:

Unlike traditional topic discovery methods for news media, the present invention uses an incremental clustering method in which massive news data is divided into batches and processed in multiple clustering steps: clustering by batch, merging of clustering results within a batch, merging of single-point topics within a batch, and fusion of clustering results between batches. Clustering, merging and re-clustering in multiple steps improves resource utilisation and the accuracy of the topic discovery results. Because incremental data is clustered batch by batch, each clustering run can reuse the existing clustering results; not all data has to be clustered again each time, which saves resources and guarantees the consistency of the clustering results.

By first performing topic clustering within a batch and then hierarchical clustering within each topic, the invention subdivides topics and ensures that similarity between topics is low and their independence is strong. Hierarchical clustering between topics further solves the problem of a single topic being split as a result of the random selection of cluster centroids during clustering initialisation. Further fusion of the single-point clusters generated during one-pass clustering solves the problem of many topics containing only a single text. Fusing the clustering result of newly added data with the existing clustering result across batches ultimately improves topic accuracy and guarantees topic continuity.

Consistency of clustering results. By setting incremental clustering trigger parameters and clustering incremental data batch by batch, the invention allows each clustering run to reuse the existing clustering results; not all data has to be clustered again each time, which saves resources and guarantees the consistency of the clustering results.

Topic independence. By first performing topic clustering within a batch and then hierarchical clustering within each topic, the invention subdivides topics and ensures that similarity between topics is low and their independence is strong.

Topic splitting. Hierarchical clustering between topics solves the problem of a single topic being split as a result of the random selection of cluster centroids during clustering initialisation.

Single-point topics. Further fusion of the single-point clusters generated during one-pass clustering solves the problem of many topics containing only a single text. Fusing the clustering result of newly added data with the existing clustering result across batches ultimately improves topic accuracy and guarantees topic continuity.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of the automatic news topic discovery process of the present invention.

Fig. 2 is a schematic diagram of the text preprocessing flow in Fig. 1.

Fig. 3 is a flow diagram of an embodiment of the multi-step hybrid incremental clustering process of the present invention.

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments and the drawings.

Detailed description of the embodiments

Refer to Fig. 1. According to the invention, incremental clustering parameters and incremental clustering trigger parameters are first set, and incremental data is clustered batch by batch. Text preprocessing is applied to the input text: a batch of N articles is obtained, their text encoding is unified, Chinese word segmentation is performed, and special characters and stop words are removed; text features are then calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is constructed. It is then determined whether the amount of input text has reached the single-batch clustering text count batchsize or whether the elapsed time has reached the single-batch clustering time interval timeout; if the single-batch clustering time interval is reached before the single-batch text count, clustering starts automatically. Hybrid incremental clustering is applied to the input text: within a batch, topic clustering is performed first, followed by hierarchical clustering within each topic and then hierarchical clustering between topics, and the single-point clusters generated during one-pass clustering are fused further. News topics are then generated: the keywords of each clustering result are sorted by weight, and the group of highest-weighted words is extracted to represent a news topic. The centroid of each new cluster is compared with the centroids of existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, and otherwise it is treated as a new cluster. In this way the clustering result of the newly added data is fused across batches with the existing clustering result, and keywords are taken from the clustering result to represent each topic. Finally, it is determined whether processing should continue: if not, the process ends; if so, data continues to be received and the process returns to the text preprocessing of input text, looping until it ends. Further fusing the single-point clusters generated during one-pass clustering, and fusing the clustering result of newly added data with the existing clustering result across batches, ultimately improves topic accuracy and guarantees topic continuity.

The incremental clustering parameters include: the single-batch clustering text count batchsize; the single-batch clustering time interval timeout; the similarity threshold wordSimThreshold for hierarchical clustering within a topic; the similarity threshold wordSimThreshold for hierarchical clustering between topics; the similarity threshold innerBatchSPKNNThreshold for comparing a single point with existing clusters within a batch; and the similarity threshold crossBatchSPThreshold for comparing a single point with existing clusters across batches.
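As a rough illustration, the parameters and the batch trigger condition could be grouped as in the sketch below. The field names follow the parameters named in the text; the class name, the function name, the default values and the choice of seconds as the time unit are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncrementalClusteringParams:
    # Field names follow the parameters named in the text; every default
    # value below is an illustrative assumption, not a value from the source.
    batchsize: int = 1000                   # texts per single-batch clustering run
    timeout: float = 600.0                  # seconds between single-batch runs
    wordSimThreshold: float = 0.5           # hierarchical clustering similarity
    innerBatchSPKNNThreshold: float = 0.4   # single point vs. clusters, in batch
    crossBatchSPThreshold: float = 0.4      # single point vs. clusters, across batches


def should_start_clustering(pending_texts, seconds_waited, params):
    """Trigger a run when either the text count or the time interval is reached."""
    return pending_texts >= params.batchsize or seconds_waited >= params.timeout


params = IncrementalClusteringParams(batchsize=500)
```

With these assumed values, a run would start as soon as 500 texts are pending, or after 600 seconds even if fewer texts have arrived.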

The processing flow of the following embodiment uses these steps:

[Text preprocessing]: obtain a batch of N articles, unify their text encoding, perform Chinese word segmentation, remove special characters and stop words, calculate text features, and generate text feature vectors;

[Automatic clustering condition check]: determine whether the amount of input text has reached the single-batch clustering text count; if the single-batch clustering time interval is reached before the single-batch text count, clustering starts automatically;

[Clustering by batch]: the N articles are divided into several small batches, and in-batch clustering is applied to the data of each small batch. Suppose the current batch contains m articles. Topic clustering within the batch: the feature words of all articles in the small batch are clustered into topics, yielding a number of topics; the m articles are associated with these topics, the most relevant topic being selected for each article, and an article with no related topic independently forms a free topic. Hierarchical clustering within a topic, to subdivide topics: a top-down divisive hierarchical clustering is applied to all articles in each topic, refining the article topics on the basis of inter-article similarity, including title similarity, body-text similarity, named-entity similarity and non-named-entity similarity. Hierarchical clustering between topics, to merge topics: a further bottom-up agglomerative hierarchical clustering (equivalent to clustering across topics) is applied to all clusters obtained previously, which addresses the case where one topic has been split into several. Single-point topic merging, to reduce single-point topics: the small clusters obtained above, both single points and non-single points, are merged; for each single point, the more similar articles are selected from among the m articles, which solves the problem of too many single-point topics;

[Single-point topic merging between batches]: the clusters obtained for each batch can again be divided into two classes, single points and non-single points, and a similar process is applied once more. For each single-point topic, the similarity between its article and all clusters is calculated, i.e. the distance from the single point to each cluster centre, and the single point is merged into the most similar cluster whose similarity exceeds the threshold;

[Cluster merging between batches]: a further bottom-up agglomerative hierarchical clustering is applied to the results passed on by the independently processed batches (i.e. a group of clusters). This step is needed because the articles were divided into batches in step 3 without regard to their content, so clusters from different batches must be agglomerated with one another.

[Generating news topics]: sort the keywords of each clustering result by weight and extract the group of highest-weighted words to represent a news topic.

[New-cluster fusion]: compare the centroid of each new cluster with the centroids of existing historical clusters; if the threshold is met, merge the new cluster with the existing historical cluster; otherwise treat it as a new cluster.

Refer to Fig. 2. Because this embodiment uses jieba for word segmentation, and jieba segmentation requires the text being processed to be GBK-encoded, this embodiment converts all input data to GBK encoding and saves the result in UTF-8 format after segmentation. The text preprocessing flow provided by this embodiment preprocesses the input text and comprises the following steps:

All news documents are first initialised as one large cluster, which is then divided: the large cluster is split into small clusters until one of the clusters meets a preset threshold. Unify the text encoding: the input data is converted to a single Chinese character encoding such as UTF-8 or GBK, the actual encoding being set according to the file format required by the segmentation step. Perform Chinese word segmentation on the text with unified encoding: as stated above, this embodiment uses jieba with a custom dictionary added, splitting the text into individual words by part of speech. Remove stop words and special characters from the segmentation result: the segmented text contains many stop words and special characters (such as ",") that would affect the clustering result, so this step removes useless stop words and special characters by means of a stop-word dictionary and a special-character table. Extract feature words from the text on the basis of the preprocessing result of the previous step: this embodiment uses term frequency-inverse document frequency (TF-IDF) to extract feature words from each article. Construct text feature vectors on the basis of the extracted feature words, i.e. represent each text as a word vector for use in subsequent text similarity comparison: the words of each article are sorted by TF-IDF value and the top t words are selected to form the feature word list; the feature vector of each article is built by comparison with the feature word list, adding the value where the article contains the word and 0 where it does not. Determine whether the amount of input text has reached the parameter batchsize; if the time timeout is reached before batchsize, clustering starts automatically.
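The feature-word extraction and vector construction described above can be sketched as follows. This is a minimal illustration that assumes the articles have already been segmented and filtered, so each article is a list of tokens. The function names are hypothetical, and the particular TF-IDF variant (relative term frequency times log(N/df)) is an assumption, since the text does not specify one.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """docs: list of token lists (already segmented, stop words removed)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def build_feature_vectors(docs, t):
    """Return the feature word list and one feature vector per article."""
    scores = tfidf_scores(docs)
    # Feature word list: each article contributes its top-t words by TF-IDF.
    vocab = []
    for doc_scores in scores:
        top = sorted(doc_scores, key=lambda w: (-doc_scores[w], w))[:t]
        for word in top:
            if word not in vocab:
                vocab.append(word)
    # A vector holds the article's TF-IDF value for each feature word, else 0.
    vectors = [[doc_scores.get(w, 0.0) for w in vocab] for doc_scores in scores]
    return vocab, vectors


docs = [["earthquake", "rescue", "rescue"], ["election", "vote", "rescue"]]
vocab, vectors = build_feature_vectors(docs, 2)
```

Note that under this variant a word appearing in every article of the batch receives a weight of zero, which is one reason a smoothed IDF is often preferred in practice.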

Refer to Fig. 3. The detailed steps of the hybrid incremental clustering process provided by this embodiment are as follows:

Clustering by batch: the N articles are divided into several batches according to batchsize. Because in-batch clustering is applied to each batch of data, the number of articles in a batch needs to be limited (suppose the current batch contains m articles):

Topic clustering within a batch: the feature words of all articles in each batch are clustered into topics, which amounts to extracting topics from the content formed by the articles' feature words. This yields a number of topics; the m articles are associated with these topics, the most relevant topic being selected for each article, and an article with no related topic independently forms a free topic. In this embodiment the LDA algorithm is used to generate initial topics for the news texts in a batch. Words are used as feature items and a text is regarded as composed of feature words. Each text can contain several topics and corresponds to each topic with a different probability, i.e. each text corresponds to a topic probability distribution; each topic corresponds to a different word probability distribution, and each word of a text corresponds to some topic with a different probability. Each document d among the m articles of the batch can be regarded as a word sequence $w = \langle w_1, w_2, \ldots, w_n \rangle$, where $w_i$ denotes the i-th word in the text.

With T denoting the topics corresponding to the feature words of each text and the k topics written as $\langle t_1, t_2, \ldots, t_k \rangle$, the relation P(word | document) = P(word | topic) × P(topic | document) is used in training to obtain two vectors: for each document d, the probabilities of its corresponding to the different topics, $\theta_d = \langle p_{t_1}, p_{t_2}, \ldots, p_{t_k} \rangle$, and, for each topic, the probabilities of its generating the different words, $\langle p_{w_1}, p_{w_2}, \ldots \rangle$. This gives the document topic classification result, where $p_{t_i}$ denotes the probability that document d corresponds to the i-th topic and $p_{w_i}$ denotes the probability of generating the i-th word.
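The relation P(word | document) = P(word | topic) × P(topic | document) is stated above for a single topic. In the usual formulation of LDA, which is assumed here rather than taken from the source, the probability of a word in a document is obtained by summing that product over all k topics, and the topic probabilities of a document sum to one:

```latex
P(w \mid d) = \sum_{j=1}^{k} P(w \mid t_j)\, P(t_j \mid d),
\qquad
\theta_d = \langle p_{t_1}, \ldots, p_{t_k} \rangle,
\quad
\sum_{j=1}^{k} p_{t_j} = 1
```

Here $P(t_j \mid d) = p_{t_j}$ is the j-th component of $\theta_d$, and $P(w \mid t_j)$ is the entry for word w in the word distribution of topic $t_j$.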

During topic clustering the m articles are associated with these topics, and the most relevant topic is selected for each article. Some articles will not be associated with any topic: some keywords are filtered out because they are weak or because the keyword clustering parameters are unsuitable, leaving an article with no association to any of the above keywords. Such an article is considered to be associated with a free topic;
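One simple way to picture the association step is sketched below. It assumes each article already has a topic probability vector, and it uses a hypothetical cutoff `min_prob` to decide when no topic is relevant enough; the text itself describes the free topic as arising from keyword filtering, so the cutoff is an assumption for illustration only.

```python
def assign_topic(doc_topic_probs, min_prob):
    """Return the index of the most probable topic, or None for the free topic.

    min_prob is a hypothetical cutoff, not a parameter named in the text.
    """
    if not doc_topic_probs:
        return None
    best = max(range(len(doc_topic_probs)), key=lambda j: doc_topic_probs[j])
    return best if doc_topic_probs[best] >= min_prob else None
```

For example, `assign_topic([0.1, 0.7, 0.2], 0.3)` selects the second topic, while an article whose probabilities are all below the cutoff is left in the free topic.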

Hierarchical clustering within a topic, to subdivide topics: a top-down divisive hierarchical clustering is applied to all articles in each topic, gradually subdividing the articles in the topic into smaller and smaller clusters. The clustering process groups articles on the basis of inter-article similarity, including title similarity, body-text similarity, named-entity similarity and non-named-entity similarity. The similarity threshold is controlled by wordSimThreshold; texts that meet the similarity threshold are kept together as one cluster.
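A top-down split of this kind might look like the sketch below. It is not the patented procedure: it assumes a single feature vector per article compared by cosine similarity (instead of the combined title, body and entity similarities), and it uses one simple splitting rule, taking the least similar pair as seeds and assigning every other article to the nearer seed. The function names are hypothetical; `threshold` plays the role of wordSimThreshold.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def divisive_clusters(vectors, threshold):
    """Split article indices top-down until every pair in a cluster is similar."""
    def split(indices):
        # Find the least similar pair inside this cluster.
        worst, seeds = None, None
        for pos, i in enumerate(indices):
            for j in indices[pos + 1:]:
                sim = cosine(vectors[i], vectors[j])
                if worst is None or sim < worst:
                    worst, seeds = sim, (i, j)
        # Single article, or all pairs meet the threshold: keep as one cluster.
        if seeds is None or worst >= threshold:
            return [indices]
        left, right = [seeds[0]], [seeds[1]]
        for i in indices:
            if i in seeds:
                continue
            to_left = cosine(vectors[i], vectors[seeds[0]])
            to_right = cosine(vectors[i], vectors[seeds[1]])
            (left if to_left >= to_right else right).append(i)
        return split(left) + split(right)

    if not vectors:
        return []
    return split(list(range(len(vectors))))


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = divisive_clusters(vectors, 0.8)
```

Each split puts the two seeds in different groups, so the recursion always terminates.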

In this embodiment similarity is computed with the cosine similarity formula (1), which judges how similar two vectors are from the angle between the text vectors: the smaller the angle, the more similar the two articles. The main purpose of this step is to subdivide topics, so that a coarse-grained topic is genuinely divided into distinct topics that are clearly distinguishable, improving the accuracy of the clustering result. Similarity between topics is calculated with the same formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$

In the formula, n is the total number of features of text feature vectors A and B, $A_i$ denotes the i-th feature of text feature vector A, $B_i$ denotes the i-th feature of text feature vector B, and $\cos\theta$ is the cosine of the angle between the two text vectors.

Hierarchical clustering between topics, to merge topics: a further bottom-up agglomerative hierarchical clustering (equivalent to clustering across topics) is applied to all clusters obtained previously, merging the clusters that meet the topic similarity threshold. The clustering process is based on title similarity, body-text similarity, named-entity similarity and non-named-entity similarity, and the threshold is controlled by dficfSimThreshold. The main purpose of this step is to handle the case where one real topic was split among the topics produced at initialisation, further improving the accuracy of the clustering result;
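The cosine similarity and the bottom-up merging can be sketched together as follows. This is a simplified illustration: clusters are compared through the cosine similarity of their centroids over a single feature vector per article, which is an assumption, since the text combines several similarity measures. The function names are hypothetical, and `threshold` stands in for dficfSimThreshold.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agglomerate(clusters, threshold):
    """Repeatedly merge the most similar pair of clusters while it meets the threshold.

    clusters: list of clusters, each a list of feature vectors.
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best_sim, best_pair = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine_similarity(centroid(clusters[i]), centroid(clusters[j]))
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


merged = agglomerate([[[1.0, 0.0]], [[0.9, 0.1]], [[0.0, 1.0]]], 0.9)
```

In this example the first two clusters point in nearly the same direction and are merged, while the third stays separate. The same routine could be reused for the cross-batch merging step, since the text applies the same kind of clustering there.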

Single-point topic merging, to reduce single-point topics: the small clusters obtained above (both single points and non-single points) are merged, the single points being merged into the non-single-point clusters. The merging process is as follows: for each single point, the articles among the m articles that are sufficiently similar to it (similarity > innerBatchSPKNNThreshold) are selected; a vote then determines which cluster contains the most of these similar articles, and the single point is assigned to that cluster. This step mainly addresses the problem of too many single-point clusters: by merging single points that are actually related, it both reduces the number of single-point clusters and further improves the accuracy of the clustering result;
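The voting rule can be illustrated with the short sketch below. It assumes the similarities between the single point and the other articles of the batch have already been computed; the function name and the input layout (pairs of similarity and cluster identifier) are hypothetical.

```python
from collections import Counter


def vote_for_cluster(neighbours, threshold):
    """Pick the cluster holding the most articles similar to a single point.

    neighbours: list of (similarity, cluster_id) pairs, one per article in the
    batch that belongs to a non-single-point cluster.
    threshold: plays the role of innerBatchSPKNNThreshold.
    Returns the winning cluster_id, or None if no article is similar enough.
    """
    votes = Counter(cluster_id for sim, cluster_id in neighbours if sim > threshold)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A single point with two sufficiently similar articles in cluster "a" and one in cluster "b" would be assigned to "a"; if no article exceeds the threshold, the point remains a single-point topic.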

Single-point topic merging between batches: the clusters obtained for each batch can again be divided into two classes, single points and non-single points, and a process similar to step 4.4 is applied once more. The difference from 4.4 is that 4.4 operates on m articles, where m is bounded, so the similarity between each single point and every article can be computed; here, in the worst case there may be n clusters with n very large, so for each single point only its distance to each cluster centre is calculated, and the most similar cluster above the threshold (given by crossBatchSPThreshold) is selected. Fusing single points across batches in this way further reduces single-point clusters and improves the accuracy of the clustering result;
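Comparing a single point only against cluster centres could be sketched as follows. The "distance" in the text is treated here as a cosine similarity where larger means closer, which is an interpretation; the function names and the dictionary layout are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_cluster(point, centroids, threshold):
    """Return the id of the most similar centroid above the threshold, else None.

    centroids: dict mapping cluster_id to centroid vector.
    threshold: plays the role of crossBatchSPThreshold.
    """
    best_id, best_sim = None, threshold
    for cluster_id, centre in centroids.items():
        sim = cosine(point, centre)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id


centroids = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
```

The cost per single point is one comparison per cluster rather than one per article, which is the saving the step above is after.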

Cluster merging between batches: a further bottom-up agglomerative hierarchical clustering is applied to the results passed on by the independently processed batches (i.e. a group of clusters), again with the threshold dficfSimThreshold. This step is needed because the articles were divided into batches in step 3 without regard to their content, so clusters from different batches must be agglomerated with one another.

Generating news topics: finally, the keywords in each clustering result are sorted by weight, and the top w words are taken to represent the main topic contained in the cluster. If monitoring is to continue for topic detection and discovery, data continues to be received and the preceding steps are repeated.
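Selecting the top w keywords of a cluster might be done as in the sketch below. The text does not say how keyword weights are combined across the articles of a cluster; summing per-article weights (for example TF-IDF values) is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def topic_keywords(cluster_docs, w):
    """Return the w highest-weighted words of one cluster.

    cluster_docs: list of dicts mapping word to weight, one dict per article.
    """
    totals = defaultdict(float)
    for doc in cluster_docs:
        for word, weight in doc.items():
            totals[word] += weight
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:w]]


keywords = topic_keywords(
    [{"quake": 0.5, "rescue": 0.2}, {"quake": 0.4, "aid": 0.3}], 2
)
```

Ties are broken alphabetically here so that the output is deterministic.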

New-cluster fusion: the centroid of each new cluster is compared with the centroids of existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, and otherwise it is treated as a new cluster.
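The fusion rule can be sketched as follows, under several assumptions not stated in the text: centroids are compared by cosine similarity, a merged centroid is the size-weighted average of the two centroids, cluster identifiers are unique across new and historical clusters, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse_with_history(new_clusters, history, threshold):
    """Merge each new cluster into its most similar historical cluster, or add it.

    new_clusters, history: dicts mapping cluster_id to (centroid, size).
    history is updated in place and returned.
    """
    for new_id, (new_centroid, new_size) in new_clusters.items():
        best_id, best_sim = None, threshold
        for hist_id, (hist_centroid, _) in history.items():
            sim = cosine(new_centroid, hist_centroid)
            if sim >= best_sim:
                best_id, best_sim = hist_id, sim
        if best_id is None:
            # No historical cluster is similar enough: keep as a new cluster.
            history[new_id] = (list(new_centroid), new_size)
        else:
            hist_centroid, hist_size = history[best_id]
            total = hist_size + new_size
            merged = [
                (h * hist_size + n * new_size) / total
                for h, n in zip(hist_centroid, new_centroid)
            ]
            history[best_id] = (merged, total)
    return history


history = fuse_with_history(
    {"n1": ([0.9, 0.1], 1), "n2": ([0.0, 1.0], 2)},
    {"h1": ([1.0, 0.0], 3)},
    0.9,
)
```

In the example, "n1" is absorbed into the historical cluster "h1", which keeps the existing topic continuous, while "n2" is unlike anything seen before and becomes a new topic.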

The above describes preferred embodiments of the present invention. It should be noted that the above embodiments illustrate the invention without limiting it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. Those skilled in the art may make various changes and modifications without departing from the spirit and substance of the invention, and such changes and modifications are also regarded as falling within the protection scope of the invention.

Claims

1. A news topic automatic discovery method, characterized by comprising the following steps: first setting incremental clustering parameters and incremental clustering trigger parameters, and clustering incremental data in batches; preprocessing the input text within a batch to obtain a batch of N articles, unifying their text encoding, performing Chinese word segmentation, removing special characters and stop words, computing text features, generating text feature vectors, extracting text feature words and constructing a set of text feature vectors; within the batch, first performing theme clustering, then performing hierarchical clustering within each theme to subdivide the themes, and then performing hierarchical clustering between themes, wherein, for all clusters obtained previously, themes are merged by a further bottom-up agglomerative hierarchical clustering, and, for all articles in each theme, a top-down divisive hierarchical clustering gradually subdivides the articles of the theme into smaller and smaller clusters; merging single-point themes into non-single-point clusters among the small clusters obtained above by selecting similar articles among the m articles of the batch; performing single-point theme merging between batches by computing the similarity between each single-point article and all clusters, i.e., the distance from each single point to each cluster centre, and merging the single point into the most similar cluster that exceeds a threshold; then performing clustering between batches, in which a further bottom-up agglomerative hierarchical clustering among the group of clusters resulting from the independent processing agglomerates clusters across batches and completes the hierarchical clustering between themes; sorting the clustering result by keyword weight and extracting the group of highest-weighted words that represents a news topic, thereby generating news topics; performing new cluster fusion, in which the centroid of each new cluster is compared with the centroids of the existing historical clusters, the new cluster being merged with an existing historical cluster if the threshold is met and otherwise kept as a new cluster; and then merging the clustering result of the newly added data with the existing clustering result across batches.

2. The news topic automatic discovery method according to claim 1, characterized in that: the single-point clusters produced in a one-off clustering process are further merged, and the clustering result of newly added data is merged with the existing clustering result across batches, which ultimately improves the accuracy of the topics and guarantees their continuity.

3. The news topic automatic discovery method according to claim 1, characterized in that the incremental clustering parameters include: the number of texts clustered in a single batch, batchsize; the time interval of single-batch clustering; the similarity threshold for hierarchical clustering within a theme, wordSimThreshold; the similarity threshold for hierarchical clustering between themes; the similarity threshold for comparing single-point clusters with existing clusters within a batch; and the similarity threshold for comparing single-point clusters with existing clusters between batches.

4. The news topic automatic discovery method according to claim 1, characterized in that: in batch clustering, the N articles are divided into several small batches and each small batch is clustered within the batch; supposing the current batch contains m articles, theme clustering within the batch is performed as follows: theme clustering is applied to the feature words of all articles in the small batch to obtain a number of themes, the m articles are associated with these themes, and the most relevant theme is selected for each article; if an article has no relevant theme, a free theme is generated for it separately; hierarchical clustering within themes and theme subdivision are then carried out.

5. The news topic automatic discovery method according to claim 1, characterized in that: article themes are refined based on the similarity between articles, which includes title similarity, body-text similarity, named-entity similarity and non-named-entity similarity.

6. The news topic automatic discovery method according to claim 1, characterized in that: the text encoding is unified by converting the input data to UTF-8 or GBK Chinese character encoding; Chinese word segmentation is performed on the text data in the unified format using jieba with a custom dictionary added, splitting the text into individual words by part of speech; stop words and special characters are then processed in the segmentation result, a stop-word dictionary and a special-character table being introduced to remove useless stop words and special characters; and feature words are extracted from the text based on the result of the text preprocessing.
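The encoding unification and the stop-word and special-character filtering can be sketched with the standard library alone. The sketch assumes the segmentation has already been done (for example with jieba, which is not imported here so that the example stays self-contained); the function names, the try-UTF-8-then-GBK heuristic, and the regular expression used to recognise special characters are illustrative assumptions.

```python
import re

def to_unicode(raw):
    """Decode raw bytes, trying UTF-8 first and GBK second.
    This is a heuristic: some byte sequences are valid in both encodings."""
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

# Tokens made up only of punctuation, symbols or underscores.
_SPECIAL = re.compile(r"^[\W_]+$")

def clean_tokens(tokens, stopwords):
    """Drop empty tokens, stop words and special-character tokens."""
    return [
        t for t in tokens
        if t.strip() and t not in stopwords and not _SPECIAL.match(t)
    ]
```

In Python 3 the `\W` class is Unicode-aware, so Chinese words are kept while full-width punctuation such as "，" is removed.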

7. The news topic automatic discovery method according to claim 1, characterized in that: feature words are extracted from each article using term frequency-inverse document frequency (TF-IDF); and, based on the result of the feature word extraction, a text feature vector is constructed, i.e., the text is represented as a word vector, for use in subsequent text similarity comparison.
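For reference, a small pure-Python TF-IDF computation is shown below. It uses one common variant (term frequency normalised by document length, unsmoothed inverse document frequency log(N/df)); the source does not specify which variant is used, so this choice and the function name are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict (word -> TF-IDF) per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    result = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        result.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return result
```

With this variant a word that occurs in every document of the batch receives weight 0, so it never ranks among the feature words.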

8. The news topic automatic discovery method according to claim 1, characterized in that: for every article, the words are sorted by TF-IDF value and the top t words are chosen to form a feature word list; the feature vector of every article is then built by comparison with the feature word list, the value being added where the word is present and 0 being added where it is not, which yields the feature vector of each article; and it is judged whether the number of input texts satisfies the parameter batchsize, clustering being started automatically if the time timeout is reached before batchsize has been reached.
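The vector construction and the trigger condition can be sketched as follows. The claim leaves open whether the feature word list is built per article or shared by the batch; this sketch assumes a shared list formed from the union of each article's top t words, and it assumes TF-IDF weights are already available as dictionaries. All function names are hypothetical.

```python
def build_feature_words(weights_per_doc, t):
    """Union of each article's top-t words by weight, in first-seen order."""
    feature_words, seen = [], set()
    for weights in weights_per_doc:
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:t]
        for word, _ in top:
            if word not in seen:
                seen.add(word)
                feature_words.append(word)
    return feature_words

def to_vector(weights, feature_words):
    """Weight of each feature word in the article, 0.0 where it is absent."""
    return [weights.get(word, 0.0) for word in feature_words]

def should_start_clustering(pending, batchsize, waited, timeout):
    """Start when the batch is full, or when the timeout has elapsed
    and at least one text is waiting."""
    return pending >= batchsize or (pending > 0 and waited >= timeout)
```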

9. The news topic automatic discovery method according to claim 1, characterized in that: the LDA algorithm is used to generate a number of initial themes for the news texts within a batch: with words as feature items, each text is regarded as being composed of feature words, every text corresponds to a probability distribution over themes, and each theme corresponds to its own probability distribution over words; for k themes <t_1, t_2, ..., t_k>, using the formula P(word | document) = P(word | theme) * P(theme | document), training yields two vectors: the probabilities of the document corresponding to the different themes, θ_d = <p_t1, p_t2, ..., p_tk>, and, for each theme, the probabilities of generating the different words, written here as φ_t = <p_w1, p_w2, ...>; the theme classification result of the document is thereby obtained, where t denotes the theme corresponding to the feature words of each text, p_ti denotes the probability that document d corresponds to the i-th theme, and p_wi denotes the probability of generating the i-th word.
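The relationship between the two distributions can be checked numerically. The sketch below does not train an LDA model; it assumes the document-theme probabilities and the theme-word probabilities are already given, and it sums the product P(word | theme) * P(theme | document) over all themes, which is how the per-theme terms combine into P(word | document). The function names and the toy numbers are illustrative.

```python
def word_probability(theta_d, phi, word):
    """P(word | document) = sum over themes of P(word | theme) * P(theme | document).

    theta_d: list of theme probabilities for one document.
    phi: list of dicts, one per theme, mapping word -> P(word | theme).
    """
    return sum(p_t * phi_t.get(word, 0.0) for p_t, phi_t in zip(theta_d, phi))

def most_likely_theme(theta_d):
    """Index of the theme with the highest probability for the document."""
    return max(range(len(theta_d)), key=lambda i: theta_d[i])
```

For example, with theta_d = [0.7, 0.3] and a word that has probability 0.5 under the first theme and 0.2 under the second, the word probability is 0.7 * 0.5 + 0.3 * 0.2 = 0.41.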

10. The news topic automatic discovery method according to claim 1, characterized in that: in theme subdivision, similarity is calculated with the cosine similarity formula (2), which uses the angle between text vectors to judge how similar the vectors are; the similarity between themes is calculated with the same formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}} \qquad (2)$$

In the formula, θ denotes the angle between the two text vectors, n denotes the total number of features of text feature vector A and text feature vector B, A_i denotes the i-th feature of text feature vector A, and B_i denotes the i-th feature of text feature vector B.
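As a worked illustration of the cosine similarity, take two hypothetical three-feature vectors A = (1, 2, 0) and B = (2, 1, 1); the numbers are chosen only for the example:

```latex
\cos\theta
  = \frac{1\cdot 2 + 2\cdot 1 + 0\cdot 1}
         {\sqrt{1^{2}+2^{2}+0^{2}}\;\sqrt{2^{2}+1^{2}+1^{2}}}
  = \frac{4}{\sqrt{5}\,\sqrt{6}}
  = \frac{4}{\sqrt{30}}
  \approx 0.730
```

A value close to 1 indicates nearly parallel vectors (very similar texts), while a value close to 0 indicates texts that share almost no weighted feature words.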