CN109710728A - News topic automatic discovering method - Google Patents

News topic automatic discovering method

Info

Publication number
CN109710728A
Authority
CN
China
Prior art keywords
cluster
text
theme
batch
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811417992.2A
Other languages
Chinese (zh)
Other versions
CN109710728B (en)
Inventor
崔莹
代翔
刘宇
王侃
彭易锦
黄细凤
丁洪丽
杨露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc filed Critical Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN201811417992.2A priority Critical patent/CN109710728B/en
Publication of CN109710728A publication Critical patent/CN109710728A/en
Application granted granted Critical
Publication of CN109710728B publication Critical patent/CN109710728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic news topic discovery method that aims to improve the accuracy of news topic discovery. The technical scheme is as follows. First, the incremental clustering parameters and the incremental clustering trigger parameters are set, and the incremental data is clustered batch by batch. The input text is preprocessed: the articles are converted to a unified text encoding, text features are calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is built. Within each batch, theme clustering is performed first, followed by hierarchical clustering within each theme. Then, for each single-point theme, the similarity to all clusters is calculated, that is, the distance from each single point to each cluster center, and the single point is merged into the most similar cluster. Clusters from different batches are agglomerated with one another, which completes the hierarchical clustering between themes. News topics are then generated and new clusters are fused: the centroid of each new cluster is compared with the centroids of the existing historical clusters, and the clustering result for the newly added data is fused across batches with the existing clustering result.

Description

News topic automatic discovering method
Technical field
The present invention relates to the field of text mining, in particular to the mining of hot news topics, and specifically to a news topic discovery method based on multi-step hybrid incremental clustering.
Background art
Online news is one of the most important types of information in Internet media and an important channel through which people obtain news. The Internet touches every aspect of life, and the volume of web page information is growing explosively, so news data has become massive. Faced with such rapidly growing volumes of web information, ordinary Internet users find it increasingly difficult to locate the information they personally need. For government departments, promptly and accurately identifying the hot topics that ordinary users care about, and managing online public opinion, has become an important discipline. Because the news surging onto the network is cluttered and redundant, searching for news topics manually involves an enormous workload, which makes it a great challenge for people to obtain the information they need from massive news data.
Some large domestic portal sites often publish special news features on the hot topics and focus events that most ordinary users follow within a given period. At this stage, however, these sites still rely mainly on manual screening and editing to place related news into a special feature. This approach has serious drawbacks: it cannot meet user needs effectively and promptly, and the features often carry the personal views and leanings of the web editor, so they cannot reflect the truth of an event fully objectively and neutrally. How to quickly and accurately obtain, from web page information numbering in the tens of billions, the popular information and hot topics that ordinary users follow, especially major recent events, has therefore become a real need of ordinary users.
Here, an event (Event) is a specific occurrence at a certain time and place involving certain people. A topic (Topic) is regarded as consisting of a breaking event and the related events it causes, so a topic can be thought of as a set of events. A report (Story) is a news report related to some event, mainly covering online news articles, television news segments and broadcast news. Segmenting news reports means cutting a stream of news reporting so that reports on different topics are divided into independent news report segments. The main task of topic detection (TD) is to discover and identify newly occurring hot events and hot topics in a news report stream generated in real time. Classical topic detection and tracking systems are all built on converting the text content of news reports into the vector space model (VSM), but VSM introduces noise when representing text, which brings errors to the detection of new events. A topic detection system in practical use generally needs to cluster news report documents incrementally in real time, and must discover newly occurring events and sudden major events quickly and accurately.
Compared with topic detection and tracking, the mining and automatic discovery of hot news topics has greater application value. Traditional topic discovery mainly targets long texts and news data sets. Large-scale short text is sparse, unstructured and noisy, so conventional methods struggle to discover topics in it effectively. Existing hot news topic discovery systems all have problems. For example, the accuracy of focus event detection is low and misclassification is widespread, and most systems simply list the news reports for a hot event without extracting feature words or summary information for it, which makes news browsing and reading inconvenient for users. The focus event features of many portal sites are mostly edited and organized manually. This approach can only pick out the hot topics of a month or a year, cannot publish the latest breaking events in time, and has poor timeliness.
Topic discovery technology can help people quickly obtain information of interest from the rich information resources of the Internet, and can help governments, enterprises and institutions keep up with the latest public opinion. However, traditional text classification methods cannot be applied directly in a topic detection system. The news report documents handled by topic detection and tracking grow dynamically: as time passes, the number of news documents gradually increases. If text classification or clustering is applied directly to these dynamically growing documents, the number of documents becomes too large and the time and space complexity too high. Existing text classification and clustering algorithms therefore need to be improved to suit the characteristics of topic detection and tracking.
Topic discovery automatically finds topics in massive news data, uses topics as an index to organize the data, and aggregates and organizes content related to each topic. In essence, topic discovery methods use incremental clustering to gather texts with similar content in massive text data into the same theme, so that texts in the same class are highly similar in topic while texts in different classes have low topic similarity. Traditional incremental text clustering methods fall into two classes:
The first class iteratively clusters all the data every time: after each time interval, all the data is clustered again in one pass. Its advantage is high precision. Its drawbacks are that earlier clustering results cannot be reused, resources are wasted, and successive clustering results are not guaranteed to be consistent. The second class reuses earlier clustering results: newly added data is assigned to the nearest existing cluster, and that cluster's centroid is recalculated. Its advantage is that the clustering does not have to be recomputed over all the data every time. Its drawbacks are that as clusters keep growing they are prone to centroid drift, so topics cannot persist, and that because new data is only compared for similarity with existing clusters, new clusters cannot be generated, so topic accuracy is low. In addition, existing incremental clustering methods usually complete the clustering task with a single clustering method, which leads to the following problems. Clustering is an unsupervised learning method that requires the initial number of clusters to be specified, and conventional methods do not account for the similarity between clusters that can arise from random selection of cluster centroids during initialization; that is, data that clearly belongs to the same topic is split across different topics. Furthermore, the single-point clusters produced by one-off clustering receive no further fusion, which leaves many topics that contain only a single text.
The present invention studies the application of a hybrid incremental clustering method to news topic discovery and analysis. The technique is aimed at overcoming the drawbacks of the incremental clustering methods above in the topic discovery process, and proposes a corresponding incremental clustering workflow for topic discovery and analysis.
Summary of the invention
The object of the invention is to overcome the drawbacks of the incremental clustering methods above in the topic discovery process by providing an automatic news topic discovery method based on hybrid incremental clustering that organizes and discovers topics in massive news data, preserves the continuity of existing topics, and improves the accuracy of topic discovery.
The above object can be achieved by the following measures. An automatic news topic discovery method is characterized by comprising the following steps. First, the incremental clustering parameters and the incremental clustering trigger parameters are set, and the incremental data is clustered batch by batch. Within a batch, text preprocessing is applied to the input text: a batch of N articles is obtained, converted to a unified text encoding, segmented into Chinese words, and stripped of special characters and stop words. Text features are then calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is built. Within the batch, theme clustering is performed first, followed by hierarchical clustering within each theme to subdivide the themes: a top-down divisive hierarchical clustering is performed on all articles in each theme, so that the articles in a theme are gradually subdivided into smaller and smaller clusters. Hierarchical clustering between themes then merges themes: a bottom-up agglomerative hierarchical clustering is performed on all clusters obtained so far. The small clusters thus obtained undergo single-point theme merging and non-single-point merging, in which the more similar articles among the m articles are selected. Single-point theme merging between batches follows: for each single-point theme, the similarity between its article and all clusters is calculated, that is, the distance from each single point to each cluster center, and the single point is merged into the most similar cluster that exceeds the threshold. Cluster analysis between batches is then carried out: a further bottom-up agglomerative hierarchical clustering is performed across the set of clusters resulting from the independent batch processing, so that clusters from different batches are agglomerated with one another and the hierarchical clustering between themes is completed. The keywords of the clustering result are sorted by weight, and the group of highest-weighted words is extracted to represent a news topic. News topics are generated and new clusters are fused: the centroid of each new cluster is compared with the centroids of the existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, otherwise it is kept as a new cluster. Finally, the clustering result for the newly added data is fused across batches with the existing clustering result.
The beneficial effects of the present invention are:
Unlike traditional topic discovery methods for news media, the incremental clustering method adopted by the invention divides massive news data into batches and applies multi-step clustering: clustering per batch, merging of clustering results within a batch, fusion of single-point topics within a batch, and fusion of clustering results between batches. Clustering, merging and clustering again in multiple steps improves resource utilization and the accuracy of the topic discovery result. Because the incremental data is clustered batch by batch, each clustering run can reuse the existing clustering result and does not need to recluster all the data, which saves resources and keeps the clustering results consistent.
Within each batch, the invention first performs theme clustering and then hierarchical clustering within each theme, which subdivides the themes and ensures that themes have low similarity to one another and strong independence. Hierarchical clustering between themes then resolves the problem of one topic being split apart because cluster centroids are selected randomly during clustering initialization. Further fusion of the single-point clusters produced by one-off clustering resolves the problem of many topics containing only a single text. Fusing the clustering result for the newly added data across batches with the existing clustering result ultimately improves the accuracy of the topics and preserves their continuity.
Consistency of clustering results. The invention first sets the incremental clustering trigger parameters and clusters the incremental data batch by batch, so each clustering run can reuse the existing clustering result and does not need to recluster all the data, which saves resources and keeps the clustering results consistent.
Theme independence. Within each batch, the invention first performs theme clustering and then hierarchical clustering within each theme, which subdivides the themes and ensures that themes have low similarity to one another and strong independence.
Theme splitting. Through hierarchical clustering between themes, the invention resolves the problem of one topic being split apart because cluster centroids are selected randomly during clustering initialization.
Single-point themes. By further fusing the single-point clusters produced by one-off clustering, the invention resolves the problem of many topics containing only a single text. Fusing the clustering result for the newly added data across batches with the existing clustering result ultimately improves the accuracy of the topics and preserves their continuity.
Description of the drawings
Fig. 1 is a schematic flow diagram of the automatic news topic discovery process of the invention.
Fig. 2 is a schematic diagram of the text preprocessing flow in Fig. 1.
Fig. 3 is a flow diagram of an embodiment of the multi-step hybrid incremental clustering process of the invention.
To make the object, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.
Detailed description of the embodiments
Refer to Fig. 1. According to the invention, the incremental clustering parameters and the incremental clustering trigger parameters are set first, and the incremental data is clustered batch by batch. Text preprocessing is applied to the input text: a batch of N articles is obtained, converted to a unified text encoding, segmented into Chinese words, and stripped of special characters and stop words. Text features are then calculated, text feature vectors are generated, text feature words are extracted, and a set of text feature vectors is built. Within each batch, theme clustering is performed first, then hierarchical clustering within each theme, and then hierarchical clustering between themes. The method checks whether the amount of input text reaches the single-batch clustering text count batchsize, or whether the elapsed time reaches the single-batch clustering time interval timeout; if the time interval is reached before the text count is, clustering starts automatically. Hybrid incremental clustering is applied to the input text, including further fusion of the single-point clusters produced by one-off clustering, and news topics are generated: the keywords of each clustering result are sorted by weight, and the group of highest-weighted words is extracted to represent a news topic. The centroid of each new cluster is compared with the centroids of the existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, otherwise it is kept as a new cluster. The clustering result for the newly added data is thus fused across batches with the existing clustering result, and keywords are taken from the result to represent each topic. Finally, the method decides whether to continue: if not, processing ends; if so, it keeps receiving data and returns to the text preprocessing step, looping until it terminates. Fusing the single-point clusters and fusing new results across batches with existing results ultimately improves the accuracy of the topics and preserves their continuity.
The incremental clustering parameters to be set are: the single-batch clustering text count batchsize; the single-batch clustering time interval timeout; the similarity threshold for hierarchical clustering within themes, wordSimThreshold; the similarity threshold for hierarchical clustering between themes (listed here as wordSimThreshold, while the detailed steps below name it dficfSimThreshold); the similarity threshold for comparing a single point with existing clusters within a batch, innerBatchSPKNNThreshold; and the similarity threshold for comparing a single point with existing clusters between batches, crossBatchSPThreshold.
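The parameters above could be gathered into one configuration object. The following is only a sketch: the class name `IncrementalClusteringParams` and every default value are assumptions made for illustration, since the source names the parameters but gives no values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncrementalClusteringParams:
    """Hypothetical container; field names follow the source, values are assumed."""
    batchsize: int = 1000                   # texts per single-batch clustering run
    timeout: float = 600.0                  # seconds between single-batch runs
    wordSimThreshold: float = 0.5           # hierarchical clustering within themes
    dficfSimThreshold: float = 0.5          # hierarchical clustering between themes
    innerBatchSPKNNThreshold: float = 0.5   # single point vs. clusters, within a batch
    crossBatchSPThreshold: float = 0.5      # single point vs. clusters, between batches


# Override only what differs from the assumed defaults.
params = IncrementalClusteringParams(batchsize=500)
```

Freezing the dataclass keeps the thresholds fixed for the lifetime of a run, which fits the requirement that successive batches be clustered consistently.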
The processing flow of the following embodiment uses these steps:
[Text preprocessing]: a batch of N articles is obtained, converted to a unified text encoding, segmented into Chinese words, and stripped of special characters and stop words; text features are calculated and text feature vectors are generated;
[Automatic clustering condition check]: the method checks whether the amount of input text reaches the single-batch clustering text count; if the single-batch clustering time interval is reached before the single-batch clustering text count is, clustering starts automatically;
[Batch clustering]: the N articles are divided into several small batches, and within-batch clustering is performed on each small batch. Suppose the current batch contains m articles. Theme clustering within the batch: the feature words of all articles in the batch are clustered into themes, the m articles are associated with these themes, and the most relevant theme is selected for each article; an article with no related theme forms a free theme on its own. Hierarchical clustering within themes (theme subdivision): themes are refined using inter-article similarity, which includes title similarity, body text similarity, named-entity similarity and non-named-entity similarity; a top-down divisive hierarchical clustering is performed on all articles in each theme. Hierarchical clustering between themes (theme merging): a bottom-up agglomerative hierarchical clustering (equivalent to clustering across themes) is performed on all clusters obtained so far, which resolves cases where one theme was split into several. Single-point theme reduction (single-point merging): the small clusters obtained above, both single-point and non-single-point, are merged; for each single point, the more similar articles among the m articles are selected, which resolves the problem of too many single-point themes;
[Single-point theme merging between batches]: the clusters obtained from each batch again fall into two classes, single-point and non-single-point. A similar process is repeated: for each single-point theme, the similarity between its article and all clusters is calculated, that is, the distance from each single point to each cluster center, and the single point is merged into the most similar cluster that exceeds the threshold;
[Cluster analysis between batches]: a further bottom-up agglomerative hierarchical clustering is performed across the results passed on from the independent batch processing (a set of clusters). Because the articles were split into batches in step 3 without any basis in content, clusters from different batches need to be agglomerated with one another here.
[News topic generation]: the keywords of the clustering result are sorted by weight, and the group of highest-weighted words is extracted to represent a news topic.
[New cluster fusion]: the centroid of each new cluster is compared with the centroids of the existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, otherwise it is kept as a new cluster.
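The automatic clustering condition among the steps above can be sketched as a small predicate. The function name `should_start_clustering` and its argument names are hypothetical, and skipping a run when no text is pending is an added assumption that the source does not state.

```python
def should_start_clustering(pending_texts, seconds_since_last_run, batchsize, timeout):
    """Return True when a batch is full, or when the time interval has elapsed
    with at least one pending text (the non-empty check is an assumption)."""
    if pending_texts >= batchsize:
        return True
    # Interval reached before the batch filled up: cluster what has arrived.
    return seconds_since_last_run >= timeout and pending_texts > 0
```

For example, with `batchsize=100` and `timeout=60`, three pending texts trigger a run after 61 seconds but not after 10 seconds.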
Refer to Fig. 2. This embodiment uses jieba for word segmentation. According to this embodiment, jieba requires its input text to be GBK-encoded, so all input data is first converted to GBK, and the segmented output is saved as UTF-8. The text preprocessing flow of this embodiment preprocesses the input text through the following steps:
All news documents are first initialized as one large cluster, which is then split into smaller clusters until the threshold of one of the clusters meets the preset threshold. The text encoding is unified: input data is converted to a single Chinese character encoding such as UTF-8 or GBK, chosen according to the file format that the segmentation step requires. The uniformly encoded text is then segmented into Chinese words; as described above, this embodiment uses jieba with a custom dictionary added, splitting the text into individual words by part of speech. Stop words and special characters are then removed from the segmentation result, because segmented text contains many stop words and special characters (such as ',') that would distort the clustering result; this step removes them using a stop-word dictionary and a special-character table. Feature words are then extracted from the preprocessed text; this embodiment uses term frequency-inverse document frequency (TF-IDF) for feature word extraction. Text feature vectors are built from the extracted feature words, so that each text is represented by a word vector for later similarity comparison: the words of each article are sorted by TF-IDF value, the top t words are chosen to form the feature word list, and each article's feature vector is built by comparing the article against the list, adding an entry for each word the article contains and 0 otherwise. Finally, the method checks whether the amount of input text reaches the parameter batchsize; if timeout is reached before batchsize is, clustering starts automatically.
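The feature extraction just described can be illustrated in a few lines. This is a minimal sketch under several assumptions: the articles are already segmented into token lists (for instance by jieba) with stop words removed, the TF-IDF variant is the plain `tf * log(N / df)` form, and the vector entry for a present word is its TF-IDF weight. The function names `tfidf` and `feature_vectors` are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: tf-idf weight} dict per document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return result


def feature_vectors(docs, t):
    """Build a shared feature word list from each document's top-t words,
    then one vector per document (weight if the word is present, else 0)."""
    weights = tfidf(docs)
    vocab = []
    for doc_weights in weights:
        # Sort by descending weight, breaking ties alphabetically for determinism.
        top = sorted(doc_weights, key=lambda word: (-doc_weights[word], word))[:t]
        for word in top:
            if word not in vocab:
                vocab.append(word)
    vectors = [[doc_weights.get(word, 0.0) for word in vocab] for doc_weights in weights]
    return vocab, vectors


docs = [["flood", "city", "rescue"],
        ["flood", "river", "rescue"],
        ["election", "vote", "city"]]
vocab, vectors = feature_vectors(docs, t=2)
```

Every vector has the same length as `vocab`, so any two articles can be compared directly in the similarity steps that follow.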
Refer to Fig. 3. The hybrid incremental clustering flow of this embodiment proceeds in the following detailed steps:
Batch clustering: the N articles are divided into several batches according to batchsize so that within-batch clustering can be performed on each batch of data, which is why the number of articles in a batch must be limited (suppose the current batch contains m articles):
Theme clustering within the batch: the feature words of all articles in each batch are clustered into themes, which amounts to extracting themes from the content formed by those feature words. This yields a number of themes; the m articles are associated with these themes and the most relevant theme is selected for each article, while an article with no related theme forms a free theme on its own. This embodiment uses the LDA algorithm to generate initial themes for the news texts in a batch. Words serve as feature items, and each text is regarded as composed of feature words. A text can contain several themes and corresponds to each with a different probability, so each text has a theme probability distribution, each theme has its own word probability distribution, and each word of a text corresponds to a theme with a different probability. Each document d among the m articles in the batch can be regarded as a word sequence w = <w1, w2, ..., wn>, where wi denotes the i-th word of the text.
Let t denote the theme corresponding to the feature words of each text, with k themes <t1, t2, ..., tk>. Using the formula P(word | document) = P(word | theme) × P(theme | document), training yields two vectors: for each document d, the probabilities of its corresponding to the different themes, θd = <pt1, pt2, ..., ptk>, and for each theme, the probabilities of its generating the different words. These give the document-to-theme classification result, where pti denotes the probability that document d corresponds to the i-th theme and pwi denotes the probability of generating the i-th word.
During theme clustering, the m articles are associated with these themes and the most relevant theme is selected for each article. Some articles will not be associated with any theme: some keywords are filtered out because they are too weak or because the keyword clustering parameters are unsuitable, which leaves the article associated with none of the keywords above. Such an article is considered to be associated with a free theme;
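Selecting the most relevant theme for an article, with a fallback to a free theme, might look like the sketch below. It assumes the per-document theme probabilities have already been produced by an LDA implementation. The marker `FREE_THEME`, the function `assign_theme` and the cut-off `min_prob` are all hypothetical; the source does not say how "no related theme" is decided.

```python
FREE_THEME = -1  # hypothetical marker meaning "no related theme"


def assign_theme(theme_probs, min_prob):
    """Return the index of the most probable theme, or FREE_THEME when the
    best probability stays below min_prob (an assumed cut-off)."""
    if not theme_probs:
        return FREE_THEME
    best = max(range(len(theme_probs)), key=lambda i: theme_probs[i])
    return best if theme_probs[best] >= min_prob else FREE_THEME
```

An article with theme probabilities `[0.1, 0.7, 0.2]` and `min_prob=0.5` is assigned theme 1, while a near-uniform distribution falls through to the free theme.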
Hierarchical clustering within themes (theme subdivision): a top-down divisive hierarchical clustering is performed on all articles in each theme, gradually subdividing the articles in a theme into smaller and smaller clusters. The clustering uses inter-article similarity, which includes title similarity, body text similarity, named-entity similarity and non-named-entity similarity. The similarity threshold is controlled by wordSimThreshold; texts that meet the similarity threshold are treated as one cluster.
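One way to realize a top-down divisive clustering of this kind is sketched below. It is a simplification, not the patent's procedure: a single feature vector per article stands in for the combined title, body and entity similarities, and the splitting rule (seed the two halves with the least similar pair) is an assumption. The names `cosine` and `divisive` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def divisive(indices, vectors, threshold):
    """Split a cluster top-down until every pair inside a cluster reaches threshold."""
    if len(indices) <= 1:
        return [indices]
    pairs = [(cosine(vectors[i], vectors[j]), i, j)
             for k, i in enumerate(indices) for j in indices[k + 1:]]
    worst, seed_a, seed_b = min(pairs)  # least similar pair seeds the split
    if worst >= threshold:
        return [indices]
    left, right = [], []
    for i in indices:
        if i == seed_a:
            left.append(i)
        elif i == seed_b:
            right.append(i)
        elif cosine(vectors[i], vectors[seed_a]) >= cosine(vectors[i], vectors[seed_b]):
            left.append(i)
        else:
            right.append(i)
    return divisive(left, vectors, threshold) + divisive(right, vectors, threshold)


vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
clusters = divisive([0, 1, 2, 3], vectors, threshold=0.8)
```

Because each seed is forced into its own half, both halves are non-empty and the recursion always terminates.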
This embodiment computes similarity with the cosine similarity formula (1), which judges how similar two vectors are from the angle between the text vectors: the smaller the angle, the more similar the two articles. The main purpose of this step is to subdivide themes, so that a coarse-grained theme is really split into different themes that are clearly distinct, which improves the accuracy of the clustering result. The similarity between themes is calculated with the same formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}} \tag{1}$$

In the formula, n is the total number of features in text feature vector A and text feature vector B, Ai is the i-th feature of text feature vector A, Bi is the i-th feature of text feature vector B, and cos θ is the cosine of the angle between the two text vectors. Hierarchical clustering between themes (theme merging): a bottom-up agglomerative hierarchical clustering (equivalent to clustering across themes) is performed on all clusters obtained so far, merging the clusters that meet the theme similarity threshold. The clustering is based on title similarity, body text similarity, named-entity similarity and non-named-entity similarity, and the threshold is controlled by dficfSimThreshold. This step mainly resolves the case where one real theme was split apart among the themes produced at initialization, and further improves the accuracy of the clustering result;
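The bottom-up merging between themes can be sketched as repeated merging of the most similar pair of clusters. Comparing clusters by the cosine similarity of their centroids is an assumption here (the source does not name a linkage criterion), and a single feature vector per article again stands in for the combined similarities. The names `centroid` and `agglomerate` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity as in formula (1); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(cluster, vectors):
    """Component-wise mean of the vectors of the articles in a cluster."""
    dim = len(vectors[cluster[0]])
    return [sum(vectors[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def agglomerate(clusters, vectors, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cents = [centroid(c, vectors) for c in clusters]
        best, a, b = max((cosine(cents[i], cents[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if best < threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


vectors = [[1, 0], [0.9, 0.1], [0, 1]]
merged = agglomerate([[0], [1], [2]], vectors, threshold=0.8)
```

Recomputing the centroids after every merge keeps the comparison honest as clusters grow, at the cost of quadratic work per round.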
Single-point theme reduction (single-point merging): the small clusters obtained above (both single-point and non-single-point) are merged, with the single points being merged into the non-single-point clusters. The merging works as follows: for each single point, the articles among the m articles that are more similar to it are selected (those with similarity > innerBatchSPKNNThreshold), a vote then determines which cluster contains the most of these similar articles, and the single point is assigned to that cluster. This step mainly resolves the problem of too many single-point clusters: further merging joins single points that are in fact related, which both reduces the number of single-point clusters and further improves the accuracy of the clustering result;
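The voting procedure described here can be sketched as follows. The function name `merge_single_points` is hypothetical, ties between clusters are broken by whatever `Counter.most_common` returns first, and single points that attract no votes are simply left as they are; both choices are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_single_points(clusters, vectors, threshold):
    """Move each single-article cluster into the non-single cluster holding
    the most articles whose similarity to it exceeds threshold."""
    multi = [list(c) for c in clusters if len(c) > 1]
    singles = [c[0] for c in clusters if len(c) == 1]
    owner = {i: k for k, c in enumerate(multi) for i in c}  # article -> cluster index
    leftovers = []
    for s in singles:
        votes = Counter(owner[i] for i in owner
                        if cosine(vectors[s], vectors[i]) > threshold)
        if votes:
            multi[votes.most_common(1)[0][0]].append(s)
        else:
            leftovers.append([s])  # no similar article: stays a single point
    return multi + leftovers


vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.95, 0.05], [1, -1]]
result = merge_single_points([[0, 1], [2, 3], [4], [5]], vectors, threshold=0.8)
```

Here article 4 joins the first cluster because both of its similar neighbours live there, while article 5 has no neighbour above the threshold and remains alone.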
Single-point theme merging between batches: the clusters obtained from each batch again fall into two classes, single-point and non-single-point, and a process similar to 4.4 (the within-batch single-point merging above) is repeated. The difference from 4.4 is this: 4.4 operates on m articles, and the size of m is limited, so the similarity between each single point and every article can be calculated. Here, however, the worst case may involve n clusters, where n is a very large number, so for each single point only its distance to each cluster center is calculated, and the most similar cluster that exceeds the threshold is selected (the threshold is given by crossBatchSPThreshold). Fusing single points between batches in this way further reduces the number of single-point clusters and improves the accuracy of the clustering result;
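Comparing a single point against cluster centers only, rather than against every article, can be sketched like this. The name `nearest_centroid` is hypothetical, and cosine similarity is assumed as the measure of closeness, consistent with formula (1).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_centroid(single_vec, centroids, threshold):
    """Return the index of the most similar centroid whose similarity exceeds
    threshold, or None if no centroid qualifies."""
    best_idx, best_sim = None, threshold
    for idx, center in enumerate(centroids):
        sim = cosine(single_vec, center)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx
```

Each single point costs one comparison per cluster, which is the reason the text gives for switching from article-level to center-level comparison when the number of clusters is large.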
Inter-batch clustering: the results of the independently processed batches (i.e. a group of clusters) undergo a further bottom-up agglomerative hierarchical clustering among themselves, again with the threshold dficfSimThreshold. This step is needed because the division of the articles into batches in step 3 was arbitrary, so clusters from different batches must be agglomerated with one another.
Generating new topics: finally, the keywords in each resulting cluster are sorted by weight, and the top w words are taken to represent the main topic contained in that cluster. If monitoring is to continue for topic detection and discovery, data continues to be received and the preceding steps are repeated.
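The keyword selection above can be illustrated with the sketch below. The text only says that keywords are sorted by weight; summing each word's per-article weight across the cluster is an assumption made here for illustration, and `top_keywords` is a hypothetical name.

```python
from collections import defaultdict


def top_keywords(cluster_docs, w):
    """Return the w highest-weighted words of one cluster.

    cluster_docs is a list of dicts mapping word -> weight (for example
    a TF-IDF value), one dict per article in the cluster. Weights of the
    same word are summed over the cluster (an assumption), then words
    are ranked by total weight, with ties broken alphabetically.
    """
    totals = defaultdict(float)
    for doc in cluster_docs:
        for word, weight in doc.items():
            totals[word] += weight
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:w]]
```

For a cluster of two articles weighted `{"earthquake": 0.9, "rescue": 0.4}` and `{"earthquake": 0.7, "aid": 0.5, "rescue": 0.3}`, calling `top_keywords(docs, 2)` ranks "earthquake" first and "rescue" second.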
New cluster fusion: the centroid of each new cluster is compared with the centroids of the existing historical clusters; if the threshold is met, the new cluster is merged with the existing historical cluster, otherwise it is kept as a new cluster.
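As a rough sketch of the fusion rule above, the following compares centroids by cosine similarity and merges each new cluster into its best-matching historical cluster. The measure, the function name `fuse_new_clusters` and the assumption that cluster ids are unique across new and historical clusters are all illustrative choices, not details given in the text.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) for two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fuse_new_clusters(new_clusters, history, threshold):
    """Merge new clusters into historical ones, or keep them as new.

    Both arguments map cluster id -> list of article vectors; ids are
    assumed to be unique across the two dicts. history is updated in
    place and returned.
    """
    for new_id, vectors in new_clusters.items():
        new_centroid = centroid(vectors)
        best_id, best_sim = None, threshold
        for hist_id, hist_vectors in history.items():
            sim = cosine_similarity(new_centroid, centroid(hist_vectors))
            if sim >= best_sim:
                best_id, best_sim = hist_id, sim
        if best_id is None:
            history[new_id] = list(vectors)    # kept as a new cluster
        else:
            history[best_id].extend(vectors)   # merged into history
    return history
```

In a long-running deployment the historical centroids would normally be cached and updated incrementally rather than recomputed on every comparison.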
The above are preferred embodiments of the present invention. It should be noted that the above embodiments illustrate the invention rather than limit it, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. Those of ordinary skill in the art may also make various changes and modifications without departing from the spirit and substance of the invention, and such changes and modifications are likewise regarded as falling within the protection scope of the invention.

Claims (10)

1. An automatic news topic discovery method, characterized by comprising the following steps: first, setting the incremental clustering parameters and the incremental clustering trigger parameters, and clustering the incremental data in batches; within a batch, preprocessing the input text to obtain a batch of N articles, unifying their text encoding format, performing Chinese word segmentation, and removing special characters and stop words; calculating text features, generating text feature vectors, extracting text feature words, and constructing a text feature vector set; within the batch, first performing theme clustering, then performing intra-theme hierarchical clustering to subdivide the themes, and then performing inter-theme hierarchical clustering, in which themes are merged by applying a further bottom-up agglomerative hierarchical clustering to all previously obtained clusters; performing a top-down divisive hierarchical clustering on all articles in each theme, gradually subdividing the articles in the theme into smaller and smaller clusters; merging the single-point themes among the small clusters obtained above into non-single-point clusters by selecting, from the m articles, the articles with similar themes; performing inter-batch single-point theme merging by calculating the similarity between each single-point theme article and all clusters, i.e. the distance from each single point to each cluster centre, and merging the single point into the most similar cluster that exceeds the threshold; then performing inter-batch clustering, in which a further bottom-up agglomerative hierarchical clustering is applied among the group of clusters resulting from independent processing, so that clusters across batches are agglomerated with one another, completing the inter-theme hierarchical clustering; sorting the keywords of the clustering result by weight and extracting the group of highest-weighted words that represents a news topic, thereby generating the news topic; performing new cluster fusion, in which the centroid of each new cluster is compared with the centroids of the existing historical clusters, the new cluster being merged with the existing historical cluster if the threshold is met and otherwise kept as a new cluster; and then merging the clustering result of the newly added data with the existing clustering result across batches.
2. The automatic news topic discovery method according to claim 1, characterized in that: the single-point clusters produced in the one-pass clustering process undergo further fusion processing, and the clustering result of the newly added data is merged with the existing clustering result across batches, ultimately improving the accuracy of the topics and ensuring their continuity.
3. The automatic news topic discovery method according to claim 1, characterized in that the incremental clustering parameter settings include: the number of texts clustered in a single batch, batchsize; the clustering time interval of a single batch; the intra-theme hierarchical clustering similarity threshold, wordSimThreshold; the inter-theme hierarchical clustering similarity threshold; the in-batch single-point similarity threshold used when comparing single-point clusters with existing clusters; and the inter-batch similarity threshold used when comparing single-point clusters with existing clusters.
4. The automatic news topic discovery method according to claim 1, characterized in that: in batch clustering, the N articles are divided into several small batches and each small batch is clustered within the batch; assuming the current batch contains m articles, in-batch theme clustering is performed: the feature words of all articles in each small batch are theme-clustered to obtain a number of themes, the m articles are associated with these themes, and the most relevant theme is selected for each article; if an article has no related theme, a free theme is generated for it separately; intra-theme hierarchical clustering and theme subdivision are then carried out.
5. The automatic news topic discovery method according to claim 1, characterized in that: article themes are refined based on the similarity between articles, which includes title similarity, body similarity, named-entity similarity and non-named-entity similarity.
6. The automatic news topic discovery method according to claim 1, characterized in that: the text encoding format is unified by converting the input data to UTF-8 or GBK Chinese character encoding; Chinese word segmentation is performed on the text data in the unified format using jieba, with custom dictionaries added, splitting the text into individual words by part of speech; stop words and special characters are then removed from the segmentation result by means of a stop-word dictionary and a special-character table; and feature words are extracted from the text based on the preprocessing result.
7. The automatic news topic discovery method according to claim 1, characterized in that: feature words are extracted from the articles using term frequency-inverse document frequency (TF-IDF); based on the extracted text feature words, text feature vectors are constructed, i.e. each text is represented by a word vector, for use in subsequent text similarity comparison.
8. The automatic news topic discovery method according to claim 1, characterized in that: for each article, the words are sorted by TF-IDF value and the top t words are selected to form the feature word list; the feature vector of each article is then built by comparison with the feature word list, an entry being added where the article contains the word and 0 being added where it does not, giving each article its own feature vector; it is then judged whether the number of input texts satisfies the parameter batchsize, and if the time timeout is reached before batchsize has been reached, clustering starts automatically.
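For illustration only, the feature-vector construction and batch trigger recited above might be sketched as follows. Several details are assumptions: the unsmoothed TF-IDF variant, the use of the TF-IDF value itself as the vector entry for a word that is present, and the hypothetical names `tfidf`, `feature_vectors` and `should_start_clustering`. Input articles are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one dict (word -> TF-IDF) per tokenised article.

    Uses tf = count / article length and idf = ln(N / df), an
    unsmoothed variant chosen here for simplicity.
    """
    total = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(total / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def feature_vectors(docs, t):
    """Build the feature word list and one vector per article.

    The list is the union of every article's top-t words by TF-IDF.
    A vector entry is the word's TF-IDF value if the article contains
    the word, and 0 otherwise.
    """
    scores = tfidf(docs)
    vocab = []
    for score in scores:
        top = sorted(score, key=lambda word: (-score[word], word))[:t]
        for word in top:
            if word not in vocab:
                vocab.append(word)
    vectors = [[score.get(word, 0.0) for word in vocab] for score in scores]
    return vocab, vectors


def should_start_clustering(received, elapsed, batchsize, timeout):
    """Trigger clustering when the batch is full or the timeout expires."""
    return received >= batchsize or elapsed >= timeout
```

With two articles `["quake", "rescue", "quake"]` and `["election", "vote", "rescue"]` and t = 1, the shared word "rescue" scores 0 and the feature word list becomes `["quake", "election"]`.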
9. The automatic news topic discovery method according to claim 1, characterized in that: the LDA algorithm is used to generate some initial themes for the news texts in the batch: with words as feature items, each text is regarded as composed of feature words, each text corresponds to a theme probability distribution, and each theme corresponds to a different word probability distribution; for k themes $\langle t_1, t_2, \ldots, t_k \rangle$, using the formula P(word | document) = P(word | theme) * P(theme | document), training yields the two vectors representing the i-th feature Ai of text feature vector A and the i-th feature Bi of text feature vector B, together with the probabilities $\theta_d = \langle p_{t_1}, p_{t_2}, \ldots, p_{t_k} \rangle$ with which a document corresponds to the different themes and the probabilities with which each theme generates the different words, giving the theme classification result for the documents, where t denotes the theme corresponding to the feature words of each text, $p_{t_i}$ denotes the probability that document d corresponds to the i-th theme, and $p_{w_i}$ denotes the probability of generating the i-th word.
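As a worked illustration of the probability relation recited above: in the usual LDA formulation the product P(word | theme) * P(theme | document) is summed over all k themes to obtain P(word | document). The sketch below assumes that reading; the function and argument names are hypothetical, and the distributions are supplied directly rather than trained.

```python
def word_given_document(theme_probs, word_probs_per_theme, word):
    """Return P(word | document) as a mixture over themes.

    theme_probs is the list of P(theme | document) values for one
    document; word_probs_per_theme is a list of dicts, one per theme,
    mapping word -> P(word | theme). A word missing from a theme's
    dict is treated as having probability 0 under that theme.
    """
    return sum(
        p_theme * word_probs.get(word, 0.0)
        for p_theme, word_probs in zip(theme_probs, word_probs_per_theme)
    )
```

For a document with theme probabilities 0.75 and 0.25, and a word that the two themes generate with probabilities 0.5 and 0.25 respectively, the result is 0.75 * 0.5 + 0.25 * 0.25 = 0.4375.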
10. The automatic news topic discovery method according to claim 1, characterized in that: in theme subdivision, similarity is computed with the cosine similarity formula (2), which uses the angle between text vectors to judge how similar the vectors are, and inter-theme similarity is computed with the same formula:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}} \qquad (2)$$
In the formula, θ is the angle between the two text vectors, n is the total number of features in text feature vector A and text feature vector B, Ai is the i-th feature of text feature vector A, and Bi is the i-th feature of text feature vector B.
CN201811417992.2A 2018-11-26 2018-11-26 Automatic news topic discovery method Active CN109710728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811417992.2A CN109710728B (en) 2018-11-26 2018-11-26 Automatic news topic discovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811417992.2A CN109710728B (en) 2018-11-26 2018-11-26 Automatic news topic discovery method

Publications (2)

Publication Number Publication Date
CN109710728A true CN109710728A (en) 2019-05-03
CN109710728B CN109710728B (en) 2022-05-17

Family

ID=66255118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811417992.2A Active CN109710728B (en) 2018-11-26 2018-11-26 Automatic news topic discovery method

Country Status (1)

Country Link
CN (1) CN109710728B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN105354333A (en) * 2015-12-07 2016-02-24 天云融创数据科技(北京)有限公司 Topic extraction method based on news text
US20160292157A1 (en) * 2015-04-06 2016-10-06 Adobe Systems Incorporated Trending topic extraction from social media
KR20160136014A (en) * 2015-05-19 2016-11-29 한국과학기술원 Method and system for topic clustering of big data
CN106339495A (en) * 2016-08-31 2017-01-18 广州智索信息科技有限公司 Topic detection method and system based on hierarchical incremental clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIXUE HAN: "A parameter-free hybrid clustering algorithm used for malware categorization", 《2009 3RD INTERNATIONAL CONFERENCE ON ANTI-COUNTERFEITING, SECURITY, AND IDENTIFICATION IN COMMUNICATION》 *
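张亚男 et al.: "A microblog hot topic discovery method based on hybrid clustering", 《杭州电子科技大学学报(自然科学版)》 (Journal of Hangzhou Dianzi University, Natural Science Edition) *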

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110298033A (en) * 2019-05-29 2019-10-01 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling trains extracting tool
CN110377728A (en) * 2019-06-06 2019-10-25 上海星济信息科技有限公司 Lteral data processing method, system, medium and device
CN110377695A (en) * 2019-06-17 2019-10-25 广州艾媒数聚信息咨询股份有限公司 A kind of public sentiment subject data clustering method, device and storage medium
CN110377695B (en) * 2019-06-17 2022-11-22 广州艾媒数聚信息咨询股份有限公司 Public opinion theme data clustering method and device and storage medium
CN110362685A (en) * 2019-07-22 2019-10-22 腾讯科技(武汉)有限公司 Clustering method and cluster equipment
CN110489558A (en) * 2019-08-23 2019-11-22 网易传媒科技(北京)有限公司 Polymerizable clc method and apparatus, medium and calculating equipment
CN111104511B (en) * 2019-11-18 2023-09-29 腾讯科技(深圳)有限公司 Method, device and storage medium for extracting hot topics
CN111104511A (en) * 2019-11-18 2020-05-05 腾讯科技(深圳)有限公司 Method and device for extracting hot topics and storage medium
CN111309911A (en) * 2020-02-17 2020-06-19 昆明理工大学 Case topic discovery method for judicial field
CN111309911B (en) * 2020-02-17 2022-06-14 昆明理工大学 Case topic discovery method for judicial field
CN111339303B (en) * 2020-03-06 2023-08-22 成都晓多科技有限公司 Text intention induction method and device based on clustering and automatic abstracting
CN111339303A (en) * 2020-03-06 2020-06-26 成都晓多科技有限公司 Text intention induction method and device based on clustering and automatic summarization
CN111339784A (en) * 2020-03-06 2020-06-26 支付宝(杭州)信息技术有限公司 Automatic new topic mining method and system
CN111460153B (en) * 2020-03-27 2023-09-22 深圳价值在线信息科技股份有限公司 Hot topic extraction method, device, terminal equipment and storage medium
CN111460153A (en) * 2020-03-27 2020-07-28 深圳价值在线信息科技股份有限公司 Hot topic extraction method and device, terminal device and storage medium
CN111598012A (en) * 2020-05-19 2020-08-28 恒睿(重庆)人工智能技术研究院有限公司 Picture clustering management method, system, device and medium
CN111966792A (en) * 2020-09-03 2020-11-20 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN111966792B (en) * 2020-09-03 2023-07-25 网易(杭州)网络有限公司 Text processing method and device, electronic equipment and readable storage medium
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN112949710A (en) * 2021-02-26 2021-06-11 北京百度网讯科技有限公司 Image clustering method and device
US11804069B2 (en) 2021-02-26 2023-10-31 Beijing Baidu Netcom Science And Technology Co., Ltd. Image clustering method and apparatus, and storage medium
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product
CN112861990B (en) * 2021-03-05 2022-11-04 电子科技大学 Topic clustering method and device based on keywords and entities and computer readable storage medium
CN112861990A (en) * 2021-03-05 2021-05-28 电子科技大学 Topic clustering method and device based on keywords and entities and computer-readable storage medium
CN113052245B (en) * 2021-03-30 2023-08-25 重庆紫光华山智安科技有限公司 Image clustering method and device, electronic equipment and storage medium
CN113052245A (en) * 2021-03-30 2021-06-29 重庆紫光华山智安科技有限公司 Image clustering method and device, electronic equipment and storage medium
CN112800253A (en) * 2021-04-09 2021-05-14 腾讯科技(深圳)有限公司 Data clustering method, related device and storage medium
CN113407792A (en) * 2021-07-06 2021-09-17 亿览在线网络技术(北京)有限公司 Topic-based text input method
CN113407792B (en) * 2021-07-06 2024-03-26 亿览在线网络技术(北京)有限公司 Topic-based text input method
CN114579739B (en) * 2022-01-12 2023-05-30 中国电子科技集团公司第十研究所 Topic detection and tracking method for text data stream
CN114579739A (en) * 2022-01-12 2022-06-03 中国电子科技集团公司第十研究所 Topic detection and tracking method for text data stream
CN116361470A (en) * 2023-04-03 2023-06-30 北京中科闻歌科技股份有限公司 Text clustering cleaning and merging method based on topic description
CN116361470B (en) * 2023-04-03 2024-05-14 北京中科闻歌科技股份有限公司 Text clustering cleaning and merging method based on topic description

Also Published As

Publication number Publication date
CN109710728B (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant