CN109710728A - News topic automatic discovering method - Google Patents
- Publication number
- CN109710728A (application CN201811417992.2A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- text
- theme
- batch
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic news topic discovery method, intended to provide a method that can improve the accuracy of news topic discovery. The technical scheme is: first set the incremental-clustering parameters and the incremental-clustering trigger parameters, and cluster the incremental data in batches; preprocess the input text, unify the articles' text format and encoding, compute text features, generate text feature vectors, extract feature words, and build the text feature vector set; within a batch, first do topic clustering, then in-topic hierarchical clustering; then, for each single-point topic, compute its similarity to all clusters, i.e. the distance from the single point to each cluster centre, and merge it into the largest such cluster; agglomerate clusters across batches to complete between-topic hierarchical clustering; generate news topics and perform new-cluster fusion, comparing each new cluster centroid with the existing history cluster centroids; finally, fuse the newly added data's clustering result with the existing clustering result across batches.
Description
Technical field
The present invention relates to the field of text mining, and in particular to hot news topic mining: a news topic discovery method based on multi-step hybrid incremental clustering.
Background technique
Internet news, one of the most important information types in internet media, is an important channel through which people obtain news. The internet touches every aspect of life, web information grows explosively, and news data is massive. Faced with such rapidly growing, massive web information, it becomes ever harder for an ordinary user to find the information he or she needs; for government departments, discovering in a timely and accurate way the hot topics that ordinary users care about, and managing online public opinion, has become a very important discipline. Because of the complexity and redundancy of the news proliferating on the network, the workload of finding news topics manually is enormous, which poses a huge challenge to obtaining the needed information from massive news. Although some large domestic portal sites regularly publish special news pages covering the hot topics and focus events that most ordinary users follow within a certain period, at this stage those sites still mainly rely on manual screening and editing to collect related news into a special page. This approach has serious drawbacks: it cannot satisfy users' needs promptly and effectively, it often carries the personal bias of the web editor, and it cannot reflect the true situation of an event completely objectively and neutrally. How to quickly and accurately extract, from these tens of billions of web pages, the popular information and hot topics that ordinary users care about, and especially the major events that occurred recently, has therefore become a real need of ordinary users.
Here, an event (Event) refers to a specific happening: at what time, at what place, involving which people. A topic (Topic) is regarded as composed of a triggering event and the related events caused by that event; a topic can be considered a set of multiple events. A report (Story) is a related news report on some event, mainly covering internet news articles, TV news report segments and broadcast news. Segmenting news reports means cutting a stream of news reports into independent report segments about different topics.
The main task of news topic detection (TD) is to find and identify newly occurring hot events and hot topics from the news report stream generated in real time. Classical topic detection and tracking systems are all built on transforming the text content of news reports into the vector space model (VSM), but VSM introduces noise when representing text, which brings errors to the detection of new events. A topic detection system in practical use generally needs to cluster news report documents incrementally in real time, and must be able to find newly occurring events, and sudden critical events, quickly and accurately. Compared with topic detection and tracking, hot news topic mining, i.e. the automatic discovery of hot news topics, has greater application value. Traditional topic discovery mainly targets long texts and news data sets; large-scale short texts are sparse, unstructured and noisy, so conventional methods have great difficulty discovering topics effectively. Existing hot news topic discovery systems all have problems: the accuracy of focus-event detection is not high, and mis-assignment is very common; most hot news events are merely enumerated as news reports, without extracting feature words and summary information for the focus events, which is very unfavourable to users' news browsing and reading. The focus-event topics of many portal sites are mostly edited and organised manually; that approach can only select topics that have been hot for a month or a year, cannot publish the latest sudden events in time, and so has weak real-time performance. Although topic discovery technology can help people quickly obtain the information they are interested in from abundant internet information resources, and help governments, enterprises and institutions grasp the latest public-opinion dynamics in time, traditional text classification methods cannot be used directly in a topic detection system: the news report documents handled by topic detection and tracking grow dynamically, and as time passes the number of news documents keeps increasing. Applying text classification or clustering directly to these dynamically growing news documents, with an excessive number of documents, incurs excessive space-time complexity, so existing text classification and clustering algorithms need to be improved to fit the characteristics of topic detection and tracking.
Topic discovery means automatically finding topics in massive news data, combing the data with topics as the index, and aggregating and organising content related to each topic together. Topic discovery methods essentially use incremental clustering to gather texts with similar content in massive text data into the same theme, so that topic similarity is high between texts in the same class and low between texts of different classes. Traditional text incremental clustering methods fall into two classes.
The first class iteratively re-clusters all data, performing a one-off clustering over all data at fixed time intervals. Its advantage is high precision; its disadvantages are that earlier clustering results cannot be reused, resources are wasted, and consistency between successive clustering results cannot be guaranteed. The second class reuses the previous clustering result: newly added data is assigned to the nearest cluster among the existing clusters, and that cluster's centroid is recomputed. Its advantage is that not all data needs to be re-clustered every time; its disadvantages are that as clusters keep growing they are prone to centroid drift so that topics lose persistence, and because new data is only compared for similarity against existing clusters, no new cluster can be generated, so topic-generation accuracy is low. Meanwhile, existing incremental clustering methods usually complete the clustering task with only one clustering method, which raises the following problems: because clustering is an unsupervised learning method that needs a specified initial number of clusters, conventional methods do not account for the inter-cluster similarity that can arise from randomly choosing cluster centroids during clustering initialisation, i.e. data that obviously belongs to the same topic is split into different topics; and no further fusion is applied to the single-point clusters produced in a one-off clustering pass, leaving multiple topics that contain only a single text.
The present invention applies a hybrid incremental clustering method to news topic discovery analysis. The technique aims to resolve the drawbacks of the above incremental clustering methods in the topic discovery process, and proposes a corresponding incremental-clustering topic discovery analysis flow.
Summary of the invention
The object of the present invention is to resolve the drawbacks of the above incremental clustering methods in the topic discovery process, and to provide an automatic news topic discovery method based on hybrid incremental clustering that realises the topic organisation and discovery of massive news data, keeps the continuity of existing topics, and improves the accuracy of topic discovery.
The above object of the invention can be realised by the following measures. An automatic news topic discovery method is characterised by the following steps. First set the incremental-clustering parameters and the incremental-clustering trigger parameters, and cluster the incremental data in batches. Within a batch, perform text preprocessing on the input text to obtain a batch of N articles: unify their text format and encoding, perform Chinese word segmentation, and remove special characters and stop words; then compute text features, generate text feature vectors, extract feature words, and build the text feature vector set. Within the batch, first do topic clustering, then in-topic hierarchical clustering, subdividing topics; then do between-topic hierarchical clustering, merging topics: run one more bottom-up agglomerative hierarchical clustering over all the clusters obtained so far, after running a top-down divisive hierarchical clustering over all the articles in each topic that gradually subdivides the articles in a topic into smaller and smaller clusters. Over the small clusters obtained above, perform single-point topic merging and non-single-point merging, selecting the relatively similar articles among the m articles; do between-batch single-point topic merging: for every article of each single-point topic, compute its similarity to all clusters, i.e. the distance from the single point to each cluster centre, and merge it into the largest cluster whose similarity exceeds the threshold. Then perform between-batch clustering: run one more bottom-up agglomerative hierarchical clustering over the groups of clusters produced by the independent per-batch processing, so that clusters across batches agglomerate with each other, completing between-topic hierarchical clustering. Over the clustering result, sort the keywords by weight and extract the top group of words with the highest weight to represent each news topic; generate news topics and perform new-cluster fusion: compare each new cluster centroid with the existing history cluster centroids, merge the new cluster into the existing history cluster if the threshold is met, and otherwise keep it as a new cluster. Then fuse the newly added data's clustering result with the existing clustering result across batches.
The beneficial effects of the present invention are:
Unlike traditional news-media topic discovery methods, the incremental clustering method used by the invention divides massive news data into batches and applies multi-step clustering: per-batch clustering, in-batch result merging, single-point topic fusion within a batch, and cross-batch fusion of results. By clustering, merging, and re-clustering in multiple steps, it improves both resource utilisation and the accuracy of the topic discovery result. Because incremental data is clustered in batches, every clustering run can reuse the existing clustering result instead of re-clustering all data each time, saving resources and guaranteeing the consistency of clustering results.
By first doing topic clustering within a batch and then in-topic hierarchical clustering, the invention achieves the goal of subdividing topics, ensuring low similarity and strong independence between topics. Between-topic hierarchical clustering further solves the problem of a single topic being split because cluster centroids are chosen randomly during clustering initialisation. Applying further fusion to the single-point clusters produced in a one-off clustering pass removes the multiple topics that contain only a single text. Fusing the newly added data's clustering result with the existing clustering result across batches finally improves topic accuracy and guarantees topic continuity.
Clustering-result consistency. By first setting the incremental-clustering trigger parameters and clustering the incremental data in batches, every clustering run can reuse the existing clustering result; all data need not be re-clustered every time, which saves resources and guarantees the consistency of clustering results.
Topic independence. By first doing topic clustering within a batch and then in-topic hierarchical clustering, the invention achieves the goal of subdividing topics, ensuring low similarity and strong independence between topics.
Topic splitting. Between-topic hierarchical clustering further solves the problem of a single topic being split because cluster centroids are chosen randomly during clustering initialisation.
Single-point topics. By applying further fusion to the single-point clusters produced in a one-off clustering pass, the invention removes the multiple topics that contain only a single text. Fusing the newly added data's clustering result with the existing clustering result across batches finally improves topic accuracy and guarantees topic continuity.
Detailed description of the invention
Fig. 1 is a schematic diagram of the automatic news topic discovery processing flow of the present invention.
Fig. 2 is a schematic diagram of the text preprocessing flow of Fig. 1.
Fig. 3 is an embodiment flow diagram of the multi-step hybrid incremental clustering processing of the present invention.
To make the object, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.
Specific embodiment
Referring to Fig. 1. According to the invention, first set the incremental-clustering parameters and the incremental-clustering trigger parameters, and cluster the incremental data in batches. Perform text preprocessing on the input text to obtain a batch of N articles: unify their text format and encoding, perform Chinese word segmentation, and remove special characters and stop words; within the batch first do topic clustering, then in-topic hierarchical clustering, completing between-topic hierarchical clustering; then compute text features and generate text feature vectors; extract feature words and build the text feature vector set. Judge whether the amount of input text reaches the single-batch cluster text amount batchsize, or whether the elapsed time reaches the single-batch cluster time interval timeout: if the time interval is reached even though the single-batch text amount is not, clustering starts automatically. Apply further fusion to the single-point clusters produced in the one-off clustering pass, apply hybrid incremental clustering to the input text, and generate news topics. Over the clustering result, sort keywords by weight and extract the top group of words with the highest weight to represent each news topic. Compare each new cluster centroid with the existing history cluster centroids; if the threshold is met, merge the new cluster with the existing history cluster and perform new-cluster fusion, otherwise keep it as a new cluster and perform new-cluster fusion. Fuse the newly added data's clustering result with the existing clustering result across batches, then take keywords from the clustering result to denote topics. Judge whether to continue processing: if not, terminate; if so, keep receiving data, return to the text preprocessing of the input text, and loop until termination. The further fusion of the single-point clusters produced in the one-off clustering pass, and the cross-batch fusion of the newly added data's clustering result with the existing clustering result, finally improve topic accuracy and guarantee topic continuity.
The incremental-clustering parameters include: the single-batch cluster text amount batchsize; the single-batch cluster time interval timeout; the in-topic hierarchical-clustering similarity threshold wordSimThreshold; the between-topic hierarchical-clustering similarity threshold dficfSimThreshold; the in-batch similarity threshold innerBatchSPKNNThreshold for comparing a single-point cluster against existing clusters; and the between-batch similarity threshold crossBatchSPThreshold for comparing a single-point cluster against existing clusters.
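The parameters above can be gathered into one configuration object together with the trigger rule they drive. A minimal sketch, with field names mirroring the patent's identifiers; the default values and the `should_trigger` helper are illustrative assumptions, since the source gives no concrete numbers:

```python
from dataclasses import dataclass

@dataclass
class IncrementalClusterConfig:
    batchsize: int = 500                    # single-batch cluster text amount
    timeout: float = 3600.0                 # single-batch cluster time interval (s)
    wordSimThreshold: float = 0.6           # in-topic hierarchical-clustering threshold
    dficfSimThreshold: float = 0.5          # between-topic / between-batch threshold
    innerBatchSPKNNThreshold: float = 0.4   # in-batch single-point vs. cluster threshold
    crossBatchSPThreshold: float = 0.5      # cross-batch single-point vs. cluster threshold

    def should_trigger(self, n_texts: int, elapsed: float) -> bool:
        # Clustering starts automatically when either the accumulated text
        # count or the elapsed time reaches its configured limit.
        return n_texts >= self.batchsize or elapsed >= self.timeout
```

A run would check `cfg.should_trigger(len(buffer), now - last_run)` each time a new article arrives.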
The processing flow of the following embodiment uses these steps:
[Text preprocessing]: obtain a batch of N articles; unify their text format and encoding; perform Chinese word segmentation; remove special characters and stop words; compute text features and generate text feature vectors.
[Automatic clustering-condition judgement]: judge whether the amount of input text reaches the single-batch cluster text amount; if the single-batch cluster time interval is reached even though the single-batch text amount is not, clustering starts automatically.
[Per-batch clustering]: divide the N articles into several small batches and do in-batch clustering for each small batch. Suppose the current batch has m articles. Do in-batch topic clustering: do topic clustering over the feature words of all articles in each small batch to obtain some topics; associate the m articles with these topics, selecting one most related topic for each article, or independently generating a free topic if no related topic exists. Then do in-topic hierarchical clustering to subdivide topics: refine article topics based on inter-article similarities including title similarity, body similarity, named-entity similarity and non-named-entity similarity; run a top-down divisive hierarchical clustering over all articles in each topic. Then do between-topic hierarchical clustering to merge topics: over all the clusters obtained so far, run one more bottom-up agglomerative hierarchical clustering (equivalent to clustering across topics), resolving a single topic having been split into multiple topics. Reduce single-point topics by single-point topic merging: do merging over the small clusters obtained in the process above, both single-point and non-single-point; for each single point, select its relatively similar articles among the m articles, resolving the problem of excessive single-point topics.
[Between-batch single-point topic merging]: the clusters obtained for each batch again fall into two classes, single-point and non-single-point; run a similar process once more: for each single-point topic, compute the similarity of its article to all clusters, i.e. the distance from the single point to each cluster centre, and merge it into the largest cluster exceeding the threshold.
[Between-batch clustering]: over the results of the independent per-batch processing (i.e. groups of clusters), run one more bottom-up agglomerative hierarchical clustering between them. This step is needed because the articles were batched without any grounds in the per-batch clustering step, so clusters across batches must agglomerate with each other here.
[Generating news topics]: sort the keywords of the clustering result by weight and extract the top group of words with the highest weight to represent each news topic.
[New-cluster fusion]: compare each new cluster centroid with the existing history cluster centroids; if the threshold is met, merge the new cluster with the existing history cluster, otherwise keep it as a new cluster.
Referring to Fig. 2. This embodiment uses jieba for word segmentation; jieba requires GBK-encoded input text, so the embodiment uniformly converts the input data to GBK encoding and saves it in UTF-8 after segmentation. The text preprocessing flow provided by this embodiment performs preprocessing operations on the input text, specifically including the following steps.
First initialise all news documents as one large cluster, then carry out cluster division, splitting the large cluster into groups of clusters until the threshold of one of the clusters meets the preset threshold. Unify the text encoding format: convert the input data uniformly to a Chinese character encoding such as UTF-8 or GBK, the actual encoding and file format being set according to what the segmentation step needs. Perform Chinese word segmentation on the format-unified text data: as described above, this embodiment uses jieba with an added custom dictionary, splitting the text into individual words by part of speech. Then apply stop-word and special-character processing to the segmentation result, because the segmented text contains many stop words and special characters (such as commas and parentheses) that would affect the result during clustering; this step removes these useless stop words and special characters by introducing a stop-word dictionary and a special-character table. Based on the preprocessing result, extract feature words from the text: this embodiment extracts article feature words with term frequency-inverse document frequency (TF-IDF). Based on the extracted feature words, build the text feature vectors, i.e. word vectors representing the texts, for use in subsequent text similarity comparison: sort each article's words by TF-IDF value, take the top t words to form the feature word list, and build each article's feature vector by comparison against the feature word list, appending 1 for a word that is present and 0 otherwise, yielding each article's feature vector. Judge whether the amount of input text reaches the parameter batchsize; if the time timeout is reached even though batchsize is not, clustering starts automatically.
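The preprocessing and feature-construction steps above can be sketched in pure Python. This is a hedged illustration: `tokenize` is a whitespace placeholder standing in for jieba plus the custom dictionary, and the stop-word set is a stand-in for a real stop-word table:

```python
import math
from collections import Counter

STOPWORDS = {"的", "了", "是"}  # stand-in for a real stop-word dictionary

def tokenize(text):
    # Placeholder for jieba.cut plus custom-dictionary lookup in the real
    # pipeline; whitespace splitting keeps the sketch self-contained.
    return [w for w in text.split() if w and w not in STOPWORDS]

def tfidf(docs):
    """TF-IDF score per word per document, used for feature-word extraction."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({w: (c / len(toks)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

def feature_vectors(docs, t=5):
    """Top-t feature word list; each article gets a 0/1 vector over it
    (1 if the word appears in the article, 0 otherwise)."""
    scores = tfidf(docs)
    pooled = Counter()
    for s in scores:
        for w, v in s.items():
            pooled[w] = max(pooled[w], v)
    vocab = [w for w, _ in pooled.most_common(t)]
    return vocab, [[1 if w in set(tokenize(d)) else 0 for w in vocab]
                   for d in docs]
```

The resulting 0/1 vectors are what the later similarity comparisons operate on.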
Referring to Fig. 3. The hybrid incremental clustering flow provided by this embodiment proceeds in the following detailed steps.
Per-batch clustering: divide the N articles into several batches according to batchsize; the effect is to do per-batch clustering over each batch of data, so the number of articles in a batch needs to be limited (suppose the current batch has m articles).
In-batch topic clustering: do topic clustering over the feature words of all articles in each batch, which is equivalent to topic extraction over the content formed by these articles' feature words, obtaining some topics; associate the m articles with these topics, selecting one most related topic for each article, or independently generating a free topic if no related topic exists. This embodiment uses the LDA algorithm to generate some initial topics for the news texts within a batch. Here words are used as feature items and a text is regarded as composed of feature words; every text can contain multiple topics and corresponds to each topic with a different probability, i.e. every text corresponds to a topic probability distribution, every topic corresponds to a different word probability distribution, and every word of every text corresponds to some topic with a different probability. Among the m articles in the batch, each document d can be regarded as a word sequence w = <w1, w2, ..., wn>, where wi denotes the i-th term of the text.
Generating some initial topics for the in-batch news texts with LDA: with words as feature items, a text is regarded as composed of feature words; every text corresponds to a topic probability distribution, and every word of every text corresponds to each topic with a different probability. Let t denote a topic corresponding to a text's feature words, with k topics <t1, t2, ..., tk>. Using the formula P(word | document) = P(word | topic) * P(topic | document), summed over the topics, training yields for each document d its probability distribution over topics θd = <pt1, pt2, ..., ptk> and, for each topic, the probabilities of generating the various words, giving the document's topic classification result, where pti denotes the probability of the i-th topic for document d and pwi denotes the probability of generating the i-th word.
During topic clustering, associate the m articles with these topics, selecting one most related topic for each article. Some articles will not be associated with any topic: some keywords are filtered out because their effect is not strong, or because the keyword-clustering parameters are unsuitable, leaving the article associated with none of the above keywords; such an article is considered to be associated with a free topic.
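The mixture formula above, P(word | document) = Σt P(word | topic) * P(topic | document), and the most-related-topic selection with a free-topic fallback, can be sketched as follows. The topic and word distributions are assumed to come from an already-trained LDA model (e.g. gensim or scikit-learn), so they are passed in directly; the `min_prob` cutoff for the free-topic fallback is an illustrative assumption:

```python
def word_prob(word, topic_probs, word_probs):
    """P(word | document) = sum over topics of
       P(word | topic) * P(topic | document).
    topic_probs: theta_d, a list of P(topic | document);
    word_probs: one dict of P(word | topic) per topic."""
    return sum(pt * word_probs[t].get(word, 0.0)
               for t, pt in enumerate(topic_probs))

def assign_theme(topic_probs, min_prob=0.3):
    """Pick the most related topic for an article; an article whose best
    topic probability is too weak falls through to a free topic (None)."""
    best = max(range(len(topic_probs)), key=lambda t: topic_probs[t])
    return best if topic_probs[best] >= min_prob else None
```

`assign_theme` returning `None` corresponds to the "free topic" the text describes for articles not associated with any trained topic.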
In-topic hierarchical clustering, subdividing topics: run a top-down divisive hierarchical clustering over all articles in each topic, gradually subdividing the articles in a topic into smaller and smaller clusters. The clustering uses inter-article similarities including title similarity, body similarity, named-entity similarity and non-named-entity similarity; the similarity threshold is controlled by wordSimThreshold, and a text meeting the threshold is taken as a cluster. This embodiment computes similarity with the cosine similarity formula (1), judging the degree of similarity of vectors through the angle between text vectors: the smaller the angle, the more similar the two articles. The main purpose of this step is to subdivide topics, genuinely subdividing one coarse-grained topic into distinct topics with an obvious distinction and improving the accuracy of the clustering result. Between-topic similarity uses the same formula:

cos θ = Σ(i=1..n) Ai * Bi / ( sqrt(Σ(i=1..n) Ai²) * sqrt(Σ(i=1..n) Bi²) )    (1)

where n is the total number of features of text feature vectors A and B, Ai denotes the i-th feature of text feature vector A, Bi denotes the i-th feature of text feature vector B, and θ is the angle between the two text vectors.
Between-topic hierarchical clustering, merging topics: over all the clusters obtained so far, run one more bottom-up agglomerative hierarchical clustering (equivalent to clustering across topics), merging the clusters that meet the topic-similarity threshold; the clustering is based on title similarity, body similarity, named-entity similarity and non-named-entity similarity, with the threshold controlled by dficfSimThreshold. The main purpose of this step is to resolve the case where one real topic was split apart among the topics produced at initialisation, further improving the accuracy of the clustering result.
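Formula (1) can be implemented directly. This sketch computes the cosine of the angle between two feature vectors; returning 0 for a zero vector is a choice not specified in the text:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = sum(Ai*Bi) / (||A|| * ||B||), per formula (1);
    a smaller angle (larger cosine) means the two articles are more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Two identical feature vectors score 1.0; orthogonal vectors (no shared feature words) score 0.0.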
Reducing single-point topics, single-point topic merging: do merging over the small clusters obtained in the process above (both single-point and non-single-point), merging the single-point clusters towards the non-single-point clusters. The merging process is: for each single point, select its relatively similar articles among the m articles (articles with similarity > innerBatchSPKNNThreshold), then determine by voting which cluster holds the most of the single point's similar articles, and finally assign the single point to that cluster. This step mainly resolves the problem of excessive single-point clusters: through this further merging, genuinely related single points are merged, which both resolves the many-single-point-cluster problem and further improves the accuracy of the clustering result.
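The voting-based single-point merging described above can be sketched as follows. The `threshold` corresponds to innerBatchSPKNNThreshold; using cosine similarity as the comparison measure is an assumption consistent with the formula used elsewhere in the embodiment:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_single_points(single_vecs, clusters, threshold):
    """For each single-text cluster, find the batch articles whose
    similarity exceeds the threshold, then vote: the single point joins
    the cluster owning the most of those similar articles.
    clusters: list of member-vector lists. Returns one chosen cluster
    index per single point, or None if it stays a single-point topic."""
    assignments = []
    for sp in single_vecs:
        votes = {}
        for ci, members in enumerate(clusters):
            for m in members:
                if _cos(sp, m) > threshold:
                    votes[ci] = votes.get(ci, 0) + 1
        assignments.append(max(votes, key=votes.get) if votes else None)
    return assignments
```

A `None` result means no article was similar enough, so the single-point topic survives into the between-batch merging step.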
Between-batch single-point topic merging: the clusters obtained for each batch again fall into two classes, single-point and non-single-point; run a process similar to the in-batch single-point merging step once more. The difference is that the in-batch step works over m articles, and m is bounded, so the similarity to every article can be computed for each single point; here, however, the worst case may be n clusters with n very large, so for each single point only its distance to each cluster centre is computed, and the largest cluster exceeding the threshold is selected (the threshold is given by crossBatchSPThreshold). Fusing single points across batches further reduces single-point clusters and improves clustering accuracy.
Between-batch clustering: over the results of the independent per-batch processing (i.e. groups of clusters), run one more bottom-up agglomerative hierarchical clustering between them, with the threshold again dficfSimThreshold. The purpose of this processing step is that the articles were batched without any grounds in the per-batch clustering step, so clusters across batches must agglomerate with each other here.
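The bottom-up agglomerative pass across batches can be sketched as repeatedly merging the most similar pair of clusters until no pair exceeds the threshold. Representing a cluster by its centroid and comparing centroids with cosine similarity is an assumption: the text bases similarity on several article-level signals (title, body, named entities), which this sketch collapses into one vector comparison:

```python
import math

def _centroid(cluster):
    n = len(cluster)
    return [sum(v[i] for v in cluster) / n for i in range(len(cluster[0]))]

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(clusters, threshold):
    """Bottom-up agglomerative clustering over the per-batch results:
    merge the most similar cluster pair (centroid cosine similarity)
    until no pair exceeds the threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = _cos(_centroid(clusters[i]), _centroid(clusters[j]))
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))
    return clusters
```

The quadratic pair scan is fine for a sketch; a production pass over many batches would maintain a similarity heap instead.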
Generating new topics: finally, sort the keywords in the clustering result by weight and take the top w words to denote the main topic the cluster contains. If monitoring for topic detection and discovery is to continue, keep receiving data and repeat the preceding steps.
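Extracting a topic label from a cluster's keyword weights is a simple sort-and-take-top-w, as the step describes; the keyword-to-weight mapping here is assumed to come from the TF-IDF scores computed earlier:

```python
def topic_label(keyword_weights, w=3):
    """Sort a cluster's keywords by weight, descending, and take the top
    w words as the label representing that news topic."""
    ranked = sorted(keyword_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:w]]
```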
New class cluster fusion: the centroid of each new class cluster is compared with the centroids of the existing history class clusters; if the threshold is met, the new class cluster is merged with the existing history class cluster, otherwise it is kept as a new class cluster.
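The new class cluster fusion step can be sketched as follows (assuming a size-weighted centroid update on merge; all names are illustrative, not from the patent):

```python
def fuse_new_clusters(history, new_clusters, threshold):
    """`history` and `new_clusters` are lists of (centroid, member_ids).
    Each new cluster is merged into the most similar history cluster if the
    centroid similarity meets `threshold`, otherwise appended as a new one."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    for cent, members in new_clusters:
        sims = [(cos(cent, h_cent), i) for i, (h_cent, _) in enumerate(history)]
        best_sim, best_i = max(sims) if sims else (0.0, None)
        if best_i is not None and best_sim >= threshold:
            h_cent, h_members = history[best_i]
            n_old, n_new = len(h_members), len(members)
            # Size-weighted centroid update keeps the history centroid stable.
            merged_cent = [(a * n_old + b * n_new) / (n_old + n_new)
                           for a, b in zip(h_cent, cent)]
            history[best_i] = (merged_cent, h_members + members)
        else:
            history.append((cent, members))
    return history
```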
The above are preferred embodiments of the present invention. It should be noted that although the present invention has been described with reference to the above embodiments, the invention is not limited thereto, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. For those skilled in the art, various changes and modifications can be made without departing from the spirit and substance of the present invention, and these variations and modifications are also considered within the protection scope of the present invention.
Claims (10)
1. A news topic automatic discovery method, characterized by comprising the following steps: first setting incremental clustering parameters and an incremental clustering trigger parameter, and clustering incremental data in batches; performing a text preprocessing operation on the input text within a batch to obtain a batch of N articles, unifying their text format encoding, performing Chinese word segmentation, and removing special characters and stop words; computing text features, generating text feature vectors, extracting text feature words, and constructing a text feature vector set; within the batch, first performing topic clustering, then performing intra-topic hierarchical clustering to subdivide the topics, and then performing inter-topic hierarchical clustering: for all clusters obtained previously, merging topics by running a bottom-up agglomerative hierarchical clustering; for all articles in each topic, performing a top-down divisive hierarchical clustering that gradually subdivides the articles in the topic into smaller and smaller clusters; for the small clusters obtained above, performing single-point and non-single-point merging, selecting similar articles within the topic from the m articles; performing cross-batch single-point topic merging, computing the similarity of each single-point topic article to all clusters, i.e. the distance from each single point to each cluster centre, and merging the single point into the largest class cluster exceeding the threshold; then performing clustering between batches: running a bottom-up agglomerative hierarchical clustering again over the group of clusters produced by independent batch processing, so that clusters spanning batches agglomerate, completing inter-topic hierarchical clustering; for the clustering result, sorting keywords by weight and extracting the group of highest-weight words that represent a new topic, generating the news topic; performing new class cluster fusion, comparing the new class cluster centroid with the existing history class cluster centroids, merging the new class cluster with the existing history class cluster if the threshold is met, otherwise keeping it as a new class cluster; and then merging the clustering result of the newly added data with the existing clustering result across batches.
2. The news topic automatic discovery method according to claim 1, characterized in that: single-point class clusters produced during a one-pass clustering process are given further fusion processing, and the clustering result of newly added data is merged with the existing clustering result across batches, which ultimately improves topic accuracy and guarantees topic continuity.
3. The news topic automatic discovery method according to claim 1, characterized in that the incremental clustering parameter settings include: the single-batch clustering text amount batchsize, the single-batch clustering time interval, the intra-topic hierarchical clustering similarity threshold wordSimThreshold, the inter-topic hierarchical clustering similarity threshold, the similarity threshold for comparing an intra-batch single-point cluster with existing class clusters, and the similarity threshold for comparing a cross-batch single-point cluster with existing class clusters.
4. The news topic automatic discovery method according to claim 1, characterized in that: in batch clustering, the N articles are divided into several small batches, and clustering is performed within each small batch; assuming the current batch contains m articles, intra-batch topic clustering is performed as follows: topic clustering is done on the feature words of all articles in each small batch to obtain some topics, the m articles are associated with these topics, and the most related topic is selected for each article; if an article has no related topic, an independent free topic is generated for it; intra-topic hierarchical clustering and topic subdivision are then performed.
5. The news topic automatic discovery method according to claim 1, characterized in that: the similarity between articles includes title similarity, body text similarity, named entity similarity and non-named-entity similarity, which are used to refine article topics.
6. The news topic automatic discovery method according to claim 1, characterized in that: the text encoding format is unified by converting the input data uniformly to UTF-8 or GBK Chinese character encoding, and Chinese word segmentation is performed on the text data in the unified format; segmentation uses jieba with custom dictionaries added, splitting the text into individual words by part of speech; stop-word and special-character processing is then applied to the segmentation result, introducing a stop-word dictionary and a special-character table to remove useless stop words and special characters; feature words are then extracted from the text based on the text preprocessing result.
7. The news topic automatic discovery method according to claim 1, characterized in that: feature words are extracted from articles using term frequency-inverse document frequency TF-IDF; based on the feature word extraction result, text feature vectors are constructed, i.e. the text is represented by word vectors, for use in subsequent text similarity comparison.
8. The news topic automatic discovery method according to claim 1, characterized in that: for every article, the t words with the largest TF-IDF values are chosen to form a feature word list; the feature vector of every article is then obtained by comparison with the feature word list, adding the value if the word is present and 0 otherwise, so that every article has its own feature vector; whether the input text amount meets the parameter batchsize is judged, and clustering begins automatically if batchsize is reached, or if the time timeout is reached before batchsize has been reached.
9. The news topic automatic discovery method according to claim 1, characterized in that: the LDA algorithm is used to generate some initial topics for the news text within a batch: with words as feature items, a text is regarded as being composed of feature words; every text corresponds to one topic probability distribution, and each topic corresponds to a different word probability distribution; for k topics <t1, t2, ..., tk>, training with the formula P(word | document) = Σ P(word | topic) × P(topic | document) yields, for each document d, its probability distribution over the topics θd = <pt1, pt2, ..., ptk> and, for each topic, the probability distribution over generated words φt = <pw1, pw2, ..., pwn>, giving the document topic classification result, where t represents a topic corresponding to the feature words of each text, pti represents the probability that document d corresponds to the i-th topic, and pwi represents the probability of generating the i-th word.
10. The news topic automatic discovery method according to claim 1, characterized in that: in topic subdivision, the cosine similarity calculation formula (2) is used to calculate the angle between text vectors to judge the degree of similarity of the vectors, and inter-topic similarity calculation uses the same formula:

cos θ = Σ(i=1..n) Ai·Bi / (√(Σ(i=1..n) Ai²) · √(Σ(i=1..n) Bi²))  (2)

where θ represents the angle between the two text vectors, n represents the total number of features of text feature vector A and text feature vector B, Ai represents the i-th feature of text feature vector A, and Bi represents the i-th feature of text feature vector B.
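Formula (2) of claim 10 corresponds directly to the following plain implementation of cosine similarity (a sketch for illustration, not code from the patent):

```python
import math

def cosine_similarity(A, B):
    """Formula (2): cos(theta) between text feature vectors A and B."""
    dot = sum(a * b for a, b in zip(A, B))
    norm = math.sqrt(sum(a * a for a in A)) * math.sqrt(sum(b * b for b in B))
    return dot / norm if norm else 0.0
```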
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811417992.2A CN109710728B (en) | 2018-11-26 | 2018-11-26 | Automatic news topic discovery method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109710728A true CN109710728A (en) | 2019-05-03 |
CN109710728B CN109710728B (en) | 2022-05-17 |
Family
ID=66255118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811417992.2A Active CN109710728B (en) | 2018-11-26 | 2018-11-26 | Automatic news topic discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710728B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298033A (en) * | 2019-05-29 | 2019-10-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling trains extracting tool |
CN110362685A (en) * | 2019-07-22 | 2019-10-22 | 腾讯科技(武汉)有限公司 | Clustering method and cluster equipment |
CN110377728A (en) * | 2019-06-06 | 2019-10-25 | 上海星济信息科技有限公司 | Lteral data processing method, system, medium and device |
CN110377695A (en) * | 2019-06-17 | 2019-10-25 | 广州艾媒数聚信息咨询股份有限公司 | A kind of public sentiment subject data clustering method, device and storage medium |
CN110489558A (en) * | 2019-08-23 | 2019-11-22 | 网易传媒科技(北京)有限公司 | Polymerizable clc method and apparatus, medium and calculating equipment |
CN111104511A (en) * | 2019-11-18 | 2020-05-05 | 腾讯科技(深圳)有限公司 | Method and device for extracting hot topics and storage medium |
CN111309911A (en) * | 2020-02-17 | 2020-06-19 | 昆明理工大学 | Case topic discovery method for judicial field |
CN111339784A (en) * | 2020-03-06 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Automatic new topic mining method and system |
CN111339303A (en) * | 2020-03-06 | 2020-06-26 | 成都晓多科技有限公司 | Text intention induction method and device based on clustering and automatic summarization |
CN111460153A (en) * | 2020-03-27 | 2020-07-28 | 深圳价值在线信息科技股份有限公司 | Hot topic extraction method and device, terminal device and storage medium |
CN111598012A (en) * | 2020-05-19 | 2020-08-28 | 恒睿(重庆)人工智能技术研究院有限公司 | Picture clustering management method, system, device and medium |
CN111966792A (en) * | 2020-09-03 | 2020-11-20 | 网易(杭州)网络有限公司 | Text processing method and device, electronic equipment and readable storage medium |
CN112580355A (en) * | 2020-12-30 | 2021-03-30 | 中科院计算技术研究所大数据研究院 | News information topic detection and real-time aggregation method |
CN112800253A (en) * | 2021-04-09 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Data clustering method, related device and storage medium |
CN112861990A (en) * | 2021-03-05 | 2021-05-28 | 电子科技大学 | Topic clustering method and device based on keywords and entities and computer-readable storage medium |
CN112926298A (en) * | 2021-03-02 | 2021-06-08 | 北京百度网讯科技有限公司 | News content identification method, related device and computer program product |
CN112949710A (en) * | 2021-02-26 | 2021-06-11 | 北京百度网讯科技有限公司 | Image clustering method and device |
CN113052245A (en) * | 2021-03-30 | 2021-06-29 | 重庆紫光华山智安科技有限公司 | Image clustering method and device, electronic equipment and storage medium |
CN113407792A (en) * | 2021-07-06 | 2021-09-17 | 亿览在线网络技术(北京)有限公司 | Topic-based text input method |
CN114579739A (en) * | 2022-01-12 | 2022-06-03 | 中国电子科技集团公司第十研究所 | Topic detection and tracking method for text data stream |
CN116361470A (en) * | 2023-04-03 | 2023-06-30 | 北京中科闻歌科技股份有限公司 | Text clustering cleaning and merging method based on topic description |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320646A (en) * | 2015-11-17 | 2016-02-10 | 天津大学 | Incremental clustering based news topic mining method and apparatus thereof |
CN105354333A (en) * | 2015-12-07 | 2016-02-24 | 天云融创数据科技(北京)有限公司 | Topic extraction method based on news text |
US20160292157A1 (en) * | 2015-04-06 | 2016-10-06 | Adobe Systems Incorporated | Trending topic extraction from social media |
KR20160136014A (en) * | 2015-05-19 | 2016-11-29 | 한국과학기술원 | Method and system for topic clustering of big data |
CN106339495A (en) * | 2016-08-31 | 2017-01-18 | 广州智索信息科技有限公司 | Topic detection method and system based on hierarchical incremental clustering |
Non-Patent Citations (2)
Title |
---|
ZHIXUE HAN: "A parameter-free hybrid clustering algorithm used for malware categorization", 《2009 3RD INTERNATIONAL CONFERENCE ON ANTI-COUNTERFEITING, SECURITY, AND IDENTIFICATION IN COMMUNICATION》 * |
张亚男 等: "基于混合聚类的微博热点话题发现方法", 《杭州电子科技大学学报(自然科学版)》 * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298033B (en) * | 2019-05-29 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling training extraction system |
CN110298033A (en) * | 2019-05-29 | 2019-10-01 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling trains extracting tool |
CN110377728A (en) * | 2019-06-06 | 2019-10-25 | 上海星济信息科技有限公司 | Lteral data processing method, system, medium and device |
CN110377695A (en) * | 2019-06-17 | 2019-10-25 | 广州艾媒数聚信息咨询股份有限公司 | A kind of public sentiment subject data clustering method, device and storage medium |
CN110377695B (en) * | 2019-06-17 | 2022-11-22 | 广州艾媒数聚信息咨询股份有限公司 | Public opinion theme data clustering method and device and storage medium |
CN110362685A (en) * | 2019-07-22 | 2019-10-22 | 腾讯科技(武汉)有限公司 | Clustering method and cluster equipment |
CN110489558A (en) * | 2019-08-23 | 2019-11-22 | 网易传媒科技(北京)有限公司 | Polymerizable clc method and apparatus, medium and calculating equipment |
CN111104511B (en) * | 2019-11-18 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Method, device and storage medium for extracting hot topics |
CN111104511A (en) * | 2019-11-18 | 2020-05-05 | 腾讯科技(深圳)有限公司 | Method and device for extracting hot topics and storage medium |
CN111309911A (en) * | 2020-02-17 | 2020-06-19 | 昆明理工大学 | Case topic discovery method for judicial field |
CN111309911B (en) * | 2020-02-17 | 2022-06-14 | 昆明理工大学 | Case topic discovery method for judicial field |
CN111339303B (en) * | 2020-03-06 | 2023-08-22 | 成都晓多科技有限公司 | Text intention induction method and device based on clustering and automatic abstracting |
CN111339303A (en) * | 2020-03-06 | 2020-06-26 | 成都晓多科技有限公司 | Text intention induction method and device based on clustering and automatic summarization |
CN111339784A (en) * | 2020-03-06 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Automatic new topic mining method and system |
CN111460153B (en) * | 2020-03-27 | 2023-09-22 | 深圳价值在线信息科技股份有限公司 | Hot topic extraction method, device, terminal equipment and storage medium |
CN111460153A (en) * | 2020-03-27 | 2020-07-28 | 深圳价值在线信息科技股份有限公司 | Hot topic extraction method and device, terminal device and storage medium |
CN111598012A (en) * | 2020-05-19 | 2020-08-28 | 恒睿(重庆)人工智能技术研究院有限公司 | Picture clustering management method, system, device and medium |
CN111966792A (en) * | 2020-09-03 | 2020-11-20 | 网易(杭州)网络有限公司 | Text processing method and device, electronic equipment and readable storage medium |
CN111966792B (en) * | 2020-09-03 | 2023-07-25 | 网易(杭州)网络有限公司 | Text processing method and device, electronic equipment and readable storage medium |
CN112580355A (en) * | 2020-12-30 | 2021-03-30 | 中科院计算技术研究所大数据研究院 | News information topic detection and real-time aggregation method |
CN112949710A (en) * | 2021-02-26 | 2021-06-11 | 北京百度网讯科技有限公司 | Image clustering method and device |
US11804069B2 (en) | 2021-02-26 | 2023-10-31 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Image clustering method and apparatus, and storage medium |
CN112926298A (en) * | 2021-03-02 | 2021-06-08 | 北京百度网讯科技有限公司 | News content identification method, related device and computer program product |
CN112861990B (en) * | 2021-03-05 | 2022-11-04 | 电子科技大学 | Topic clustering method and device based on keywords and entities and computer readable storage medium |
CN112861990A (en) * | 2021-03-05 | 2021-05-28 | 电子科技大学 | Topic clustering method and device based on keywords and entities and computer-readable storage medium |
CN113052245B (en) * | 2021-03-30 | 2023-08-25 | 重庆紫光华山智安科技有限公司 | Image clustering method and device, electronic equipment and storage medium |
CN113052245A (en) * | 2021-03-30 | 2021-06-29 | 重庆紫光华山智安科技有限公司 | Image clustering method and device, electronic equipment and storage medium |
CN112800253A (en) * | 2021-04-09 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Data clustering method, related device and storage medium |
CN113407792A (en) * | 2021-07-06 | 2021-09-17 | 亿览在线网络技术(北京)有限公司 | Topic-based text input method |
CN113407792B (en) * | 2021-07-06 | 2024-03-26 | 亿览在线网络技术(北京)有限公司 | Topic-based text input method |
CN114579739B (en) * | 2022-01-12 | 2023-05-30 | 中国电子科技集团公司第十研究所 | Topic detection and tracking method for text data stream |
CN114579739A (en) * | 2022-01-12 | 2022-06-03 | 中国电子科技集团公司第十研究所 | Topic detection and tracking method for text data stream |
CN116361470A (en) * | 2023-04-03 | 2023-06-30 | 北京中科闻歌科技股份有限公司 | Text clustering cleaning and merging method based on topic description |
CN116361470B (en) * | 2023-04-03 | 2024-05-14 | 北京中科闻歌科技股份有限公司 | Text clustering cleaning and merging method based on topic description |
Also Published As
Publication number | Publication date |
---|---|
CN109710728B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109710728A (en) | News topic automatic discovering method | |
Liu et al. | TASC: Topic-adaptive sentiment classification on dynamic tweets | |
Hai et al. | Identifying features in opinion mining via intrinsic and extrinsic domain relevance | |
KR101536520B1 (en) | Method and server for extracting topic and evaluating compatibility of the extracted topic | |
Li et al. | Twiner: named entity recognition in targeted twitter stream | |
Tago et al. | Influence analysis of emotional behaviors and user relationships based on Twitter data | |
Shi et al. | Learning-to-rank for real-time high-precision hashtag recommendation for streaming news | |
Gao et al. | Filtering of brand-related microblogs using social-smooth multiview embedding | |
Reinanda et al. | Document filtering for long-tail entities | |
Zeng et al. | What you say and how you say it: Joint modeling of topics and discourse in microblog conversations | |
Li et al. | A joint model of conversational discourse and latent topics on microblogs | |
Raghuvanshi et al. | A brief review on sentiment analysis | |
Šavelka et al. | Legal information retrieval for understanding statutory terms | |
Alsaedi et al. | Feature extraction and analysis for identifying disruptive events from social media | |
Xu et al. | Mining Web search engines for query suggestion | |
Santhosh et al. | A multi-model intelligent approach for rumor detection in social networks | |
Avigdor-Elgrabli et al. | Structural clustering of machine-generated mail | |
Li et al. | Modeling topic and community structure in social tagging: The TTR‐LDA‐Community model | |
Wicaksono et al. | Toward advice mining: Conditional random fields for extracting advice-revealing text units | |
Yilmaz et al. | Inferring political alignments of Twitter users | |
Jansson et al. | Topic modelling enriched LSTM models for the detection of novel and emerging named entities from social media | |
Liao et al. | Representativeness-aware aspect analysis for brand monitoring in social media | |
Farzi et al. | Katibeh: A Persian news summarizer using the novel semi-supervised approach | |
Zhao et al. | Modeling Chinese microblogs with five Ws for topic hashtags extraction | |
Wadawadagi et al. | An enterprise perspective of web content analysis research: a strategic road-map |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||