CN107145568A - A fast news event clustering system and method - Google Patents
A fast news event clustering system and method
- Publication number: CN107145568A
- Application number: CN201710309001.8A
- Authority: CN (China)
- Prior art keywords: cluster, document, result, text, word segmentation
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Abstract
The invention discloses a fast news event clustering system and method. The system includes: a news crawling module, which crawls news documents from news portals, forums, and microblogs and performs preliminary deduplication on the text; a news text preprocessing module, which performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases; a news text event clustering module, which builds permutations and combinations of the segmented words, maps document d to first-layer clusters, computes the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates a new sub-cluster; and a data storage module, which stores the computed results. The invention can cluster news events quickly when handling large volumes of public opinion data.
Description
Technical field
The present invention relates to the field of news information, and in particular to a fast news event clustering system and method.
Background
With the rapid development of the internet, online public opinion has a growing influence on society. Whether a government needs to monitor online public opinion or an enterprise needs to manage brand communication and public relations, an urgent problem in public opinion analysis is how to analyze the sentiment of a large volume of public opinion quickly, so as to support timely decision-making and opinion guidance and to respond to a rapidly changing opinion environment. Conventional sentiment analysis requires complex analysis and cannot achieve low-latency processing when faced with large volumes of public opinion data.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a fast news event clustering system and method that clusters news events quickly when handling large volumes of public opinion data.
The object of the invention is achieved through the following technical solutions:
A fast news event clustering system, including:
News crawling module: crawls news documents from news portals, forums, and microblogs, and performs preliminary deduplication on the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: builds permutations and combinations of the segmented words, maps document d to first-layer clusters, computes the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates a new sub-cluster;
Data storage module: stores the computed results.
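The text does not say how the preliminary deduplication in the news crawling module is performed. As one possible illustration only, the sketch below drops exact duplicates by hashing a normalized copy of each text. The function names `fingerprint` and `deduplicate`, and the normalization (lowercasing and collapsing whitespace), are assumptions and are not taken from the patent.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first occurrence of each distinct fingerprint, preserving input order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

A near-duplicate detector could be substituted for `fingerprint` without changing the rest of the pipeline.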
A fast news event clustering method, comprising the following steps:
S01: Crawl text d and deduplicate documents;
S02: Extract the text title, segment the title into words, and retain only nouns and verbs;
S03: Build permutations and combinations of the title words to obtain n combinations; each combination serves as a key of a first-layer cluster;
S04: Within each first-layer cluster (major class), compute the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, this produces m results;
S05: Sort the m × n results and take the largest value, denoted r; at the same time set an empirical threshold g in the range [0.75, 1];
S06: If r >= g, text d belongs to the class where r was obtained;
S07: If r < g, create a new subclass: from the sub-cluster results, compute the mean cosine distance of each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a subclass with document d as its centroid.
Further, the text processing in step S02 specifically includes extracting the text title, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
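The part-of-speech filter of step S02 can be sketched as follows. The sketch assumes the title has already been segmented and tagged by some external tool, and that noun tags start with "n" and verb tags start with "v". The tag scheme and the function name `filter_title_tokens` are assumptions; the patent names neither a segmenter nor a tag set.

```python
def filter_title_tokens(tagged_tokens):
    """Keep only words whose tag marks a noun ('n...') or a verb ('v...').

    tagged_tokens is a list of (word, tag) pairs produced by an external
    segmenter and tagger, which is outside the scope of this sketch.
    """
    return [word for word, tag in tagged_tokens if tag.startswith(("n", "v"))]


# Hypothetical tagged title, for illustration only.
tagged_title = [("typhoon", "n"), ("hits", "v"), ("the", "x"),
                ("southern", "a"), ("coast", "ns")]
print(filter_title_tokens(tagged_title))
```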
Further, the first-layer clustering in step S03 specifically includes building permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a string with a separator, and the resulting string is the key of a first-layer cluster. For text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
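A minimal sketch of the key generation in step S03 follows. The patent does not state which combination sizes are used, so the sketch assumes every combination of at least `min_size` distinct words; `min_size`, the `|` separator, and the function name `first_layer_keys` are all illustrative assumptions.

```python
from itertools import combinations


def first_layer_keys(tokens, min_size=2, separator="|"):
    """Build one key per word combination: words sorted, then joined by a separator."""
    unique = sorted(set(tokens))  # sorting first makes every combination come out sorted
    keys = []
    for size in range(min_size, len(unique) + 1):
        for combo in combinations(unique, size):
            keys.append(separator.join(combo))
    return keys


print(first_layer_keys(["typhoon", "coast", "hits"]))
```

Because the words are sorted before joining, two titles that contain the same words in a different order produce the same keys. A title with fewer than `min_size` distinct words yields no keys under this assumption.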
Further, the processing in step S04 specifically includes: for the n keys obtained in S03, retrieve n cluster results (first-layer cluster results); for each cluster result, assuming m sub-clusters already exist (second-layer clusters), compute the similarity between the word segmentation result of document d and the centroid of each sub-cluster. The similarity algorithm includes but is not limited to the cosine distance algorithm. This step outputs m × n result values.
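The scoring in step S04 can be sketched as below. The sketch makes several assumptions that the patent does not fix: documents and centroids are bag-of-words count vectors, the "cosine distance" is computed as cosine similarity (larger means closer, which matches taking the maximum and comparing it with a threshold near 1), and the first-layer index is a dictionary from key to a list of sub-cluster centroids. The names `cosine`, `score_document`, and `index` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_document(doc_tokens, keys, index):
    """Return (score, key, sub_cluster_position) for every sub-cluster reachable via the keys."""
    doc_vec = Counter(doc_tokens)
    scores = []
    for key in keys:
        # Keys with no first-layer cluster yet simply contribute no scores.
        for position, centroid in enumerate(index.get(key, [])):
            scores.append((cosine(doc_vec, centroid), key, position))
    return scores
```

The number of returned scores equals the total number of sub-clusters under the document's keys, which corresponds to the m × n results when every key has m sub-clusters.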
Further, step S05 specifically includes: from the m × n results, take the maximum value as the candidate result, and set a classification threshold empirically; the threshold range is [0.7, 1].
Further, the clustering in step S06 specifically includes: if r >= g, directly determine that document d belongs to the class where r was obtained, that is, d belongs to the event corresponding to that class.
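Steps S05 and S06 together amount to taking the best score and comparing it with the threshold g. The sketch below assumes the scores are (score, key, sub_cluster_position) tuples; the function name `pick_cluster` and the default threshold of 0.75 are illustrative choices within the stated range, not values prescribed by the text.

```python
def pick_cluster(scores, threshold=0.75):
    """Return the best (score, key, position) if it reaches the threshold, else None.

    None signals that no existing sub-cluster is close enough, so a new
    sub-cluster has to be created (step S07).
    """
    if not scores:
        return None
    best = max(scores, key=lambda item: item[0])
    return best if best[0] >= threshold else None


print(pick_cluster([(0.4, "a|b", 0), (0.9, "b|c", 1)]))
```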
Further, the creation of a new subclass in step S07 specifically includes: if r < g, use the n × m results produced in S04 to compute the mean cosine distance of each of the n clusters, obtaining n mean values; sort these n mean values and take the maximum; the corresponding first-layer cluster C is the first-layer cluster to which the document belongs; in cluster C, create a sub-cluster with document d as its centroid.
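The creation of a new sub-cluster in step S07 can be sketched as follows. The sketch assumes (score, key, sub_cluster_position) tuples as input and a dictionary index from key to a list of centroids. The text does not say what happens when none of the n keys has any sub-cluster yet; falling back to the first key in that case is purely an assumption, as is the function name `create_sub_cluster`.

```python
from collections import Counter, defaultdict


def create_sub_cluster(doc_tokens, keys, scores, index):
    """Add the document as a new centroid under the key with the highest mean score."""
    if not keys:
        raise ValueError("the document produced no first-layer keys")
    per_key = defaultdict(list)
    for score, key, _position in scores:
        per_key[key].append(score)
    means = {key: sum(values) / len(values) for key, values in per_key.items()}
    if means:
        target = max(means, key=means.get)
    else:
        target = keys[0]  # assumption: no candidate cluster exists yet
    # The new sub-cluster's centroid is the document's own bag-of-words vector.
    index.setdefault(target, []).append(dict(Counter(doc_tokens)))
    return target
```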
The beneficial effects of the invention are: the method classifies different news events quickly, so that the distribution of news events is concentrated and readers can readily find the news that interests them.
Brief description of the drawings
Fig. 1 is a system structure diagram of the invention;
Fig. 2 is a flow chart of the method of the invention.
Detailed description
The technical solution of the invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to the following.
As shown in Fig. 1,
A fast news event clustering system, including:
News crawling module: crawls news documents from news portals, forums, and microblogs, and performs preliminary deduplication on the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: builds permutations and combinations of the segmented words, maps document d to first-layer clusters, computes the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates a new sub-cluster;
Data storage module: stores the computed results.
As shown in Fig. 2:
A fast news event clustering method, comprising the following steps:
S01: Crawl text d and deduplicate documents;
S02: Extract the text title, segment the title into words, and retain only nouns and verbs;
S03: Build permutations and combinations of the title words to obtain n combinations; each combination serves as a key of a first-layer cluster;
S04: Within each first-layer cluster (major class), compute the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, this produces m results;
S05: Sort the m × n results and take the largest value, denoted r; at the same time set an empirical threshold g in the range [0.75, 1];
S06: If r >= g, text d belongs to the class where r was obtained;
S07: If r < g, create a new subclass: from the sub-cluster results, compute the mean cosine distance of each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a subclass with document d as its centroid;
Finally, output the class to which document d belongs.
Further, the text processing in step S02 specifically includes extracting the text title, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
Further, the first-layer clustering in step S03 specifically includes building permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a string with a separator, and the resulting string is the key of a first-layer cluster. For text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
Further, the processing in step S04 specifically includes: for the n keys obtained in S03, retrieve n cluster results (first-layer cluster results); for each cluster result, assuming m sub-clusters already exist (second-layer clusters), compute the similarity between the word segmentation result of document d and the centroid of each sub-cluster. The similarity algorithm includes but is not limited to the cosine distance algorithm. This step outputs m × n result values.
Further, step S05 specifically includes: from the m × n results, take the maximum value as the candidate result, and set a classification threshold empirically; the threshold range is [0.7, 1].
Further, the clustering in step S06 specifically includes: if r >= g, directly determine that document d belongs to the class where r was obtained, that is, d belongs to the event corresponding to that class.
Further, the creation of a new subclass in step S07 specifically includes: if r < g, use the n × m results produced in S04 to compute the mean cosine distance of each of the n clusters, obtaining n mean values; sort these n mean values and take the maximum; the corresponding first-layer cluster C is the first-layer cluster to which the document belongs; in cluster C, create a sub-cluster with document d as its centroid.
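The flow of Fig. 2 from S03 to S07 can be put together in one small online clusterer, sketched below under stated assumptions: the input is a list of title words that have already been deduplicated, segmented, and filtered (S01 and S02); keys are all sorted combinations of at least two distinct words; similarity is cosine similarity over word counts; centroids are not updated after creation; and a document whose keys have no sub-clusters yet is stored under its first key. None of these choices is specified by the text, and the class name `EventClusterer` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class EventClusterer:
    def __init__(self, threshold=0.75, min_size=2, separator="|"):
        self.threshold = threshold  # empirical threshold g
        self.min_size = min_size    # assumed minimum combination size
        self.separator = separator
        self.index = {}             # first-layer key -> list of sub-cluster centroids

    def keys_for(self, tokens):
        """S03: sorted word combinations joined into first-layer keys."""
        unique = sorted(set(tokens))
        return [self.separator.join(combo)
                for size in range(self.min_size, len(unique) + 1)
                for combo in combinations(unique, size)]

    def add(self, tokens):
        """Cluster one document; return (first_layer_key, sub_cluster_position)."""
        keys = self.keys_for(tokens)
        if not keys:
            raise ValueError("too few distinct words to build a key")
        doc = Counter(tokens)
        # S04: score the document against every sub-cluster under its keys.
        scores = [(cosine(doc, centroid), key, pos)
                  for key in keys
                  for pos, centroid in enumerate(self.index.get(key, []))]
        if scores:
            best_score, best_key, best_pos = max(scores, key=lambda s: s[0])
            if best_score >= self.threshold:       # S05 and S06
                return best_key, best_pos
            per_key = {}
            for score, key, _pos in scores:
                per_key.setdefault(key, []).append(score)
            # S07: choose the first-layer cluster with the highest mean score.
            target = max(per_key, key=lambda k: sum(per_key[k]) / len(per_key[k]))
        else:
            target = keys[0]  # assumption: nothing to compare against yet
        self.index.setdefault(target, []).append(dict(doc))
        return target, len(self.index[target]) - 1


clusterer = EventClusterer()
first = clusterer.add(["typhoon", "coast", "hits"])
second = clusterer.add(["hits", "typhoon", "coast"])        # same words, same event
third = clusterer.add(["election", "results", "announced"])
print(first, second, third)
```

One consequence of the fallback assumption is that a new event is registered under a single key only, so a later title that shares just some of its words may not reach it. Registering the first sub-cluster under several keys would be a different design choice that the text neither confirms nor rules out.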
The above is only a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be modified within the scope contemplated herein through the above teachings or the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the protection scope of the appended claims.
Claims (8)
1. A fast news event clustering system, characterized by including:
News crawling module: crawls news documents from news portals, forums, and microblogs, and performs preliminary deduplication on the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: builds permutations and combinations of the segmented words, maps document d to first-layer clusters, computes the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates a new sub-cluster;
Data storage module: stores the computed results.
2. A fast news event clustering method, characterized by comprising the following steps:
S01: Crawl text d and deduplicate documents;
S02: Extract the text title, segment the title into words, and retain only nouns and verbs;
S03: Build permutations and combinations of the title words to obtain n combinations; each combination serves as a key of a first-layer cluster;
S04: Within each first-layer cluster (major class), compute the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, this produces m results;
S05: Sort the m × n results and take the largest value, denoted r; at the same time set an empirical threshold g in the range [0.75, 1];
S06: If r >= g, text d belongs to the class where r was obtained;
S07: If r < g, create a new subclass: from the sub-cluster results, compute the mean cosine distance of each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a subclass with document d as its centroid.
3. The fast news event clustering method according to claim 2, characterized in that: the text processing in step S02 specifically includes extracting the text title, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
4. The fast news event clustering method according to claim 2, characterized in that: the first-layer clustering in step S03 specifically includes building permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a string with a separator, and the resulting string is the key of a first-layer cluster; for text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
5. The fast news event clustering method according to claim 2, characterized in that: the processing in step S04 specifically includes: for the n keys obtained in S03, retrieve n cluster results (first-layer cluster results); for each cluster result, assuming m sub-clusters already exist (second-layer clusters), compute the similarity between the word segmentation result of document d and the centroid of each sub-cluster; the similarity algorithm includes but is not limited to the cosine distance algorithm; this step outputs m × n result values.
6. The fast news event clustering method according to claim 2, characterized in that: step S05 specifically includes: from the m × n results, take the maximum value as the candidate result, and set a classification threshold empirically; the threshold range is [0.7, 1].
7. The fast news event clustering method according to claim 2, characterized in that: the clustering in step S06 specifically includes: if r >= g, directly determine that document d belongs to the class where r was obtained, that is, d belongs to the event corresponding to that class.
8. The fast news event clustering method according to claim 2, characterized in that: the creation of a new subclass in step S07 specifically includes: if r < g, use the n × m results produced in S04 to compute the mean cosine distance of each of the n clusters, obtaining n mean values; sort these n mean values and take the maximum; the corresponding first-layer cluster C is the first-layer cluster to which the document belongs; in cluster C, create a sub-cluster with document d as its centroid.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710309001.8A CN107145568A (en) | 2017-05-04 | 2017-05-04 | A kind of quick media event clustering system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107145568A true CN107145568A (en) | 2017-09-08 |
Family
ID=59775358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710309001.8A Pending CN107145568A (en) | 2017-05-04 | 2017-05-04 | A kind of quick media event clustering system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145568A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334628A (en) * | 2018-02-23 | 2018-07-27 | 北京东润环能科技股份有限公司 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
CN110019806A (en) * | 2017-12-25 | 2019-07-16 | 中国移动通信集团公司 | A kind of document clustering method and equipment |
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | A kind of extensive similar quick method for normalizing of headline |
CN111538839A (en) * | 2020-05-25 | 2020-08-14 | 武汉烽火普天信息技术有限公司 | Real-time text clustering method based on Jacobsard distance |
CN113515624A (en) * | 2021-04-28 | 2021-10-19 | 乐山师范学院 | Text classification method for emergency news |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462360A (en) * | 2014-12-05 | 2015-03-25 | 北京奇虎科技有限公司 | Semantic identifier generating method and device for text set |
CN104778209A (en) * | 2015-03-13 | 2015-07-15 | 国家计算机网络与信息安全管理中心 | Opinion mining method for ten-million-scale news comments |
CN105095434A (en) * | 2015-07-23 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Recognition method and device for timeliness requirement |
CN106599072A (en) * | 2016-11-21 | 2017-04-26 | 东软集团股份有限公司 | Text clustering method and device |
Non-Patent Citations (2)
Title |
---|
丁晓庆: "微博热点话题发现与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
钟文波: "搜索引擎中关键词分类方法评估及推荐应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019806A (en) * | 2017-12-25 | 2019-07-16 | 中国移动通信集团公司 | A kind of document clustering method and equipment |
CN110019806B (en) * | 2017-12-25 | 2021-08-06 | 中移动信息技术有限公司 | Document clustering method and device |
CN108334628A (en) * | 2018-02-23 | 2018-07-27 | 北京东润环能科技股份有限公司 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | A kind of extensive similar quick method for normalizing of headline |
CN110245275B (en) * | 2019-06-18 | 2023-09-01 | 中电科大数据研究院有限公司 | Large-scale similar news headline rapid normalization method |
CN111538839A (en) * | 2020-05-25 | 2020-08-14 | 武汉烽火普天信息技术有限公司 | Real-time text clustering method based on Jacobsard distance |
CN113515624A (en) * | 2021-04-28 | 2021-10-19 | 乐山师范学院 | Text classification method for emergency news |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |