CN107145568A - A fast news event clustering system and method - Google Patents

A fast news event clustering system and method

Info

Publication number
CN107145568A
Authority
CN
China
Prior art keywords
cluster
document
result
text
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710309001.8A
Other languages
Chinese (zh)
Inventor
余军
卢品吟
刘盾
张汨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Hua Seiun Technology Co Ltd
Original Assignee
Chengdu Hua Seiun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Hua Seiun Technology Co Ltd
Priority to CN201710309001.8A
Publication of CN107145568A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Abstract

The invention discloses a fast news event clustering system and method. The system includes: a news crawling module, which captures news documents from news portals, forums and microblogs and performs preliminary de-duplication of the text; a news text preprocessing module, which performs preliminary text feature processing, including word segmentation, stop-word removal and additional marking of negation phrases; a news text event clustering module, which enumerates permutations and combinations of the segmented words, maps document d to first-layer clusters, calculates the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates new sub-clusters; and a data storage module, which stores the computed results. The invention can cluster news events quickly in scenarios with a large volume of public opinion data.

Description

A fast news event clustering system and method
Technical field
The present invention relates to the field of news information, and in particular to a fast news event clustering system and method.
Background
With the rapid development of the Internet, online public opinion has a growing influence on society. Whether for government monitoring of online public opinion or for enterprises carrying out brand communication and brand public relations, public opinion analysis urgently needs a way to analyze the sentiment orientation of a large volume of public opinion quickly, so as to provide timely decision support and opinion guidance and to respond to a rapidly changing public opinion environment. Conventional sentiment analysis requires complex analysis and cannot achieve low-latency processing when handling a large volume of public opinion data.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a fast news event clustering system and method that can cluster news events quickly in scenarios with a large volume of public opinion data.
The object of the present invention is achieved through the following technical solutions:
A fast news event clustering system, including:
News crawling module: captures news documents from news portals, forums and microblogs, and performs preliminary de-duplication of the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: enumerates permutations and combinations of the segmented words, maps document d to first-layer clusters, calculates the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates new sub-clusters;
Data storage module: stores the computed results.
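The first two modules can be illustrated with a minimal Python sketch. The patent does not specify how duplicates are detected or how negation phrases are marked, so the hash-based check, the `NEG_` prefix, the word lists, and the function names `is_duplicate` and `preprocess` are all assumptions made for illustration only.

```python
import hashlib

# Hypothetical word lists; the patent does not enumerate stop words or negation words.
STOP_WORDS = {"the", "a", "of", "in"}
NEGATION_WORDS = {"not", "no", "never"}


def is_duplicate(text, seen_hashes):
    """Return True if an identical text was already seen; otherwise record it."""
    # Collapse whitespace and case so trivially different copies hash the same.
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


def preprocess(tokens):
    """Drop stop words and mark each negation word together with the token after it."""
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i].lower()
        if tok in NEGATION_WORDS and i + 1 < len(tokens):
            # Assumed marking scheme: fuse the negation with the next token.
            result.append("NEG_" + tokens[i + 1].lower())
            i += 2
            continue
        if tok not in STOP_WORDS:
            result.append(tok)
        i += 1
    return result
```

An exact hash only catches identical copies; a production crawler would probably need a near-duplicate technique, which the patent leaves open.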
A fast news event clustering method, comprising the following steps:
S01: Capture text d and de-duplicate documents;
S02: Extract the title of the text, segment the title into words, and retain only nouns and verbs;
S03: Enumerate permutations and combinations of the title words to obtain n combinations; each combination serves as a key of the first-layer clustering;
S04: Within each major class (first-layer cluster), calculate the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, m results are produced;
S05: Sort the m × n results produced and take the one with the largest value, denoted r; also set an empirical threshold g, whose range is [0.75, 1];
S06: If r >= g, text d belongs to the class in which r was obtained;
S07: If r < g, create a new sub-class: from the sub-cluster results, calculate the mean cosine distance for each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a sub-class with document d as its centroid.
Further, the text processing in step S02 specifically includes extracting the title of the text, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
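A minimal sketch of the part-of-speech filter in S02 follows. It assumes the title has already been segmented and tagged by some tokenizer, which the patent does not name; the `(word, tag)` input format, the tag prefixes `n` and `v`, and the function name `filter_title_tokens` are hypothetical.

```python
# Assumed tag scheme: tags starting with "n" denote nouns, tags starting with "v" denote verbs.
KEPT_TAG_PREFIXES = ("n", "v")


def filter_title_tokens(tagged_tokens):
    """Keep only noun and verb tokens from a list of (word, tag) pairs."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEPT_TAG_PREFIXES)]


title = [("earthquake", "n"), ("strikes", "v"), ("coastal", "a"), ("city", "n")]
print(filter_title_tokens(title))  # ['earthquake', 'strikes', 'city']
```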
Further, the first-layer clustering method in step S03 specifically includes enumerating permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a character string with a separator, and the resulting string is the key of a first-layer cluster. For text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
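The key construction in S03 might look like the following sketch. Because the words of each combination are sorted before joining, different orderings of the same words collapse to one key, so the sketch enumerates combinations only. The separator `|`, the `min_size` parameter, and the function name `first_layer_keys` are assumptions; the patent does not state which combination sizes are used.

```python
from itertools import combinations

SEPARATOR = "|"  # hypothetical separator character


def first_layer_keys(tokens, min_size=2):
    """Build one first-layer key per combination of distinct title tokens."""
    # Sorting once up front means every combination is already in sorted order.
    unique = sorted(set(tokens))
    keys = []
    for size in range(min_size, len(unique) + 1):
        for combo in combinations(unique, size):
            keys.append(SEPARATOR.join(combo))
    return keys


print(first_layer_keys(["strikes", "earthquake", "city"]))
```

The number of keys grows exponentially with the number of title words, so a real system would likely cap the combination size; that cap is not described in the source.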
Further, the processing in step S04 specifically includes: for the n keys obtained in S03, retrieving n cluster results (first-layer cluster results); for each cluster result, assuming there are m existing sub-clusters (second-layer clusters), calculating the similarity between the word segmentation result of document d and the centroid of each sub-cluster. The similarity algorithm includes but is not limited to the cosine distance algorithm. This step outputs m × n result values.
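The scoring in S04 can be sketched as below. The patent speaks of "cosine distance" but then keeps the largest value and compares it with a threshold close to 1, so this sketch assumes the quantity meant is cosine similarity. Representing documents and centroids as term-frequency dictionaries, the `major_classes` layout (key mapped to a list of centroid vectors), and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_document(doc_tokens, major_classes):
    """Return (key, sub_cluster_index, similarity) for every sub-cluster centroid."""
    doc_vec = Counter(doc_tokens)
    scores = []
    for key, centroids in major_classes.items():
        for index, centroid in enumerate(centroids):
            scores.append((key, index, cosine_similarity(doc_vec, centroid)))
    return scores
```

Here `major_classes` is expected to hold only the first-layer clusters retrieved for the n keys of the document, so the comparison stays limited to a few candidate clusters rather than the whole corpus.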
Further, step S05 specifically includes taking the largest of the m × n output results as the candidate result and setting a classification threshold empirically; the range of the threshold is [0.7, 1].
Further, the clustering in step S06 specifically includes: if r >= g, directly determining that document d belongs to the class in which r was obtained, that is, d belongs to the event corresponding to that class.
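Steps S05 and S06 amount to taking a maximum and comparing it with a threshold, as in the sketch below. It assumes the similarity results are available as `(key, sub_cluster_index, similarity)` triples; that layout, the default value 0.75 for g, and the function name `assign_if_similar` are assumptions chosen for illustration.

```python
THRESHOLD_G = 0.75  # hypothetical empirical threshold g


def assign_if_similar(scores, threshold=THRESHOLD_G):
    """Return (key, sub_cluster_index) of the best match if r >= g, otherwise None."""
    if not scores:
        return None
    # r is the largest similarity among the m x n results.
    key, sub_index, r = max(scores, key=lambda item: item[2])
    if r >= threshold:
        return key, sub_index
    return None
```

Taking the maximum directly avoids the full sort mentioned in S05 while selecting the same candidate.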
Further, the process of creating a new sub-class in step S07 specifically includes: if r < g, using the n × m results produced in S04 to calculate the mean cosine distance for each of the n clusters, obtaining n mean values; sorting these n mean values and taking the maximum, whose corresponding first-layer cluster C is the first-layer cluster to which the document belongs; and creating in cluster C a sub-cluster with document d as its centroid.
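A sketch of the fallback in S07 is given below. It assumes the same `(key, sub_cluster_index, similarity)` triples and a dictionary that maps each key to a list of centroid vectors. The patent does not say what happens when a key has no sub-clusters yet; treating its mean as 0.0 is an assumption, as is the function name `create_sub_cluster`.

```python
from collections import Counter


def create_sub_cluster(doc_tokens, keys, scores, major_classes):
    """Add a sub-cluster centred on the document to the major class with the highest mean similarity."""
    per_key = {key: [] for key in keys}
    for key, _sub_index, similarity in scores:
        per_key[key].append(similarity)
    # Assumption: a major class without sub-clusters gets a mean of 0.0.
    means = {key: (sum(vals) / len(vals) if vals else 0.0) for key, vals in per_key.items()}
    best_key = max(means, key=means.get)  # requires at least one key
    # The document itself becomes the centroid of the new sub-cluster.
    major_classes.setdefault(best_key, []).append(Counter(doc_tokens))
    return best_key
```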
The beneficial effects of the invention are: the method of the present invention can quickly classify different news events so that the news about each event is grouped together, making it easy for readers to find the news they are interested in.
Brief description of the drawings
Fig. 1 is a system structure diagram of the invention;
Fig. 2 is a flow chart of the method of the invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the protection scope of the present invention is not limited to the following description.
As shown in Fig. 1,
A fast news event clustering system, including:
News crawling module: captures news documents from news portals, forums and microblogs, and performs preliminary de-duplication of the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: enumerates permutations and combinations of the segmented words, maps document d to first-layer clusters, calculates the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates new sub-clusters;
Data storage module: stores the computed results.
As shown in Fig. 2:
A fast news event clustering method, comprising the following steps:
S01: Capture text d and de-duplicate documents;
S02: Extract the title of the text, segment the title into words, and retain only nouns and verbs;
S03: Enumerate permutations and combinations of the title words to obtain n combinations; each combination serves as a key of the first-layer clustering;
S04: Within each major class (first-layer cluster), calculate the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, m results are produced;
S05: Sort the m × n results produced and take the one with the largest value, denoted r; also set an empirical threshold g, whose range is [0.75, 1];
S06: If r >= g, text d belongs to the class in which r was obtained;
S07: If r < g, create a new sub-class: from the sub-cluster results, calculate the mean cosine distance for each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a sub-class with document d as its centroid;
Finally, output the class to which document d belongs.
Further, the text processing in step S02 specifically includes extracting the title of the text, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
Further, the first-layer clustering method in step S03 specifically includes enumerating permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a character string with a separator, and the resulting string is the key of a first-layer cluster. For text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
Further, the processing in step S04 specifically includes: for the n keys obtained in S03, retrieving n cluster results (first-layer cluster results); for each cluster result, assuming there are m existing sub-clusters (second-layer clusters), calculating the similarity between the word segmentation result of document d and the centroid of each sub-cluster. The similarity algorithm includes but is not limited to the cosine distance algorithm. This step outputs m × n result values.
Further, step S05 specifically includes taking the largest of the m × n output results as the candidate result and setting a classification threshold empirically; the range of the threshold is [0.7, 1].
Further, the clustering in step S06 specifically includes: if r >= g, directly determining that document d belongs to the class in which r was obtained, that is, d belongs to the event corresponding to that class.
Further, the process of creating a new sub-class in step S07 specifically includes: if r < g, using the n × m results produced in S04 to calculate the mean cosine distance for each of the n clusters, obtaining n mean values; sorting these n mean values and taking the maximum, whose corresponding first-layer cluster C is the first-layer cluster to which the document belongs; and creating in cluster C a sub-cluster with document d as its centroid.
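Steps S03 to S07 of the embodiment can be tied together in one small end-to-end sketch. Everything below is an illustrative reading of the text, not the patented implementation: the class name `EventClusterer`, the fixed combination size `key_size`, the `|` separator, the in-memory dictionary used as storage, and the choice to leave a centroid unchanged when a document joins an existing sub-cluster are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class EventClusterer:
    def __init__(self, threshold=0.75, key_size=2):
        self.threshold = threshold  # empirical threshold g
        self.key_size = key_size    # assumed number of words per key
        self.major_classes = {}     # key -> list of sub-cluster centroids

    def keys_for(self, tokens):
        """S03: sorted word combinations joined into first-layer keys."""
        unique = sorted(set(tokens))
        size = min(self.key_size, len(unique))
        return ["|".join(combo) for combo in combinations(unique, size)]

    def add(self, tokens):
        """Cluster one filtered title and return (key, sub_cluster_index)."""
        doc = Counter(tokens)
        keys = self.keys_for(tokens)
        # S04: similarity to every sub-cluster centroid under the n keys.
        scores = [(key, i, cosine(doc, centroid))
                  for key in keys
                  for i, centroid in enumerate(self.major_classes.get(key, []))]
        # S05 and S06: take the best match and accept it if r >= g.
        if scores:
            key, index, r = max(scores, key=lambda s: s[2])
            if r >= self.threshold:
                return key, index
        # S07: choose the major class with the highest mean similarity
        # (0.0 when it has no sub-clusters yet) and add d as a new centroid.
        means = {}
        for key in keys:
            sims = [s for k, _, s in scores if k == key]
            means[key] = sum(sims) / len(sims) if sims else 0.0
        target = max(means, key=means.get)
        subs = self.major_classes.setdefault(target, [])
        subs.append(doc)
        return target, len(subs) - 1


clusterer = EventClusterer()
first = clusterer.add(["earthquake", "city", "strikes"])
second = clusterer.add(["strikes", "earthquake", "city"])
third = clusterer.add(["election", "results", "announced"])
print(first == second, first == third)
```

In this example the second title contains the same words as the first and lands in the same sub-cluster, while the third title shares no key with the first two and starts a new one.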
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications and environments, and can be modified within the scope contemplated herein through the above teachings or the technology or knowledge of related fields. Changes and modifications made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims of the present invention.

Claims (8)

1. A fast news event clustering system, characterized by comprising:
News crawling module: captures news documents from news portals, forums and microblogs, and performs preliminary de-duplication of the text;
News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;
News text event clustering module: enumerates permutations and combinations of the segmented words, maps document d to first-layer clusters, calculates the distance between document d and each sub-cluster, determines the cluster to which document d belongs, and creates new sub-clusters;
Data storage module: stores the computed results.
2. A fast news event clustering method, characterized by comprising the following steps:
S01: Capture text d and de-duplicate documents;
S02: Extract the title of the text, segment the title into words, and retain only nouns and verbs;
S03: Enumerate permutations and combinations of the title words to obtain n combinations; each combination serves as a key of the first-layer clustering;
S04: Within each major class (first-layer cluster), calculate the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, m results are produced;
S05: Sort the m × n results produced and take the one with the largest value, denoted r; also set an empirical threshold g, whose range is [0.75, 1];
S06: If r >= g, text d belongs to the class in which r was obtained;
S07: If r < g, create a new sub-class: from the sub-cluster results, calculate the mean cosine distance for each major class to obtain n values, sort them, and take the maximum; if the major class corresponding to the maximum is C, create in C a sub-class with document d as its centroid.
3. The fast news event clustering method according to claim 2, characterized in that: the text processing in step S02 specifically includes extracting the title of the text, segmenting it into words, and filtering the words by part of speech so that only nouns and verbs are retained.
4. The fast news event clustering method according to claim 2, characterized in that: the first-layer clustering method in step S03 specifically includes enumerating permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and joined into a character string with a separator, and the resulting string is the key of a first-layer cluster; for text d, any of the major classes corresponding to the n keys may be the major class to which it belongs.
5. The fast news event clustering method according to claim 2, characterized in that: the processing in step S04 specifically includes: for the n keys obtained in S03, retrieving n cluster results (first-layer cluster results); for each cluster result, assuming there are m existing sub-clusters (second-layer clusters), calculating the similarity between the word segmentation result of document d and the centroid of each sub-cluster, where the similarity algorithm includes but is not limited to the cosine distance algorithm; this step outputs m × n result values.
6. The fast news event clustering method according to claim 2, characterized in that: step S05 specifically includes taking the largest of the m × n output results as the candidate result and setting a classification threshold empirically; the range of the threshold is [0.7, 1].
7. The fast news event clustering method according to claim 2, characterized in that: the clustering in step S06 specifically includes: if r >= g, directly determining that document d belongs to the class in which r was obtained, that is, d belongs to the event corresponding to that class.
8. The fast news event clustering method according to claim 2, characterized in that: the process of creating a new sub-class in step S07 specifically includes: if r < g, using the n × m results produced in S04 to calculate the mean cosine distance for each of the n clusters, obtaining n mean values; sorting these n mean values and taking the maximum, whose corresponding first-layer cluster C is the first-layer cluster to which the document belongs; and creating in cluster C a sub-cluster with document d as its centroid.
CN201710309001.8A 2017-05-04 2017-05-04 A kind of quick media event clustering system and method Pending CN107145568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710309001.8A CN107145568A (en) 2017-05-04 2017-05-04 A kind of quick media event clustering system and method

Publications (1)

Publication Number Publication Date
CN107145568A true CN107145568A (en) 2017-09-08

Family

ID=59775358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710309001.8A Pending CN107145568A (en) 2017-05-04 2017-05-04 A kind of quick media event clustering system and method

Country Status (1)

Country Link
CN (1) CN107145568A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170908