CN107145568A

CN107145568A - Rapid news event clustering system and method

Info

Publication number: CN107145568A
Application number: CN201710309001.8A
Authority: CN
Inventors: 余军; 卢品吟; 刘盾; 张汨
Original assignee: Chengdu Hua Seiun Technology Co Ltd
Current assignee: Chengdu Hua Seiun Technology Co Ltd
Priority date: 2017-05-04
Filing date: 2017-05-04
Publication date: 2017-09-08

Abstract

The invention discloses a rapid news event clustering system and method, comprising: a news capture module, for capturing news documents from news portals, forums and microblogs, including preliminary deduplication of the text; a news text preprocessing module, for preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases; a news text event clustering module, whose operations include generating permutations and combinations of the segmented words, mapping document d to first-layer clusters, computing the distance between document d and each sub-cluster, determining the cluster to which document d belongs, and creating new sub-clusters; and a data storage module, which stores the computed results. The invention can cluster news events quickly in scenarios with large volumes of public opinion data.

Description

A rapid news event clustering system and method

Technical field

The present invention relates to the field of domestic news, and in particular to a rapid news event clustering system and method.

Background

With the rapid development of the internet, online public opinion has an ever greater influence on society. Whether for government monitoring of online public opinion or for enterprises conducting brand communication and brand public relations, an urgent problem in public opinion analysis is how to analyze the sentiment orientation of public opinion quickly when its volume is large, so as to provide timely decision support and opinion guidance and to respond to a rapidly changing public opinion environment. Conventional sentiment analysis requires complex analysis and cannot achieve low-latency processing when faced with large volumes of public opinion data.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and to provide a rapid news event clustering system and method that can cluster news events quickly in scenarios with large volumes of public opinion data.

The object of the invention is achieved through the following technical solution:

A rapid news event clustering system, comprising:

News capture module: captures news documents from news portals, forums and microblogs, including preliminary deduplication of the text;

News text preprocessing module: performs preliminary text feature processing, including word segmentation, stop-word removal, and additional marking of negation phrases;

News text event clustering module: its operations include generating permutations and combinations of the segmented words, mapping document d to first-layer clusters, computing the distance between document d and each sub-cluster, determining the cluster to which document d belongs, and creating new sub-clusters;

Data storage module: stores the computed results.
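The preliminary deduplication performed by the news capture module can be sketched as follows. This is a minimal illustration that assumes exact-duplicate detection by hashing whitespace-normalized text; the source does not specify the deduplication method, and the names `normalize` and `Deduplicator` are hypothetical.

```python
import hashlib


def normalize(text):
    """Collapse whitespace so trivially different copies hash identically."""
    return " ".join(text.split())


class Deduplicator:
    """Remembers fingerprints of documents already seen (assumed approach)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a document is seen, False afterwards."""
        fingerprint = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

A hash of the normalized text only catches verbatim reposts; near-duplicate detection would need a different technique, which the source leaves open.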

A rapid news event clustering method, comprising the following steps:

S01: Capture text d and deduplicate documents;

S02: Extract the title of the text, segment the title into words, and retain only nouns and verbs;

S03: Generate permutations and combinations of the title words to obtain n combinations; each combination serves as the key of a first-layer cluster;

S04: Within each first-layer cluster (major class), compute the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, m results are produced;

S05: Sort the m × n results produced and take the one with the largest value, denoted r; at the same time set an empirical threshold g, whose value lies in [0.75, 1];

S06: If r >= g, text d is assigned to the class in which r was obtained;

S07: If r < g, create a new sub-class: from the sub-cluster results, compute the mean cosine distance of each major class to obtain n values, sort them, and take the maximum; letting C be the major class corresponding to the maximum, create in C a sub-class with document d as its centroid.

Further, the text processing in step S02 specifically comprises extracting the title of the text, performing word segmentation, and filtering the words by part of speech so that only nouns and verbs are retained.
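The part-of-speech filtering of step S02 can be illustrated with a short sketch. It assumes the title has already been segmented and tagged by some external segmenter, and that noun tags begin with "n" and verb tags with "v"; both the tagging convention and the function name `filter_title_tokens` are assumptions for illustration, not part of the source.

```python
def filter_title_tokens(tagged_tokens):
    """Keep only nouns and verbs from (word, pos_tag) pairs.

    Assumes tags starting with "n" are nouns and tags starting with "v"
    are verbs; this convention is an assumption, not given in the source.
    """
    return [word for word, tag in tagged_tokens if tag.startswith(("n", "v"))]


# Hypothetical, pre-segmented title: "city government releases new policy"
title = [("city", "n"), ("government", "n"), ("releases", "v"),
         ("new", "a"), ("policy", "n")]
print(filter_title_tokens(title))
```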

Further, the first-layer clustering in step S03 specifically comprises generating permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and then joined into a string using a separator, and the resulting string is the key of a first-layer cluster. For text d, each of the major classes corresponding to the n keys may be the major class to which it belongs.
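The key generation of step S03 might look like the sketch below. Because the words of each combination are sorted before joining, different orderings of the same words collapse to one key, so the sketch enumerates combinations only. The minimum combination size `min_size`, the separator `sep`, and the function name `first_layer_keys` are illustrative assumptions; the source does not fix them.

```python
from itertools import combinations


def first_layer_keys(words, min_size=2, sep="|"):
    """Build first-layer cluster keys from filtered title words.

    Every combination of at least `min_size` distinct words is sorted and
    joined with `sep`. `min_size` and `sep` are illustrative assumptions.
    """
    unique = sorted(set(words))
    keys = []
    for size in range(min_size, len(unique) + 1):
        # combinations() preserves input order, so each combo stays sorted.
        for combo in combinations(unique, size):
            keys.append(sep.join(combo))
    return keys
```

Since the number of combinations grows exponentially with the number of words, restricting the input to the nouns and verbs of the title keeps n small.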

Further, the processing in step S04 specifically comprises: for the n keys obtained in S03, retrieving n clustering results (first-layer clustering results); for each clustering result, assuming it already contains m sub-clusters (second-layer clusters), computing the similarity between the word segmentation result of document d and the centroid of each sub-cluster. The similarity algorithm includes but is not limited to the cosine distance algorithm. This step outputs m × n result values.
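The scoring of step S04 can be sketched as follows, under the assumption that documents and centroids are represented as term-frequency vectors. The value is treated as a similarity (larger means closer), which is how the threshold comparison in S05 and S06 uses it. The `index` structure and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_document(doc_words, keys, index):
    """Compare a document with every sub-cluster centroid under its keys.

    `index` maps a first-layer key to a list of sub-cluster centroids
    (each a Counter). Returns (key, sub_cluster_position, similarity) triples.
    """
    doc_vec = Counter(doc_words)
    results = []
    for key in keys:
        # Keys with no existing first-layer cluster contribute no results.
        for pos, centroid in enumerate(index.get(key, [])):
            results.append((key, pos, cosine_similarity(doc_vec, centroid)))
    return results
```

Looking up candidate clusters by key means the document is compared only with sub-clusters that share title words with it, rather than with every cluster in the store.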

Further, step S05 specifically comprises taking the largest of the m × n output results as the candidate result and setting a classification threshold empirically; the range of the threshold is [0.7, 1].

Further, the clustering in step S06 specifically comprises: if r >= g, directly determining that document d belongs to the class in which r was obtained, that is, d belongs to the event corresponding to that class.
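Steps S05 and S06 together amount to a maximum followed by a threshold test, as in this sketch. The result triples `(key, sub_cluster_position, similarity)` and the function name `best_match` are assumptions for illustration; the default threshold of 0.75 is simply one value that lies inside the ranges stated in the text.

```python
def best_match(results, threshold=0.75):
    """Pick the highest-scoring sub-cluster and test it against threshold g.

    `results` holds (key, sub_cluster_position, similarity) triples.
    Returns the winning (key, position) if r >= g, otherwise None.
    """
    if not results:
        return None
    key, pos, r = max(results, key=lambda item: item[2])
    return (key, pos) if r >= threshold else None
```

A return value of `None` corresponds to the case r < g, in which a new sub-cluster has to be created.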

Further, the creation of a new sub-class in step S07 specifically comprises: if r < g, using the n × m results produced in S04 to compute the mean cosine distance of each of the n clusters, obtaining n averages; sorting these n averages and taking the maximum, whose corresponding first-layer cluster C is the first-layer cluster to which the document belongs; and creating in cluster C a sub-cluster with document d as its centroid.
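The creation of a new sub-cluster in step S07 can be sketched as below. The sketch assumes the scoring results are available as `(key, sub_cluster_position, similarity)` triples and that first-layer clusters are stored in a dictionary `index` of centroid lists; these structures and the function name `create_sub_cluster` are hypothetical. The source does not say what happens when no sub-cluster exists yet under any key, so the sketch raises an error in that case.

```python
from collections import Counter, defaultdict


def create_sub_cluster(doc_words, results, index):
    """Open a new sub-cluster with the document as centroid.

    `results` holds (key, sub_cluster_position, similarity) triples from the
    scoring step; `index` maps first-layer keys to lists of centroids.
    Returns (key, position) of the new sub-cluster.
    """
    per_key = defaultdict(list)
    for key, _pos, sim in results:
        per_key[key].append(sim)
    if not per_key:
        # Not covered by the source: nothing to average over.
        raise ValueError("no existing sub-clusters to compare against")
    # Mean similarity of the document to each first-layer cluster.
    means = {key: sum(sims) / len(sims) for key, sims in per_key.items()}
    target = max(means, key=means.get)
    index.setdefault(target, []).append(Counter(doc_words))
    return target, len(index[target]) - 1
```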

The beneficial effects of the invention are as follows: the method of the invention can quickly classify different news events so that news about the same event is grouped together, making it easy for readers to find the news that interests them.

Brief description of the drawings

Fig. 1 is a structural diagram of the system of the invention;

Fig. 2 is a flow chart of the method of the invention.

Detailed description

The technical solution of the invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to the following.

As shown in Fig. 1,

A rapid news event clustering system, comprising:

News text event clustering module: its operations include generating permutations and combinations of the segmented words, mapping document d to first-layer clusters, computing the distance between document d and each sub-cluster, determining the cluster to which document d belongs, and creating new sub-clusters;

Data storage module: stores the computed results.

As shown in Fig. 2:

A rapid news event clustering method, comprising the following steps:

S01: Capture text d and deduplicate documents;

S06: If r >= g, text d is assigned to the class in which r was obtained;

S07: If r < g, create a new sub-class: from the sub-cluster results, compute the mean cosine distance of each major class to obtain n values, sort them, and take the maximum; letting C be the major class corresponding to the maximum, create in C a sub-class with document d as its centroid;

Finally, output the class to which document d belongs.

The above is only the preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications and environments, and may be modified within the scope contemplated herein through the above teachings or through the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims

1. A rapid news event clustering system, characterized by comprising:

Data storage module: stores the computed results.

2. A rapid news event clustering method, characterized by comprising the following steps:

S01: Capture text d and deduplicate documents;

S04: Within each first-layer cluster (major class), compute the cosine distance between the word segmentation result of the text and the centroid of each sub-cluster; assuming there are m sub-clusters, m results are produced;

S06: If r >= g, text d is assigned to the class in which r was obtained;

3. The rapid news event clustering method according to claim 2, characterized in that the text processing in step S02 specifically comprises extracting the title of the text, performing word segmentation, and filtering the words by part of speech so that only nouns and verbs are retained.

4. The rapid news event clustering method according to claim 2, characterized in that the first-layer clustering in step S03 specifically comprises generating permutations and combinations of the word segmentation result to obtain n combinations; within each combination the words are sorted and then joined into a string using a separator, and the resulting string is the key of a first-layer cluster; for text d, each of the major classes corresponding to the n keys may be the major class to which it belongs.

5. The rapid news event clustering method according to claim 2, characterized in that the processing in step S04 specifically comprises: for the n keys obtained in S03, retrieving n clustering results (first-layer clustering results); for each clustering result, assuming it already contains m sub-clusters (second-layer clusters), computing the similarity between the word segmentation result of document d and the centroid of each sub-cluster, wherein the similarity algorithm includes but is not limited to the cosine distance algorithm; this step outputs m × n result values.

6. The rapid news event clustering method according to claim 2, characterized in that step S05 specifically comprises taking the largest of the m × n output results as the candidate result and setting a classification threshold empirically, the range of the threshold being [0.7, 1].

7. The rapid news event clustering method according to claim 2, characterized in that the clustering in step S06 specifically comprises: if r >= g, directly determining that document d belongs to the class in which r was obtained, that is, d belongs to the event corresponding to that class.

8. The rapid news event clustering method according to claim 2, characterized in that the creation of a new sub-class in step S07 specifically comprises: if r < g, using the n × m results produced in S04 to compute the mean cosine distance of each of the n clusters, obtaining n averages; sorting these n averages and taking the maximum, whose corresponding first-layer cluster C is the first-layer cluster to which the document belongs; and creating in cluster C a sub-cluster with document d as its centroid.