CN108959484A - Multi-strategy media data filtering method and device for event detection

Info

Publication number: CN108959484A
Application number: CN201810645129.6A
Authority: CN
Inventors: 陈刚; 唐永旺; 魏晗; 席耀; 席耀一; 郭志刚; 袁江林
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2018-12-07
Anticipated expiration: 2038-06-21
Also published as: CN108959484B

Abstract

The present invention relates to a multi-strategy media data filtering method and device for event detection. The method comprises: an offline stage, in which a spam user database and an application-source blacklist are constructed from collected media user data; and an online recognition stage, in which, for a media data stream, media data are filtered using the spam user database and the application-source blacklist, non-event media data are filtered using media content and contextual features, online clustering is performed on the media data, event clusters are identified, and the media data within the event clusters are purified. The invention effectively reduces the influence of noise data and other non-event data in a microblog data stream on microblog event detection, removes most non-event posts from the stream, and effectively improves microblog event detection performance. It offers strong real-time performance and practicality, facilitates timely extraction of hot topics and emergencies, and provides important guidance for research on key technologies for new media data streams.

Description

Multi-strategy media data filtering method and device for event detection

Technical field

The invention belongs to the technical field of media data processing, and in particular relates to a multi-strategy media data filtering method and device for event detection.

Background art

As a typical representative of new media, the microblog is an important platform on which viewpoints can be published and information shared and disseminated conveniently and quickly. Owing to its convenience, real-time nature, and interactivity, the microblog has surpassed even traditional media and portal websites in reporting and spreading hot topics and major events of wide public concern, and has become an important information source for industries such as information acquisition, marketing, and public opinion monitoring. Microblog-oriented event detection technology can extract current social hot topics and major emergencies from massive microblog data, helping users follow the news and stay aware of major events happening around them. However, besides reports of hot news and emergencies, microblogs are flooded with useless information, including advertisements, daily trivia, online rumors, and spam automatically generated by servers. Distinguishing such spam from meaningful event posts has become one of the major challenges of event detection on microblog data streams. To address this problem, researchers have purified microblog data with various filtering strategies, which has cleaned the microblog data stream to some extent and improved event detection performance. However, the filtering strategies they use are rather simple, the purification effect is limited, and the purification effect cannot be evaluated.

Summary of the invention

To address the shortcomings of the prior art, the present invention provides a multi-strategy media data stream filtering method and device for event detection, which can remove most non-event posts from a microblog data stream, effectively improve microblog event detection performance, and better help users follow the news.

According to the design scheme provided by the present invention, a multi-strategy media data stream filtering method for event detection comprises the following:

An offline stage, in which a spam user database and an application-source blacklist are constructed from collected media user data;

An online recognition stage, in which, for a media data stream, media data are filtered using the spam user database and the application-source blacklist, non-event media data are filtered using media content and contextual features, online clustering is performed on the media data, event clusters are identified, and the media data in the event clusters are purified.

In the above, in the offline stage, for media data filtering based on user and source, users' personal social relationships and published media data are collected, user behavior features and media data content features are extracted, spam users are identified by supervised machine learning, and the spam user database and the application-source blacklist are constructed offline. Media data in the stream are then filtered directly by checking whether the publishing user exists in the spam user database or whether the data originate from a source on the application-source blacklist.

Preferably, the user behavior features include user reputation, forward rate, and activity. User reputation is obtained from the user's follower count, the number of accounts the user follows, the follower counts of the user's followers, and the numbers of accounts those followers follow. The forward rate is obtained from the proportion of forwarded items among a number of media data items published by the user. Activity is obtained from the number of days spanned by the user's published media data and the number of days since the user registered.

Preferably, the media data content features include a short-link feature, a hashtag feature, a post length feature, a post repetition feature, a post word diversity feature, and a forwarded-and-commented rate. The short-link feature is obtained from the proportion of items containing a URL among several media data items published by the user. The hashtag feature is obtained from the proportion of items containing a hot-topic hashtag among several media data items published by the user. The post length feature is calculated from the average length and length variance of several media data items published by the user. The post repetition feature is obtained from the average pairwise cosine similarity of several media data items published by the user. The post word diversity feature is obtained by collecting several media data items published by the user and using the number of distinct characters, the number of occurrences of each distinct character, and the ratio of those occurrences to the total number of characters in the media data. The forwarded-and-commented rate is obtained by computing a proportion based on the sum of forward and comment counts over several media data items published by the user.

In the above, in the online recognition stage, the media data stream is processed as follows: first, media data are filtered using the spam user database and the application-source blacklist; then, binary classification is performed on the media data using media content and contextual features to filter non-event media data; next, cluster analysis is performed on topically similar media data, cluster features are extracted, and event clusters are identified, where the cluster features include at least cluster time and cluster topic; finally, based on the topic consistency principle, the media data in the event clusters are cleaned to purify the media data.

In the above, filtering non-event media data comprises the following: first, the media data stream undergoes online clustering by unsupervised machine learning to obtain media clusters, and cluster features are extracted; then, a model is trained by supervised machine learning, and non-event media clusters are filtered using the trained model.

Preferably, the cluster features include topic features, social features, and temporal features. Topic features are obtained from the mean and variance of the cosine similarity between the media data items and the cluster center. Social features are obtained by computing the proportions of items in each media cluster that involve forwarding, commenting, replying, and mentioning. Temporal features are obtained from a frequency histogram generated by counting, in chronological order, the occurrences of high-frequency words in the media cluster.

Further, the temporal features include the following two kinds of feature: 1) High-frequency word expected deviation: the difference between the current occurrence count of each high-frequency word in the cluster and its expected count is computed and divided by the hourly number of media data items in the cluster, where the expected count is the mean of the counts over a historical time period; each high-frequency word is assigned a weight according to its occurrence frequency in the cluster, giving the weighted high-frequency word expected deviation of the cluster. 2) Degree of fit between the high-frequency word histogram distribution and an exponential function: because hot words in social networks exhibit an exponential distribution, least squares is used to fit an exponential function to the distribution histogram of the high-frequency words, and a statistic is calculated to measure the degree of fit between the histogram distribution and the exponential function.

In the above, purifying the media data comprises the following: word segmentation and stop-word removal are performed on the media data, and words whose frequency in the cluster exceeds a given threshold are selected as the cluster centroid; term weights are calculated from the term frequency-inverse document frequency of the posts, and the weights of the centroid words occurring in a single media data item are summed to obtain the similarity between that item and the cluster centroid; media data items whose similarity is below a specified threshold are removed from the cluster.

A multi-strategy media data stream filtering device for event detection comprises an offline training module, a filtering module, a clustering module, and a cleaning module, wherein:

The offline training module is configured to construct a spam user database and an application-source blacklist from collected media user data;

The filtering module is configured, for a media data stream, to first filter media data based on the spam user database and the application-source blacklist, and then to perform binary classification on the media data using media content and contextual features to filter non-event media data;

The clustering module is configured to perform online cluster analysis on topically similar media data, extract cluster features, and identify event clusters, where the cluster features include at least cluster time and cluster topic;

The cleaning module is configured to clean the media data in the event clusters based on the topic consistency principle, thereby purifying the media data.

Beneficial effects of the present invention:

Based on the behavior and content features of microblog users, the present invention constructs a spam user database offline and, using this database together with a spam application blacklist, filters out posts that come from spam users or blacklisted applications. It performs binary classification on posts using microblog content and contextual features to filter most non-event posts. It applies online clustering to topically similar posts and extracts multiple cluster features, such as time and topic, to identify event clusters. Based on the topic consistency principle, it removes low-quality posts from event clusters. The invention thereby effectively reduces the influence of noise data and other non-event data in the microblog data stream on microblog event detection, removes most non-event posts from the stream, and effectively improves microblog event detection performance. It offers strong real-time performance and practicality, facilitates timely extraction of current social hot topics and major emergencies, and provides important guidance for event monitoring technology on new media data streams.

Brief description of the drawings:

Fig. 1 is a schematic flowchart of the method in an embodiment;

Fig. 2 is a flowchart of media data stream filtering in the online recognition stage in an embodiment;

Fig. 3 is a schematic diagram of the device in an embodiment;

Fig. 4 is a diagram of the working principle of the device in an embodiment.

Detailed description of the embodiments:

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.

Existing techniques for filtering media data such as microblog content purify microblog data with various filtering strategies in order to improve microblog event detection. They clean the microblog data stream to some extent and improve event detection performance, but the filtering strategies they use are rather simple, the purification effect is limited, and the purification effect cannot be evaluated. To this end, an embodiment of the present invention, shown in Fig. 1, provides a multi-strategy media data filtering method for event detection, comprising the following:

Multi-strategy media data filtering for event detection comprises two stages, offline and online. The offline part discovers spam users and constructs the spam user database and the application-source blacklist. The online part filters the media data stream in real time, effectively reducing the influence of noise data and other non-event data in the microblog data stream on microblog event detection and improving event detection performance.

In the offline stage, the user's personal social relationship information and all posts published over a recent period are collected, two kinds of feature (user behavior and microblog content) are extracted, and spam users are discovered with a supervised machine learning algorithm. The user behavior features include the following:

(1) User reputation

Spam users and normal users differ considerably in the number of accounts they follow and the number of followers they have. Spam users, especially advertising users, often follow many accounts to widen their reach, but because their posts are of low quality they generally struggle to attract many followers; the ratio of follower count to followed-account count therefore reflects a user's reputation to some extent. In addition, a user's reputation is closely related to the quality of the user's followers: the followers of non-spam users generally have high reputation, whereas the followers of spam users are either few or of low quality (the followers may themselves be spam users). User reputation is defined as follows:

The definition uses the following quantities: the follower count of user u; the number of accounts that user u follows; the follower count of the i-th follower of user u; the number of accounts followed by the i-th follower of user u; and M, the number of followers of user u. The definition of user reputation thus incorporates the influence of follower quality on the user's own reputation.

(2) Forward rate

Spam users and normal users differ in how they forward other users' posts. Spam users, especially advertising users, tend to publish large numbers of original posts, whereas ordinary non-spam users publish original posts and also forward many posts from other users. According to actual needs, the proportion of forwarded posts among the user's 100 most recent posts may be defined as the forward rate.
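The forward rate described above can be sketched as a simple proportion. This is a minimal illustration, not the patent's implementation; the function name and the boolean-flag input are assumptions made for the example.

```python
from typing import Sequence


def forward_rate(is_forward_flags: Sequence[bool]) -> float:
    """Fraction of a user's recent posts that are forwards.

    `is_forward_flags` holds one boolean per recent post (the text
    suggests the 100 most recent); the input shape is an assumption.
    """
    if not is_forward_flags:
        return 0.0  # assumption: a user with no posts gets a rate of zero
    return sum(is_forward_flags) / len(is_forward_flags)


# Example: 3 forwards among 4 recent posts.
rate = forward_rate([True, False, True, True])
```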

(3) Recent activity

Newly registered spam accounts are easily identified by the microblog platform, so spam users tend to spread spam through accounts that have been registered for some time, which manifests as publishing a large amount of spam within a short period. The "recent activity" index characterizes this behavior and is defined as follows:

The definition uses, according to actual needs, the number of days spanned by the 100 most recent original posts of user u and the number of days since user u registered. For dormant accounts that suddenly become active (for example, ordinary user accounts taken over by account thieves), the recent activity value is high; for users who have always been very active, and for ordinary users who have long been inactive, the value is low.

The microblog content features include the following:

(1) Short-link feature

Microblog posts are limited to 140 characters. To include more information within this limit, spam users often add short links to their posts to spread spam, whereas normal users seldom add links to original posts (news and media accounts are an exception). The number of posts containing a URL among the user's recent posts is counted, and its ratio to the total number of the user's recent posts is defined as the short-link feature of the user's posts; according to actual needs, the user's 100 most recent posts may be used.

(2) Hashtag feature

A microblog uses a hashtag (label) to denote a topic. A post with a hashtag is shown not only to the user's followers but also to all users taking part in the topic discussion. To increase the exposure of their posts, spam users often join a discussion under false pretenses by adding a hot-topic hashtag and then inserting spam into the post. According to actual needs, the proportion of posts containing a hashtag among the user's 100 most recent posts may be defined as the hashtag feature of the user's posts.
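The short-link and hashtag features are both "fraction of recent posts matching a pattern". The sketch below illustrates that idea under stated assumptions: the two regular expressions (plain `http(s)` URLs, and topics wrapped in a pair of `#` signs) and the helper name `ratio_matching` are illustrative choices, not taken from the patent.

```python
import re
from typing import Iterable

# Assumed patterns: any http(s) URL, and a topic wrapped in '#' signs.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#[^#]+#")


def ratio_matching(posts: Iterable[str], pattern: "re.Pattern[str]") -> float:
    """Fraction of posts in which `pattern` occurs at least once."""
    posts = list(posts)
    if not posts:
        return 0.0
    return sum(1 for post in posts if pattern.search(post)) / len(posts)


posts = [
    "Big sale today http://example.com/a",
    "#earthquake# stay safe everyone",
    "good morning",
    "#sale# discount http://example.com/b",
]
short_link_feature = ratio_matching(posts, URL_PATTERN)
hashtag_feature = ratio_matching(posts, HASHTAG_PATTERN)
```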

(3) Post length feature

To display as much information as possible in a post, spam users tend to use as much of the 140-character limit as they can, so their posts are generally long, whereas normal users publish both long and short posts. According to actual needs, the average length and the length variance of the user's 100 most recent posts may be computed to characterize post length.
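A minimal sketch of the length feature follows. It assumes length is measured in characters and uses the population variance; the patent does not say which variance estimator is intended, so that choice is an assumption.

```python
from statistics import mean, pvariance
from typing import Sequence, Tuple


def length_features(posts: Sequence[str]) -> Tuple[float, float]:
    """Return (average length, length variance) over a user's posts."""
    lengths = [len(post) for post in posts]
    if not lengths:
        return 0.0, 0.0
    # pvariance = population variance (assumed estimator).
    return mean(lengths), pvariance(lengths)


avg_len, var_len = length_features(["abcd", "ab", "abcdef"])
```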

(4) Post repetition feature

Normal users seldom publish posts with identical content. Spam users do the opposite: to extend the reach and duration of their posts, they often publish posts whose content is essentially identical or completely duplicated. According to actual needs, the pairwise cosine similarity of the user's 100 most recent posts may be calculated, and the average of these similarities is defined as the post repetition feature.
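The repetition feature can be sketched as the mean cosine similarity over all post pairs. The example assumes posts are already tokenized and represented as raw term-count vectors; the patent does not specify the vector representation, so both are assumptions.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def repetition_feature(tokenized_posts: Sequence[List[str]]) -> float:
    """Average pairwise cosine similarity of a user's posts."""
    vectors = [Counter(tokens) for tokens in tokenized_posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Two identical posts and one unrelated post: one of three pairs matches.
score = repetition_feature([["buy", "now"], ["buy", "now"], ["nice", "day"]])
```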

(5) Post word diversity (entropy)

Spam users and normal users publish posts for different purposes, which is also reflected in the characters and words their posts contain. Normal users follow relatively broad topics, so the characters and words used in their posts are widely distributed, whereas spam users usually generate posts from templates, so their posts contain relatively few distinct words. According to actual needs, the user's 100 most recent posts may be collected, stop words and links removed, and the number of occurrences of each distinct character counted character by character. The entropy of this post data set is used as the post word diversity feature:

$$-\sum_{i=1}^{M} p_u(i)\,\log p_u(i)$$

where M denotes the number of distinct characters in the 100 most recent posts of user u, and p_u(i) denotes the ratio of the number of occurrences of the i-th distinct character in those posts to the total number of characters in the data set.
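The entropy feature can be sketched as below. The logarithm base is not stated in the text, so base 2 is assumed here; the input is assumed to be posts already split into characters or tokens with stop words and links removed.

```python
import math
from collections import Counter
from typing import List, Sequence


def word_diversity_entropy(tokenized_posts: Sequence[List[str]]) -> float:
    """Shannon entropy (base 2, assumed) of the token distribution."""
    counts = Counter(tok for post in tokenized_posts for tok in post)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two equally frequent tokens give one bit of entropy.
varied = word_diversity_entropy([["a", "b"], ["a", "b"]])
# A template-like user who repeats a single token has zero entropy.
templated = word_diversity_entropy([["a", "a"], ["a", "a"]])
```

A lower value suggests template-generated content, which matches the intuition in the text that spam users reuse a small vocabulary.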

(6) Forwarded-and-commented rate

Most posts published by spam users are useless spam and are seldom forwarded or commented on by other users. According to actual needs, among the user's 100 most recent original posts, the proportion of posts whose combined forward and comment count exceeds a given threshold may be computed and defined as the user's forwarded-and-commented rate.
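As a small illustration of this rate, the sketch below takes one `(forward_count, comment_count)` pair per original post. The tuple layout, the function name, and the strict "greater than" comparison are assumptions made for the example.

```python
from typing import Sequence, Tuple


def forwarded_comment_rate(
    posts: Sequence[Tuple[int, int]], threshold: int
) -> float:
    """Fraction of posts whose forwards + comments exceed `threshold`."""
    if not posts:
        return 0.0
    hits = sum(1 for forwards, comments in posts if forwards + comments > threshold)
    return hits / len(posts)


# Sums are 8, 1, 10, 0; two of the four exceed the threshold of 5.
engagement = forwarded_comment_rate([(5, 3), (0, 1), (10, 0), (0, 0)], threshold=5)
```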

Identifying spam users is a binary classification problem and can be carried out with a supervised learning algorithm. In this embodiment, a support vector machine (SVM), which performs well on small-sample data sets, may be selected as the classifier. The goal of SVM is to construct an optimal hyperplane that minimizes classification error: the input space is mapped to a higher-dimensional space through the nonlinear transformation of a suitable kernel function, and the optimal separating surface is then sought in the new space. Because the generalization performance of SVM depends on the choice of kernel parameters and the error penalty factor, good classification performance first requires selecting suitable kernel parameters. Three kernel functions are commonly used: the polynomial function, the radial basis function (RBF), and the sigmoid function. An RBF-kernel SVM is used here because it supports classification models with arbitrarily complex boundaries; the implementation uses the open-source SVM library libSVM.

A post has two kinds of source: the user who published it and the application used to publish it. With the spam user database and the application-source blacklist constructed offline, fast source-based filtering of spam posts can be achieved. The filtering process is as follows: for each post in the microblog data stream, extract its user ID and application source, and look up the user ID in the spam user database. If the ID is present, the post is directly judged to be spam. If the ID is not in the spam user database, the post is filtered according to the application that published it: if that application is on the application-source blacklist, the post is considered spam; otherwise it proceeds to subsequent processing.
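The two-step lookup above can be sketched as follows. The `Post` record, its field names, and the use of in-memory sets in place of a real database are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Post:
    user_id: str
    app_source: str
    text: str


def is_spam_by_source(post: Post, spam_user_ids: Set[str], app_blacklist: Set[str]) -> bool:
    """Check the user ID first, then the publishing application."""
    if post.user_id in spam_user_ids:
        return True
    return post.app_source in app_blacklist


def filter_stream(
    posts: Iterable[Post], spam_user_ids: Set[str], app_blacklist: Set[str]
) -> Iterator[Post]:
    """Yield only the posts that pass source-based filtering."""
    for post in posts:
        if not is_spam_by_source(post, spam_user_ids, app_blacklist):
            yield post


stream = [
    Post("u1", "phone client", "earthquake reported downtown"),
    Post("u2", "phone client", "cheap watches"),   # known spam user
    Post("u3", "AdBot", "limited offer"),          # blacklisted application
]
kept = list(filter_stream(stream, spam_user_ids={"u2"}, app_blacklist={"AdBot"}))
```

Set membership keeps each check constant-time on average, which suits a high-volume stream; a production system would presumably back these sets with a persistent store.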

In another embodiment of the present invention, shown in Fig. 2, real-time filtering of the media data (microblog data) stream in the online recognition stage comprises the following:

S101, filtering media data using the spam user database and the application-source blacklist;

S102, performing binary classification on the media data using media content and contextual features to filter non-event media data;

S103, performing cluster analysis on topically similar media data, extracting cluster features, and identifying event clusters, where the cluster features include at least cluster time and cluster topic;

S104, cleaning the media data in the event clusters based on the topic consistency principle, thereby purifying the media data.

In this embodiment, content-based microblog filtering can be implemented with a supervised machine learning method, with SVM selected as the classifier. The process may be designed as follows:

(1) Post preprocessing: word segmentation and stop-word removal;

(2) Feature extraction:

Hyperlink count: commercial spam posts often contain one or more short links pointing to pages such as company promotions, commercial advertisements, or adult advertisements, whereas ordinary posts seldom contain links.

Whether the author is a verified ("V") user: verified users have been authenticated by the microblog platform, have high credibility, and generally do not publish spam casually, whereas spam users generally find it difficult to obtain verification.

Spam word count: spam posts usually contain many advertising or pornographic spam terms, such as "great discounts", "free shipping storewide", and "buy more, get more". A spam vocabulary is built, and the number of times its terms occur in the post is counted.

Post length: spam posts and ordinary non-spam posts also differ in length. To include more information, spam posts are often long; in other cases they are very short and carry very little information. Ordinary posts are generally of moderate length.

Emotion word count: the microblog is a platform where users exchange information and express emotion, so ordinary posts often contain words and emoticons that express the author's emotions and evaluations, such as "awesome", "happy", "blessings", [tears], and [emotion]. This embodiment may build an emotion vocabulary containing a set number of emotion words and a set number of emoticons, and then count the occurrences of emotion words and emoticons in the post.

Whether the hashtag count exceeds one: a microblog uses a hashtag (label) to denote a topic. To expand their reach, some spam posts include the hashtags of one or more hot topics, whereas an ordinary post generally contains no more than one hashtag.

@ count: ordinary posts contain a limited number of @ mentions, whereas spam posts often add multiple @ mentions to reach multiple users and widen their spread.

Whether the post begins by mentioning (@) another user: a normally forwarded post automatically carries an @ marker (//@) at its beginning.

Whether another user is mentioned (@) in the middle of the post: inserting "@username" in the middle of a post actively pushes the post to that user.

Named entity count: posts about events contain named entities such as times, places, and people, whereas such entities are less likely to appear in spam posts. The post is preprocessed by removing hashtags, links, and mentions; word segmentation and part-of-speech tagging are then performed, and the ratio of named entities to the total length of the post is calculated.

Special punctuation count: punctuation in ordinary posts generally consists of commas, periods, semicolons, question marks, and exclamation marks. To evade the platform's filtering mechanism, some spam posts insert special punctuation marks such as "*", "&", "[", and "]" inside spam keywords that would otherwise be easy to detect. These marks seldom appear in ordinary posts, so the number of such special punctuation marks in a post is counted as an important feature of spam posts.

Whether the post is a forward: spam posts are seldom forwarded by other users, whereas ordinary posts are more likely to be forwarded. Whether a post is a forward is therefore used as an important feature for distinguishing spam posts from ordinary posts.

Forward-to-comment ratio: the forward and comment counts of a post are directly affected by the time of collection and are unsuitable as standalone discriminating features. The ratio of a post's forward count to its comment count is used instead to distinguish spam posts from ordinary posts.
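Several of the surface features listed above can be computed directly from the post text. The sketch below covers a subset only (it omits verified status, emotion words, and named entities, which need external resources). The regular expressions, the special-punctuation set, the dictionary keys, and the zero-comment fallback are all assumptions for illustration.

```python
import re
from typing import Dict, Iterable, Union

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#[^#]+#")      # assumed: topic wrapped in '#'
MENTION_RE = re.compile(r"@\w+")
SPECIAL_PUNCT = set("*&[]")              # assumed subset of special marks


def content_features(
    text: str, spam_words: Iterable[str], forwards: int, comments: int
) -> Dict[str, Union[int, float, bool]]:
    """Extract a few content features from a single post."""
    return {
        "url_count": len(URL_RE.findall(text)),
        "spam_word_count": sum(text.count(word) for word in spam_words),
        "length": len(text),
        "multi_hashtag": len(HASHTAG_RE.findall(text)) > 1,
        "mention_count": len(MENTION_RE.findall(text)),
        "starts_with_mention": text.lstrip().startswith(("@", "//@")),
        "special_punct_count": sum(1 for ch in text if ch in SPECIAL_PUNCT),
        # Assumption: with zero comments, fall back to the forward count.
        "forward_comment_ratio": forwards / comments if comments else float(forwards),
    }


features = content_features(
    "#sale# #deal# buy now http://example.com/a @alice @bob **free**",
    spam_words=["buy now", "free"],
    forwards=2,
    comments=4,
)
```

The resulting dictionary could be converted to a numeric vector and passed to whichever classifier is used; the feature order and scaling are left open here.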

In the SVM-based spam post identification process, 1,000 posts may be randomly selected from the microblog data stream and manually labeled as spam or not, yielding a set of spam posts and a set of ordinary non-spam posts. This 1,000-post data set is divided into 10 parts, and training and testing are performed with 10-fold cross-validation.
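The 10-fold split can be sketched without any machine learning library. The generator below only produces index partitions; the shuffling seed and the function name are assumptions, and the classifier training itself is left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1,000 labeled posts split into 10 folds of 100 test samples each.
splits = list(k_fold_indices(1000, k=10))
```

Each fold serves once as the test set while the other nine are used for training, and the per-fold scores are typically averaged.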

Online clustering is an efficient unsupervised machine learning algorithm that has been widely applied in microblog event detection. In this embodiment, cluster analysis is performed on topically similar media data. After online clustering, the microblog data stream yields a set of post clusters: posts within a cluster have high similarity, and posts in different clusters have low similarity. Three kinds of cluster feature are extracted, and a recognition model is trained with a supervised machine learning algorithm to filter non-event clusters. The three kinds of cluster feature are as follows:

(1) Topic features

Topic features describe how consistent the topics within a post cluster are. It is generally assumed that the topics in an event cluster are relatively concentrated, whereas the topics covered by a non-event cluster are more dispersed. The co-occurring words in event and non-event clusters also differ. Posts in an event cluster often contain many co-occurring words that describe a common topic from different angles. A non-event cluster contains fewer co-occurring words, and those words point to different topics, such as "sleep" and "work". In this embodiment, the consistency of a cluster's topic can be characterized by the mean and variance of the cosine similarity between each post and the cluster center. In addition, the occurrence counts of the different words in the cluster are tallied, and the percentage of posts containing the most frequent words (up to a set number of them) is calculated.
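The topic features can be sketched as below. The example assumes tokenized posts, a cluster center formed by summing term counts (cosine similarity is unaffected by scaling), population variance, and "coverage" meaning a post contains at least one of the top words; the text does not fix these details.

```python
import math
from collections import Counter
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_features(
    tokenized_posts: Sequence[List[str]], top_n: int = 3
) -> Tuple[float, float, float]:
    """Return (mean similarity, similarity variance, top-word coverage)."""
    if not tokenized_posts:
        return 0.0, 0.0, 0.0
    vectors = [Counter(tokens) for tokens in tokenized_posts]
    center = Counter()
    for vec in vectors:
        center.update(vec)  # summed counts act as the cluster center
    sims = [cosine(vec, center) for vec in vectors]
    top_words = {word for word, _ in center.most_common(top_n)}
    covered = sum(1 for tokens in tokenized_posts if top_words & set(tokens))
    return mean(sims), pvariance(sims), covered / len(tokenized_posts)


cluster = [
    ["quake", "city"],
    ["quake", "rescue"],
    ["quake", "city", "rescue"],
]
mean_sim, var_sim, coverage = topic_features(cluster, top_n=1)
```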

(2) Social features

Social behaviors among microblog users include forwarding, commenting, replying, and mentioning. Forwarding means that a user republishes the original content of another author's post in the user's own microblog space; the user may also add a comment on the original post while forwarding. A reply is a user's separate response to another user's comment. A mention is when a user directs a post to a designated user with the @ marker. These four social behaviors differ between event posts and non-event posts. For example, posts published by influential verified users ("big V" accounts) are often forwarded in large numbers, yet such posts may all be personal status updates that contain no event information. In this embodiment, the proportions of posts in each cluster that involve forwarding, commenting, replying, and mentioning can be computed and used as the social features of the cluster.
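A minimal sketch of the social features follows. It assumes each post is a dictionary of boolean flags; the key names are hypothetical and chosen only for the example.

```python
from typing import Dict, Mapping, Sequence

# Hypothetical flag names, one per social behavior named in the text.
SOCIAL_KINDS = ("is_forward", "is_comment", "is_reply", "has_mention")


def social_features(posts: Sequence[Mapping[str, bool]]) -> Dict[str, float]:
    """Fraction of posts in a cluster showing each social behavior."""
    n = len(posts)
    if n == 0:
        return {kind: 0.0 for kind in SOCIAL_KINDS}
    return {
        kind: sum(1 for post in posts if post.get(kind, False)) / n
        for kind in SOCIAL_KINDS
    }


cluster_posts = [
    {"is_forward": True, "has_mention": True},
    {"is_forward": True},
    {"is_reply": True},
    {},
]
social = social_features(cluster_posts)
```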

(3) Temporal features

Within the lifetime of a cluster (from the earliest to the latest publication time of its microblogs), the words in different types of microblog cluster exhibit different temporal characteristics. High-frequency words in an event cluster often show a "burst" pattern, whereas some high-frequency words in a non-event cluster show a "periodic" pattern. Here, for the high-frequency words in each microblog cluster, a time-ordered histogram of their occurrence counts over the past 72 hours is generated, and two types of temporal feature of the cluster are calculated from it. The first is the expectation deviation of high-frequency words: for each high-frequency word, the difference between its occurrence count in the cluster at the current time and its expected count is computed and divided by the number of microblogs in the cluster per hour, where the expected count is the mean of the counts over the past 72 hours; each high-frequency word is then assigned a weight according to how often it occurs in the cluster, which finally yields the weighted high-frequency-word expectation deviation of the cluster. The second is the degree of fit between the high-frequency-word histogram and an exponential function. This feature relies on the observation that hot words in social networks follow an exponential distribution; in the embodiment of the present invention, the exponential function corresponding to the histogram of the high-frequency words can be fitted by the least squares method, and the R² statistic is calculated to measure the goodness of fit.
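The two temporal features can be sketched as follows. This is an illustrative sketch under several assumptions that the text leaves open: weights are taken proportional to each word's total count in the window, the divisor is the number of cluster posts in the current hour, and the exponential y = a·exp(b·t) is fitted by ordinary least squares on ln(y) over non-empty bins, with R² computed on that log scale. All function and parameter names are hypothetical.

```python
import math


def expectation_deviation(history, current, posts_this_hour):
    """history: {word: hourly counts over the past window (e.g. 72 values)}.
    current: {word: count in the current hour}.
    """
    total = sum(sum(counts) for counts in history.values())
    if total == 0 or posts_this_hour == 0:
        return 0.0
    score = 0.0
    for word, counts in history.items():
        if not counts:
            continue
        expected = sum(counts) / len(counts)   # mean count over the window
        weight = sum(counts) / total           # assumed weighting scheme
        score += weight * (current.get(word, 0) - expected) / posts_this_hour
    return score


def exponential_fit_r2(counts):
    """R^2 of a log-linear least-squares fit of y = a * exp(b * t)."""
    pts = [(t, math.log(y)) for t, y in enumerate(counts) if y > 0]
    n = len(pts)
    if n < 3:
        return 0.0  # too few points for a meaningful fit
    mean_t = sum(t for t, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (z - mean_z) for t, z in pts)
    slope = sxy / sxx
    intercept = mean_z - slope * mean_t
    ss_res = sum((z - (intercept + slope * t)) ** 2 for t, z in pts)
    ss_tot = sum((z - mean_z) ** 2 for _, z in pts)
    # A flat histogram has no variance to explain; report 0.0 in that case.
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0
```

A bursty word such as one whose hourly counts roughly double each hour would yield an R² close to 1, while a word with a daily periodic pattern would fit poorly.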

Based on the topic consistency principle, the media data in event clusters is cleaned up. Most microblogs in an event cluster are closely related to the event the cluster describes, but the cluster often also contains posts that are unrelated or only spuriously related to the event topic, and these need to be removed to improve the quality of event detection. In the embodiment of the present invention, purification of the media data (microblog data) can be implemented as follows. First, word segmentation and stop-word removal are performed on the microblogs, and the words whose frequency in the cluster exceeds a given threshold are selected as the cluster centroid. Then, word weights are calculated according to the TF-IDF (term frequency-inverse document frequency) of the posts, and the weights of the centroid words in a single microblog are summed to obtain the similarity between that microblog and the cluster centroid. Finally, microblogs whose similarity is below a specified threshold are removed from the cluster.
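The purification steps above can be sketched as follows, assuming the posts have already been segmented and stripped of stop words. The function name, the default thresholds, and the smoothed IDF computed over the cluster itself are assumptions made for illustration; the text does not state which corpus the document frequencies are taken from.

```python
import math
from collections import Counter


def purify_cluster(posts, min_freq=2, min_sim=0.1):
    """posts: list of token lists for one event cluster.

    Keeps the posts whose summed TF-IDF weight over centroid words
    reaches min_sim; both thresholds are hypothetical defaults.
    """
    n = len(posts)
    cluster_tf = Counter(t for p in posts for t in p)
    # Centroid: words whose frequency in the cluster exceeds the threshold.
    centroid = {t for t, c in cluster_tf.items() if c > min_freq}
    doc_freq = Counter(t for p in posts for t in set(p))
    kept = []
    for p in posts:
        if not p:
            continue  # an empty post cannot match the centroid
        tf = Counter(p)
        sim = 0.0
        for t in centroid:
            if t in tf:
                # Assumption: smoothed IDF computed within the cluster.
                idf = math.log((1 + n) / (1 + doc_freq[t])) + 1.0
                sim += (tf[t] / len(p)) * idf
        if sim >= min_sim:
            kept.append(p)
    return kept
```

For example, with min_freq=1, a post such as ["sleep", "coffee"] in a cluster dominated by "quake", "rescue", and "city" shares no centroid word, receives a similarity of zero, and is removed.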

Based on the above method, an embodiment of the present invention further provides a multi-strategy media data stream filtering device for event detection which, as shown in Figure 3, comprises an offline training module, a filtering module, a clustering module, and a cleaning module.

As shown in Figure 4, after the spam user database and the spam application blacklist have been established in the offline stage, the online stage proceeds as follows. First, microblogs published and forwarded by spam users are filtered according to the author and the application source of each microblog: if the author is present in the spam user database, or the microblog originates from a blacklisted application, the post is filtered out directly; otherwise it passes to the next stage. Next, non-event microblogs are filtered based on the content information of the microblogs. Online clustering is then performed on the microblogs that have passed these two filtering levels, forming microblog clusters. Non-event clusters are filtered according to the topic features, social features, and temporal features of the clusters. Finally, low-quality microblogs in the event clusters are cleaned up based on the topic consistency principle.

Conventional methods ignore the semantic information of posts in microblog event tracking, which leads to unsatisfactory tracking results. To address this problem, a microblog event tracking method based on Wikipedia knowledge is proposed in combination with microblog text features. The feature vectors representing events and posts are mapped from the word space to the Wikipedia entity space. On the one hand, replacing the feature words of an event (or post) with Wikipedia entities extends the feature vocabulary; on the other hand, the mapping process also eliminates the influence of synonyms and polysemous words. Compared with conventional methods, the method of this application can make full use of the semantic information in Wikipedia knowledge, and its performance is therefore better than that of the compared methods.
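The first stage of the online flow of Figure 4 (filtering by author and by application source) reduces to two set-membership checks. The sketch below is illustrative only; the field names user_id and source, and the use of in-memory sets for the database and the blacklist, are hypothetical.

```python
def passes_first_stage(post, spam_users, source_blacklist):
    """True if the post's author is not a known spam user and the post
    does not come from a blacklisted application source."""
    return (
        post["user_id"] not in spam_users
        and post["source"] not in source_blacklist
    )


def first_stage_filter(stream, spam_users, source_blacklist):
    """Lazily yield the posts that proceed to the content-based filter."""
    for post in stream:
        if passes_first_stage(post, spam_users, source_blacklist):
            yield post
```

Because the function is a generator, it can sit in front of the content-based classifier and the online clustering step without buffering the stream.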

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be regarded as limited to this description. A person of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the inventive concept, all of which shall be regarded as falling within the scope of patent protection determined by the submitted claims. The foregoing description of the disclosed embodiments enables a person skilled in the art to implement or use the application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the application. The application is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-strategy media data stream filtering method for event detection, characterized by comprising the following:

in an offline stage, a spam user database and an application-source blacklist are constructed from collected media user data;

in an online recognition stage, for a media data stream, media data is filtered using the spam user database and the application-source blacklist, non-event media data is filtered using media content and context features, online clustering is performed on the media data, event clusters are identified, and the media data in the event clusters is purified.

2. The multi-strategy media data stream filtering method for event detection according to claim 1, characterized in that, in the offline stage, for filtering media data based on user and source, the social networks of individual users and the media data they have published are collected, user behavior features and media data content features are extracted, and the spam user database and the application-source blacklist are constructed offline, with spam users identified by supervised machine learning; for the media data stream, it is determined whether the media user exists in the spam user database or whether the media data originates from a source in the application-source blacklist, and, if so, the media data is filtered out directly.

3. The multi-strategy media data stream filtering method for event detection according to claim 2, characterized in that the user behavior features comprise user reputation, forwarding rate, and activity level; the user reputation is obtained from the user's follower count, the number of accounts the user follows, the follower counts of the user's followers, and the numbers of accounts followed by the user's followers; the forwarding rate is obtained as the proportion of forwarded media data among a plurality of media data items published by the user; and the activity level is obtained from the number of days spanned by the media data published by the user and the number of days since the user registered.

4. The multi-strategy media data stream filtering method for event detection according to claim 2, characterized in that the media data content features comprise a short-link feature, a hashtag feature, a post length feature, a post repetition feature, a post word-diversity feature, and a forwarded-and-commented rate; the short-link feature is obtained as the proportion of media data containing a URL among several media data items published by the user; the hashtag feature is obtained as the proportion of media data containing a hot-topic hashtag among several media data items published by the user; the post length feature is calculated as the mean and the variance of the lengths of several media data items published by the user; the post repetition feature is obtained as the mean pairwise cosine similarity of several media data items published by the user; the post word-diversity feature is obtained by counting several media data items published by the user, from three quantities: the number of distinct characters, the occurrence count of each distinct character, and the ratio to the total number of characters in the media data; and the forwarded-and-commented rate is obtained as the proportion given by the sum of the numbers of times several media data items published by the user have been forwarded and commented on.

5. The multi-strategy media data stream filtering method for event detection according to claim 1, characterized in that, in the online recognition stage, for the media data stream: first, media data is filtered using the spam user database and the application-source blacklist; then, binary classification is performed on the media data using media content and context features, and non-event media data is filtered out; cluster analysis is performed on media data with similar topics, cluster features are extracted, and event clusters are identified, wherein the cluster features comprise at least a cluster time and a cluster topic; and, based on the topic consistency principle, the media data in the event clusters is cleaned up so as to purify the media data.

6. The multi-strategy media data stream filtering method for event detection according to claim 5, characterized in that filtering non-event media data comprises the following: first, online clustering is performed on the media data stream by unsupervised machine learning to obtain media clusters, and cluster features are extracted; then, a model is trained by supervised machine learning, and non-event media clusters are filtered out by the trained model.

7. The multi-strategy media data stream filtering method for event detection according to claim 6, characterized in that the cluster features comprise topic features, social features, and temporal features, wherein the topic features are obtained from the mean and the variance of the cosine similarity between the media data and the cluster center; the social features are obtained by counting the proportions of media data in each media cluster that involve forwarding, commenting, replying, and mentioning; and the temporal features are obtained by counting the occurrence counts of the high-frequency words in the media cluster and generating a time-ordered frequency histogram.

8. The multi-strategy media data stream filtering method for event detection according to claim 7, characterized in that the temporal features comprise the following two types: 1) the expectation deviation of high-frequency words: the difference between the occurrence count of each high-frequency word in the cluster at the current time and its expected count is computed and divided by the number of media data items in the cluster per hour, wherein the expected count is calculated as the mean of the counts over a historical time period, and each high-frequency word is assigned a weight according to how often it occurs in the cluster, yielding the weighted high-frequency-word expectation deviation of the cluster; 2) the degree of fit between the high-frequency-word histogram and an exponential function: based on the observation that hot words in social networks follow an exponential distribution, the exponential function corresponding to the histogram of the high-frequency words is fitted by the least squares method, and a statistic is calculated to measure the degree of fit between the histogram and the exponential function.

9. The multi-strategy media data stream filtering method for event detection according to claim 5, characterized in that purifying the media data comprises the following: word segmentation and stop-word removal are performed on the media data, and the words whose frequency in the cluster exceeds a given threshold are selected as the cluster centroid; word weights are calculated according to the term frequency-inverse document frequency of the posts, and the weights of the centroid words in a single media data item are summed to obtain the similarity between that media data item and the cluster centroid; and media data whose similarity is below a specified threshold is removed from the cluster.

10. A multi-strategy media data stream filtering device for event detection, characterized by comprising an offline training module, a filtering module, a clustering module, and a cleaning module, wherein

the offline training module is configured to construct a spam user database and an application-source blacklist from collected media user data;

the filtering module is configured, for a media data stream, first to filter media data based on the spam user database and the application-source blacklist, and then to perform binary classification on the media data using media content and context features so as to filter out non-event media data;

the clustering module is configured to perform online cluster analysis on media data with similar topics, extract cluster features, and identify event clusters, wherein the cluster features comprise at least a cluster time and a cluster topic;