CN108959484B

CN108959484B - Multi-strategy media data stream filtering method and device for event detection

Info

Publication number: CN108959484B
Application number: CN201810645129.6A
Authority: CN
Inventors: 陈刚; 唐永旺; 魏晗; 席耀一; 郭志刚; 袁江林
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2020-07-28
Anticipated expiration: 2038-06-21
Also published as: CN108959484A

Abstract

The invention relates to a multi-strategy media data stream filtering method and device for event detection. The method comprises the following steps: in the offline stage, a junk user database and an application source blacklist are constructed from collected media user data; in the online identification stage, the media data stream is filtered through the junk user database and the application source blacklist, non-event media data are filtered using media content and context features, the media data are clustered online, event clusters are identified, and the media data within the event clusters are purified. The method and device reduce the influence of noise data and other non-event data in a microblog data stream on microblog event detection, can remove most non-event microblogs from the stream, improve microblog event detection performance, offer strong real-time performance and practicability, facilitate timely extraction of hot topics and emergency events, and provide guidance for new media data stream processing technology.

Description

Multi-strategy media data stream filtering method and device for event detection

Technical Field

The invention belongs to the technical field of media data processing, and particularly relates to a multi-strategy media data stream filtering method and device for event detection.

Background

As a typical representative of emerging media, the microblog is an important platform for quickly and conveniently publishing viewpoints and for sharing and spreading information. Owing to its convenience, immediacy and interactivity, the microblog outperforms traditional media and portal websites in reporting and spreading hot topics and important events of wide public concern, and it has become an important information source for industries such as information collection, marketing and public opinion monitoring. Microblog-oriented event detection technology can extract current social hot topics and major emergencies from massive microblog data, helping users follow news trends and learn of major events happening nearby. However, in addition to hot news and emergency reports, microblogs are also full of useless information, including advertisements, daily life trivia, online rumors and spam generated automatically by servers. Distinguishing this junk information from meaningful event microblogs has become one of the main challenges in event detection on microblog data streams. To address this problem, researchers have purified microblog data with filtering strategies, which cleans the data stream to some extent and improves event detection performance. However, each of these approaches relies on a single filtering strategy, the purification effect is limited, and the purification effect cannot be evaluated.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a multi-strategy media data stream filtering method and device for event detection, which can be used for cleaning most of non-event microblogs in a microblog data stream, effectively improving microblog event detection performance and better helping a user to know news dynamics.

According to the design scheme provided by the invention, the method for filtering the multi-strategy media data stream facing the event detection comprises the following contents:

in the off-line stage, a junk user database and an application source blacklist are constructed according to the collected media user data;

and in the online identification stage, aiming at the media data stream, filtering the media data through a junk user database and an application source blacklist list, filtering non-event media data through media content and context characteristics, performing online clustering on the media data, identifying event clusters, and purifying the media data in the event clusters.

In the offline stage, media data are filtered on the basis of users and sources: the personal social relations and published media data of users are collected, user behavior features and media data content features are extracted, junk users are identified through supervised machine learning, and the junk user database and the application source blacklist are constructed offline. If a media user in the media data stream exists in the junk user database, or the media data come from a source on the application source blacklist, the media data are filtered out directly.

Preferably, the user behavior characteristics include a user reputation degree, a forwarding rate and an activity degree, the user reputation degree is obtained according to the number of user fans, the number of objects concerned by the user, the number of fans of the fan user and the number of objects concerned by the fan user, the forwarding rate is obtained according to the proportion of the forwarded media data in the plurality of pieces of media data published by the user, and the activity degree is obtained according to the number of days spanned by the media data published by the user and the number of days registered by the user.

Preferably, the media data content features comprise a short-link feature, a hashtag feature, a post length feature, a post repetition feature, a post character diversity feature and a forwarded-and-commented rate. The short-link feature is the proportion of media data containing a URL among a plurality of media data published by the user. The hashtag feature is the proportion of media data containing hot topic hashtags among the plurality of media data published by the user. The post length feature is obtained from the average length and the length variance of the plurality of media data published by the user. The post repetition feature is the average pairwise cosine similarity among the plurality of media data published by the user. The post character diversity feature is obtained by counting, over the media data published by the user, the number of distinct characters, the number of occurrences of each distinct character, and the ratio of those occurrences to the total number of characters. The forwarded-and-commented rate is obtained by counting, among the plurality of media data published by the user, the proportion of media data whose combined number of forwards and comments exceeds a set threshold.

In the above-mentioned online identification stage, for the media data stream, first, media data filtering is performed through the garbage user database and the application source blacklist; secondly, performing secondary classification on the media data by using the media content and the context characteristics, and filtering non-event media data; performing clustering analysis on media data with similar themes, extracting cluster characteristics, and identifying event clusters, wherein the cluster characteristics at least comprise cluster time and cluster themes; and based on the principle of consistent themes, cleaning the media data in the event cluster and purifying the media data.

The filtering non-event media data includes the following contents: firstly, carrying out online clustering processing on a media data stream through unsupervised machine learning to obtain a media cluster, and extracting cluster characteristics; then, model training is carried out by using supervised machine learning, and the non-event media cluster is filtered through the trained model.
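The text does not name a specific online clustering algorithm. As a minimal sketch only, assuming a single-pass incremental scheme with a cosine-similarity threshold (the function names, the `threshold` value and the token-list input format are all hypothetical), the clustering step could look like this:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass_cluster(stream, threshold=0.5):
    """Assign each tokenized post to its most similar cluster, or open a new one.

    stream: iterable of token lists (assumed already segmented).
    threshold: assumed similarity cut-off, not a value from the patent.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [post index, ...]}
    for idx, tokens in enumerate(stream):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # The running sum of term counts serves as an unnormalized centroid.
            best["centroid"].update(vec)
            best["members"].append(idx)
        else:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
    return clusters
```

Each post is processed once and compared only against the current centroids, which is what makes such a scheme usable on a stream. The resulting clusters would then be passed to the supervised model described above.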

Preferably, the cluster features include topic features, social features and time-series features, wherein the topic features are obtained from the mean and variance of the cosine similarity between the media data and the cluster center; the social features are obtained by counting the proportions of forwarding, commenting, replying and mentioning in each media cluster; and the time-series features are obtained by counting the occurrence frequency of high-frequency words in the media cluster and generating a frequency histogram ordered by time.

Further, the time-series features include the following two types: 1) the weighted expected deviation of high-frequency words in the cluster: the difference between the current occurrence frequency and the expected frequency of each high-frequency word in the cluster is calculated and divided by the number of media data per hour in the cluster, wherein the expected frequency is the average frequency over a historical time period, and each high-frequency word is assigned a weight according to its frequency in the cluster; 2) the degree of fit between the histogram distribution of high-frequency words and an exponential function: based on the observation that hot words in social networks follow an exponential distribution, the exponential function corresponding to the high-frequency word histogram is fitted by the least squares method, and a statistic is calculated to measure the degree of fit.

Purifying the media data includes the following: word segmentation and stop-word removal are performed on the media data, and words whose frequency within the cluster exceeds a given threshold are selected as the cluster centroid; the weight of each word is calculated from its term frequency-inverse document frequency (TF-IDF) in the post, and the weights of the centroid words appearing in a single item of media data are summed to obtain the similarity between that item and the cluster centroid; media data whose similarity is below a specified threshold are removed from the cluster.
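The purification step can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the posts are assumed to be already segmented with stop words removed, the `idf` table is a hypothetical precomputed mapping, term frequency is taken as the count divided by post length, and both thresholds are placeholder values.

```python
from collections import Counter


def purify_cluster(posts, idf, min_word_freq=2, min_similarity=0.5):
    """Keep posts whose summed TF-IDF weight over centroid words reaches min_similarity.

    posts: list of token lists (segmented, stop words removed).
    idf: hypothetical mapping word -> inverse document frequency.
    """
    # Centroid = words whose frequency in the cluster exceeds the threshold.
    cluster_freq = Counter(tok for post in posts for tok in post)
    centroid = {w for w, n in cluster_freq.items() if n > min_word_freq}

    kept = []
    for post in posts:
        if not post:
            continue
        tf = Counter(post)
        # Similarity = sum of TF-IDF weights of the centroid words present in the post.
        score = sum((tf[w] / len(post)) * idf.get(w, 0.0) for w in centroid if w in tf)
        if score >= min_similarity:
            kept.append(post)
    return kept
```

With a cluster dominated by one topic word, a post that shares no centroid word scores zero and is dropped, while on-topic posts are kept.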

An event detection oriented multi-policy media data stream filtering apparatus, comprising: an off-line training module, a filtering module, a clustering module and a purifying module, wherein,

the offline training module is used for constructing a junk user database and an application source blacklist according to the collected media user data;

the filtering module is used for filtering media data based on a junk user database and an application source blacklist list aiming at the media data stream, then performing secondary classification on the media data by utilizing media content and context characteristics, and filtering non-event media data;

the clustering module is used for carrying out online clustering analysis on the media data with similar themes, extracting cluster characteristics and identifying event clusters, wherein the cluster characteristics at least comprise cluster time and cluster themes;

and the purification module is used for cleaning the media data in the event cluster based on the theme consistency principle and purifying the media data.

The invention has the beneficial effects that:

according to the method, a junk user database is established in an off-line mode based on behavior and content characteristics of microblog users, and microblog information from junk users and application blacklists is filtered according to the database and a junk application blacklist list; classifying the microblogs for two times by utilizing the microblog content and the context characteristics, and filtering most of the non-event microblogs; clustering analysis is carried out on microblogs with similar themes by means of an online clustering technology, and various characteristics such as cluster time, theme and the like are extracted to identify event clusters; cleaning low-quality microblogs in the event cluster based on a theme consistency principle; the influence of noise data and other non-event data in the microblog data stream on microblog event detection is effectively solved, most non-event microblogs in the microblog data stream can be cleaned, microblog event detection performance is effectively improved, instantaneity and practicability are high, current social hot topics and major emergency events can be conveniently and timely extracted, and the method has important guiding significance for a new media data stream event monitoring technology.

Description of the drawings:

FIG. 1 is a schematic flow chart of the method in the example;

FIG. 2 is a flow chart of media data stream filtering during an online identification phase according to an embodiment;

FIG. 3 is a schematic view of an embodiment of the apparatus;

FIG. 4 is a schematic diagram of the operation of the apparatus according to the embodiment;

Detailed description:

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and technical solutions.

Existing filtering techniques for media data such as microblog content purify microblog data with filtering strategies, which cleans the data stream to some extent and improves event detection performance, but each relies on a single strategy and the purification effect is limited. To this end, referring to FIG. 1, an embodiment of the present invention provides a multi-strategy media data stream filtering method for event detection, including the following:

The multi-strategy media data filtering oriented to event detection comprises an off-line stage and an on-line stage, the off-line part realizes discovery of junk users, construction of junk user databases and application source blacklist lists, and the on-line part can filter media data streams in real time, effectively solves the influence of noise data and other non-event data in microblog data streams on microblog event detection, and improves microblog event detection performance.

In the off-line stage, personal social relationship information of the user and all microblogs published in the recent period are collected, two types of characteristics of user behaviors and microblog contents are extracted, and the spam user is found by using a supervised machine learning algorithm. The user behavior characteristics comprise the following contents:

(1) User reputation

Junk users and normal users differ considerably in the number of accounts they follow and the number of fans they have. To widen their reach, junk users, particularly advertising users, often follow many accounts, but because the quality of their microblogs is generally low, they rarely attract many fans. The ratio of the number of fans to the number of followed accounts therefore reflects the reputation of a user to some extent. The reputation of a user is also closely related to the quality of the user's fans: the fans of non-junk users generally have high reputation, while the fans of junk users are either few in number or low in quality (the fans themselves may be junk users). The reputation of user u is defined by a formula (not reproduced in this text) over the following quantities: the number of fans of user u; the number of accounts followed by user u; the number of fans of the i-th fan of user u; the number of accounts followed by the i-th fan of user u; and M, the number of fans of user u. This definition of user reputation incorporates the influence of fan quality on the reputation of the user.

(2) Forward rate

Junk users and normal users also differ in how often they forward the microblogs of other users. Junk users, particularly advertising users, tend to publish large numbers of original microblogs, whereas ordinary non-junk users publish original microblogs and also forward many microblogs of other users. According to actual needs, the microblog forwarding rate can be defined as the proportion of forwarded microblogs among the 100 microblogs most recently published by the user.
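The forwarding rate is a simple proportion. A minimal sketch, assuming each post is a dict with a boolean `is_forward` field (a hypothetical schema) and that the list is ordered from oldest to newest:

```python
def forwarding_rate(posts, window=100):
    """Share of forwarded posts among the user's most recent `window` posts."""
    recent = posts[-window:]
    if not recent:
        return 0.0
    return sum(1 for p in recent if p["is_forward"]) / len(recent)
```

A user with one forward among four recent posts would get a rate of 0.25; according to the text, a very low rate is one signal that the account behaves like an advertising account.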

(3) Recent liveness

Newly registered junk accounts are easily identified by the microblog platform, so spammers tend to spread junk information through accounts that have been registered for some time, which shows up as a large amount of junk information published within a short period. This behavior is captured by the "recent activity" index, defined by a formula (not reproduced in this text) over two quantities: the number of days spanned by the 100 original microblogs most recently published by user u (the count of 100 can be set according to actual needs), and the number of days since user u registered. For a dormant account that is suddenly activated (for example, an ordinary user account taken over by an account thief), the value of recent activity is high; for a user who is always active, or an ordinary user who has been inactive for a long time, the value is low.

The microblog content characteristics comprise the following contents:

(1) Short-link feature

Microblog length is limited to 140 characters. To carry more information within this limit, junk users often add short links to their microblogs to spread junk information, whereas normal users generally add few links to original posts (news and media accounts are an exception). The number of microblogs containing a URL among those recently published by the user is counted, and its ratio to the total number of recently published microblogs is defined as the short-link feature of the user's posts. According to actual needs, the 100 microblogs most recently published by the user can be used.

(2) Hashtag feature

Microblogs use hashtags to denote topics. A microblog carrying a hashtag is shown to the user's fans and is also visible to all users taking part in the topic discussion. To increase the exposure of their posts, junk users often pretend to join a topic discussion by adding a hot topic hashtag and then insert junk information into the post. According to actual needs, the proportion of microblogs containing a hashtag among the 100 microblogs most recently published by the user can be defined as the hashtag feature of the user's posts.

(3) Post length feature

To present as much information as possible, junk users tend to use the full 140-character limit, so their posts are generally long, whereas normal users publish posts of varying lengths. According to actual needs, the average length and the length variance of the 100 microblogs most recently published by the user can be computed to represent the post length feature.
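The short-link, hashtag and post length features described so far are all per-user statistics over a window of recent posts. The sketch below computes them together; the regular expressions (plain `http(s)` URLs and Weibo-style `#topic#` tags) and the use of population variance are assumptions made for illustration.

```python
import re
from statistics import mean, pvariance

URL_RE = re.compile(r"https?://\S+")       # assumed URL pattern
HASHTAG_RE = re.compile(r"#[^#\s]+#")      # assumed Weibo-style #topic# format


def content_ratios(texts, window=100):
    """Short-link ratio, hashtag ratio and length statistics over recent posts."""
    recent = texts[-window:]
    if not recent:
        return {"short_link_ratio": 0.0, "hashtag_ratio": 0.0,
                "mean_length": 0.0, "length_variance": 0.0}
    n = len(recent)
    lengths = [len(t) for t in recent]
    return {
        "short_link_ratio": sum(1 for t in recent if URL_RE.search(t)) / n,
        "hashtag_ratio": sum(1 for t in recent if HASHTAG_RE.search(t)) / n,
        "mean_length": mean(lengths),
        "length_variance": pvariance(lengths),
    }
```

These values would form part of the per-user feature vector passed to the classifier.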

(4) Post repetition feature

Normal users rarely publish posts with identical content, whereas junk users often publish nearly identical or fully duplicated posts in order to extend the reach and duration of their microblogs. According to actual needs, the cosine similarity between every pair of the 100 microblogs most recently published by the user can be calculated, and the average of these similarities is defined as the post repetition feature.
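The text does not say how posts are vectorized before the cosine similarity is taken. As a sketch, assuming simple character-count vectors (word-level or TF-IDF vectors would work the same way), the mean pairwise similarity can be computed as:

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def repetition_feature(texts, window=100):
    """Mean cosine similarity over all pairs of the most recent posts."""
    vecs = [Counter(t) for t in texts[-window:]]  # character counts (assumed representation)
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

With 100 posts this evaluates 4950 pairs, which is inexpensive for an offline stage.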

(5) Post character diversity (entropy calculation)

Junk users publish microblogs for different purposes than normal users, and this is reflected in the characters and words their posts contain. Normal users follow a relatively wide range of topics, so the characters and words in their posts are widely distributed; junk users often generate posts from templates, so their posts contain relatively few distinct characters and words. According to actual needs, the 100 microblogs most recently published by the user can be collected, stop words and links removed, and the number of occurrences of each distinct character counted. The entropy of this microblog data set is then calculated as the character diversity feature of the posts,

where M denotes the number of distinct characters in the 100 microblogs most recently published by user u, and p_u(i) denotes the ratio of the number of occurrences of the i-th distinct character in those 100 microblogs to the total number of characters.
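The formula itself does not survive in this text. Assuming it is the standard Shannon entropy over the M distinct characters, that is, the negative sum of p_u(i) log p_u(i), a minimal sketch follows. The base-2 logarithm is an assumption, and the posts are assumed to have had stop words and links removed beforehand.

```python
import math
from collections import Counter


def character_entropy(texts, window=100):
    """Shannon entropy (base 2, assumed) of the character distribution of recent posts."""
    counts = Counter(ch for t in texts[-window:] for ch in t if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p is the share of one distinct character among all characters.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Template-generated posts that reuse the same few characters give a low entropy, while posts on varied topics give a higher one.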

(6) Forwarded-and-commented rate

Microblogs published by junk users are mostly useless junk information and are seldom forwarded or commented on by other users. According to actual needs, among the 100 original microblogs most recently published by the user, the proportion whose combined number of forwards and comments exceeds a set threshold can be counted and defined as the forwarded-and-commented rate of the user.

Identifying junk users is a binary classification problem and can be carried out with a supervised learning algorithm. In the embodiment of the invention, a Support Vector Machine (SVM), which performs well on small sample data sets, can be selected as the classifier. The SVM aims to construct an optimal hyperplane that minimizes the classification error: the input space is mapped to a high-dimensional space through the nonlinear transformation of a kernel function, and the optimal separating hyperplane is then sought in the new space. Since the generalization performance of the SVM depends on the choice of kernel function parameters and the error penalty factor, appropriate values must be selected first to obtain good classification performance. Three kinds of kernel function are in common use: polynomial functions, Radial Basis Functions (RBFs) and sigmoid functions. The embodiment uses an SVM with a radial basis function kernel, which can model classification boundaries of arbitrary complexity; the open-source SVM implementation used is libSVM.

Microblog sources fall into two categories: the microblog publisher and the application used to publish the microblog. The junk user database and the application source blacklist constructed offline enable fast source-based filtering of junk microblogs. The filtering process is as follows: for each post in the microblog data stream, the user ID and the application source are extracted. The junk user database is searched for the user ID; if it is present, the microblog is directly judged to be a junk microblog. If the user ID is not in the junk user database, the microblog is filtered according to the application that published it: if the application source is on the application source blacklist, the microblog is considered a junk microblog; otherwise it is passed on to subsequent processing.
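The source-based filter amounts to two membership checks. In the sketch below, the `user_id` and `source` field names and the use of in-memory sets in place of a database are assumptions made for illustration:

```python
def is_junk_by_source(post, junk_user_ids, source_blacklist):
    """Return True when the post's author or publishing application is blocked."""
    # Step 1: look the author up in the junk user database.
    if post["user_id"] in junk_user_ids:
        return True
    # Step 2: otherwise check the publishing application against the blacklist.
    return post["source"] in source_blacklist


def filter_stream(posts, junk_user_ids, source_blacklist):
    """Yield only the posts that pass the source-based filter."""
    for post in posts:
        if not is_junk_by_source(post, junk_user_ids, source_blacklist):
            yield post
```

Because both checks are constant-time set lookups, this stage can run first in the online pipeline at little cost.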

In another embodiment of the present invention, referring to FIG. 2, in the online identification stage the media data (microblog data) stream is filtered in real time as follows:

S101, filtering media data through the junk user database and the application source blacklist;

S102, performing secondary classification on the media data by using the media content and the context features, and filtering non-event media data;

S103, performing clustering analysis on media data with similar topics, extracting cluster features, and identifying event clusters, wherein the cluster features at least comprise cluster time and cluster topic;

S104, based on the topic consistency principle, cleaning the media data in the event cluster and purifying the media data.

In the embodiment of the invention, content-based microblog filtering can be realized by a supervised machine learning method with an SVM as the classifier, and the process can be designed as follows:

(1) preprocessing the post: word segmentation and stop-word removal;

(2) feature extraction:

Number of hyperlinks: advertising junk microblogs often contain one or more short links pointing to web pages such as company promotions, commercial advertisements or adult advertisements, whereas ordinary microblogs contain few links.

Whether the author is a V-verified user: V-verified users have been authenticated by the microblog system, so they are highly credible and generally do not publish junk microblogs at will; junk users generally have difficulty obtaining this authentication.

Number of junk words: junk microblogs often contain many junk words related to advertisements or pornography, for example "big discounts", "free shipping site-wide" and "buy more, get more". A junk vocabulary list is established, and the number of occurrences of junk words in the microblog is counted.

Post length: junk microblogs differ from ordinary non-junk microblogs in length. They are often long in order to carry more information, or else very short and carry little information, whereas ordinary microblogs are of moderate length.

Number of emotion words: the microblog is a platform where many users exchange information and express emotion, so ordinary microblogs often contain words and emoticons expressing the author's feelings and opinions, such as "give strength" (awesome), "happy", "blessing", [tears], [feeling], etc. In the embodiment of the invention, an emotion vocabulary list comprising a set number of emotion words and a set number of emoticons can be constructed, and the number of occurrences of these emotion words and emoticons in the microblog is then counted.

Whether the number of hashtags exceeds one: microblogs use hashtags to denote topics. To extend their reach, some junk microblogs contain the hashtags of one or more hot topics, whereas ordinary microblogs generally contain no more than one hashtag.

Number of @ mentions: ordinary microblogs contain a limited number of @ mentions, whereas junk microblogs often include several @ mentions in the post to reach multiple users and widen their spread.

Whether the microblog begins with an @ mention of another user: a normally forwarded microblog automatically carries an @ identifier (//@) at its head.

Whether the microblog contains an @ mention of another user in the middle: inserting "@username" in the middle of a microblog actively sends it to that user.

Number of named entities: named entities such as times, places and people appear in the posts of event microblogs, whereas they are unlikely to appear in junk microblogs. The microblog is preprocessed by removing hashtags, links and mentions, word segmentation and part-of-speech tagging are then performed, and the ratio of named entities to the length of the whole microblog is calculated.

Number of special punctuation marks: punctuation in ordinary microblogs generally consists of commas, periods, semicolons, question marks and exclamation marks. To bypass the filtering mechanism of the microblog system, special punctuation marks such as "&" are inserted into junk keywords that would otherwise be easily recognized. Such marks rarely appear in ordinary microblogs, so the number of special punctuation marks in the post is counted as an important feature of junk microblogs.

Whether it is a forwarded microblog: junk microblogs are rarely forwarded by other users, whereas ordinary microblogs are more likely to be forwarded, so whether the microblog is a forward is taken as an important feature for distinguishing junk microblogs from ordinary ones.

Forward-to-comment ratio: the numbers of forwards and comments of a microblog depend directly on when it is collected and are not suitable as standalone features, so the ratio of the number of forwards to the number of comments is used to distinguish junk microblogs from ordinary ones.
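Several of the per-post features listed above can be extracted with simple pattern matching. The sketch below covers only the link, junk-word, length, emotion-word, hashtag and @ mention features. The named-entity, special-punctuation and forwarding features are omitted because they need external tools or metadata. The regular expressions and the word lists passed in are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")      # assumed URL pattern
HASHTAG_RE = re.compile(r"#[^#\s]+#")     # assumed Weibo-style #topic# format
MENTION_RE = re.compile(r"@[\w\-]+")      # assumed username character set


def _is_head(text, pos):
    """True when nothing but whitespace or the forward marker '//' precedes pos."""
    return text[:pos].strip() in ("", "//")


def post_features(text, junk_words, emotion_words, is_verified=False):
    """Extract a subset of the content features for one post."""
    mentions = list(MENTION_RE.finditer(text))
    return {
        "n_links": len(URL_RE.findall(text)),
        "is_verified": is_verified,
        "n_junk_words": sum(text.count(w) for w in junk_words),
        "length": len(text),
        "n_emotion_words": sum(text.count(w) for w in emotion_words),
        "multi_hashtag": len(HASHTAG_RE.findall(text)) > 1,
        "n_mentions": len(mentions),
        "head_mention": any(_is_head(text, m.start()) for m in mentions),
        "mid_mention": any(not _is_head(text, m.start()) for m in mentions),
    }
```

The resulting dictionary can be flattened into a numeric vector (booleans as 0 or 1) before being passed to the classifier.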

When identifying junk microblogs with the SVM, 1000 microblogs can be randomly selected from the microblog data stream and manually labeled as junk or not, yielding a set of junk microblogs and a set of ordinary non-junk microblogs. The data set of 1000 microblogs is divided into 10 parts, and training and testing are performed using 10-fold cross-validation.
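The 10-fold protocol can be sketched independently of the classifier. The helper below only produces the index splits; the shuffling seed and the function name are assumptions, and training the SVM on each split is left to whichever SVM library is used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test
```

For 1000 labeled microblogs this gives ten rounds, each training on 900 samples and testing on the remaining 100, so every sample is tested exactly once.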

Online clustering is an efficient unsupervised machine learning technique that is widely applied in microblog event detection. In the embodiment of the invention, clustering analysis is performed on media data with similar topics: after online clustering of the microblog data stream, a number of microblog clusters are obtained, in which microblogs within a cluster are highly similar and microblogs in different clusters are less similar. Three types of cluster features are extracted and a recognition model is trained with a supervised machine learning algorithm to filter out non-event microblog clusters. The three types of cluster features are as follows:

(1) Topic features

Topic features describe how consistent the topics within a microblog cluster are. It is generally assumed that the topics in an event cluster are relatively concentrated, whereas those in a non-event cluster are relatively dispersed. The co-occurring vocabulary also differs: microblogs in an event cluster often share many co-occurring words that describe a common topic from different angles, whereas a non-event cluster contains fewer co-occurring words, and these point to different topics such as sleeping or working. According to the embodiment of the invention, the topic consistency of a microblog cluster can be represented by the mean and variance of the cosine similarity between each microblog and the cluster center; in addition, the occurrence frequency of each word in the cluster is counted, and the percentage of microblogs containing the set number of highest-frequency words is calculated.
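A minimal sketch of the topic features, assuming term-count vectors, a cluster center equal to the summed counts of all posts, and "containing the highest-frequency words" read as containing at least one of the top `top_k` words (all of these are assumptions made for illustration):

```python
import math
from collections import Counter
from statistics import mean, pvariance


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def topic_features(posts, top_k=3):
    """Mean/variance of post-to-center similarity plus top-word coverage."""
    if not posts:
        raise ValueError("cluster must contain at least one post")
    vecs = [Counter(p) for p in posts]
    center = Counter()
    for v in vecs:
        center.update(v)
    sims = [cosine(v, center) for v in vecs]
    top_words = {w for w, _ in center.most_common(top_k)}
    coverage = sum(1 for p in posts if top_words & set(p)) / len(posts)
    return {"mean_sim": mean(sims), "var_sim": pvariance(sims),
            "top_word_coverage": coverage}
```

A concentrated event cluster should show a high mean similarity with low variance, while a dispersed non-event cluster should show the opposite.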

(2) Social features

Social behaviors among microblog users include forwarding, commenting, replying, and mentioning. Forwarding means that a user republishes, in the user's own microblog space, the original content of a microblog published by its original author, and the user can add a comment on the original microblog while forwarding; a reply is a user's separate response to a comment made by another user; a mention is a microblog directed at a specified user by means of the @ symbol. These four social behaviors are distributed differently in event microblogs and non-event microblogs. For example, microblogs published by "big V" (verified, high-follower) accounts are often forwarded in large quantities, but these microblogs may all be personal status updates that contain no event information. In the embodiment of the invention, the social features of each microblog cluster can be obtained by counting the proportions of forwarding, commenting, replying, and mentioning microblogs in that cluster.
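As a minimal sketch of the social features, the proportions can be computed by simple counting. The record layout is an assumption made for illustration: each microblog is a dict with a hypothetical `kind` field whose value is one of the four behaviors or something else (for example an original post).

```python
from collections import Counter

# The four social behaviors named in the description.
BEHAVIORS = ("forward", "comment", "reply", "mention")


def social_features(posts):
    """Return the share of each social behavior within one cluster.

    posts: list of dicts with a hypothetical "kind" field.
    """
    counts = Counter(p.get("kind") for p in posts)
    n = len(posts)
    return {b: (counts[b] / n if n else 0.0) for b in BEHAVIORS}
```

The four proportions form a fixed-length vector that can be concatenated with the other cluster features before training the recognition model.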

(3) Temporal features

The vocabulary of different types of microblog clusters exhibits different temporal properties over the life cycle of a cluster (from the earliest to the latest publication time of the microblogs in the cluster). High-frequency words in event microblog clusters often show a 'burst' characteristic, whereas high-frequency words in some non-event microblog clusters show a 'periodic' characteristic. The occurrence frequency of the high-frequency words in each microblog cluster over the past 72 hours is counted to generate a frequency histogram ordered by time, and two types of temporal features of the cluster are calculated from it. The first type is the expected deviation of high-frequency words: first, for each high-frequency word in the cluster, the difference between its occurrence frequency at the current moment and its expected frequency is computed and divided by the number of microblogs per hour in the cluster, where the expected frequency is the average of the word's frequency over the past 72 hours; weights are then assigned to the high-frequency words according to their frequency in the cluster; finally, the weighted expected deviation of high-frequency words for the cluster is obtained. The second type is the degree of fit between the histogram distribution of the high-frequency words and an exponential function, based on the observation that hot words in social networks follow an exponential distribution; in the embodiment of the invention, the least squares method can be used to fit an exponential function to the distribution histogram of the high-frequency words, and the R² statistic is calculated to measure the degree of fit.
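The two temporal features can be sketched as follows, under assumptions that the description does not fix: counts are binned hourly with the last bin taken as the current moment, each word's weight is its share of all high-frequency word occurrences, and the exponential fit is done by ordinary least squares on log-transformed counts with add-one smoothing to avoid log(0). The function names are hypothetical.

```python
import math


def expected_deviation(hourly_counts, posts_per_hour):
    """Weighted expected deviation of high-frequency words in one cluster.

    hourly_counts: {word: [count per hour, oldest first]}; the last entry
        is treated as the current moment, earlier entries as the history.
    posts_per_hour: average number of microblogs per hour in the cluster.
    """
    totals = {w: sum(c) for w, c in hourly_counts.items()}
    grand_total = sum(totals.values())
    if not grand_total or not posts_per_hour:
        return 0.0
    score = 0.0
    for word, counts in hourly_counts.items():
        history = counts[:-1]
        expected = sum(history) / len(history) if history else 0.0
        deviation = (counts[-1] - expected) / posts_per_hour
        # Assumed weighting: the word's share of all occurrences.
        score += (totals[word] / grand_total) * deviation
    return score


def exponential_r2(counts):
    """R^2 of a least-squares exponential fit to a frequency histogram.

    Fits log(count + 1) = a + b * t, i.e. an exponential in t, and
    reports R^2 in log space (an assumed simplification).
    """
    n = len(counts)
    if n < 2:
        return 0.0
    ts = list(range(n))
    ys = [math.log(c + 1) for c in counts]
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0
```

A bursty histogram such as `[1, 3, 7, 15, 31, 63]` fits the exponential almost perfectly, whereas an alternating, periodic histogram fits poorly, which is the contrast the two features are meant to capture.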

Media data in the event cluster are cleaned based on the principle of topic consistency. Most microblogs in an event microblog cluster are closely related to the event described by the cluster, but the cluster often also contains some messages that are irrelevant or only pseudo-relevant to the event topic, and these need to be cleaned out to improve the quality of event detection. In the embodiment of the invention, the purification of media data (microblog data) can be realized as follows: first, word segmentation and stop-word removal are performed on the microblogs, and, according to the word frequency in the cluster, words whose frequency exceeds a given threshold are selected as the cluster centroid; then, word weights are calculated according to the TF-IDF (term frequency-inverse document frequency) of the blog posts, and the weights of the centroid words appearing in a single microblog are summed to obtain the similarity between that microblog and the cluster centroid; finally, microblogs whose similarity is lower than a specified threshold are removed from the cluster.
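The purification step can be sketched as below. This is an illustration under assumptions, not the patented implementation: microblogs are assumed to be already segmented with stop words removed, the inverse document frequency is computed within the cluster using a smoothed form, `log((1 + N) / (1 + df)) + 1`, so that words present in every microblog keep a positive weight, and the names `purify_cluster`, `min_freq`, and `min_sim` are hypothetical.

```python
import math
from collections import Counter


def purify_cluster(docs, min_freq=2, min_sim=0.5):
    """Remove microblogs that are weakly related to the cluster centroid.

    docs: list of token lists (segmented, stop words already removed).
    min_freq: words occurring more than this many times form the centroid.
    min_sim: microblogs scoring below this similarity are dropped.
    """
    n = len(docs)
    cluster_freq = Counter(t for d in docs for t in d)
    centroid = {t for t, c in cluster_freq.items() if c > min_freq}

    # Document frequency and smoothed IDF within the cluster (assumption).
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}

    kept = []
    for d in docs:
        if not d:
            continue  # an empty microblog has no similarity to anything
        tf = Counter(d)
        # Similarity: summed TF-IDF weight of centroid words in this microblog.
        sim = sum((tf[t] / len(d)) * idf[t] for t in centroid if t in tf)
        if sim >= min_sim:
            kept.append(d)
    return kept
```

Both thresholds are tuning parameters; the description says only that they are given, so the defaults here are placeholders.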

Based on the foregoing method, an embodiment of the present invention further provides an event detection-oriented multi-policy media data stream filtering apparatus, as shown in fig. 3, including: an off-line training module, a filtering module, a clustering module and a purifying module, wherein,

Referring to fig. 4, after the junk user database and the application source blacklist are established in the offline stage, the online stage proceeds as follows: first, microblogs published or forwarded by junk users are filtered according to the user and application source of each microblog, a blog post being filtered out directly if its author exists in the junk user database or it comes from a blacklisted application, and otherwise passed to the next step; non-event microblogs are then filtered based on microblog content information; the microblogs that pass these two filtering stages are clustered online to form microblog clusters; non-event clusters are filtered according to the topic, social, and temporal features of the microblog clusters; and low-quality microblogs in the event clusters are cleaned based on the topic consistency principle. To address the problem that traditional microblog event tracking ignores the semantic information of blog posts and therefore tracks poorly, a microblog event tracking method based on wiki knowledge is provided that takes the characteristics of microblog text into account. Feature vectors representing events and blog posts are mapped from the word space to the Wikipedia entity space. On the one hand, replacing the feature words of an event (or blog post) with wiki entities expands the feature word list; on the other hand, the mapping process also eliminates the influence of synonyms and polysemous words. Compared with the traditional method, this method makes full use of the semantic information in wiki knowledge, and its performance is therefore superior to that of the comparison method.
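The first online stage in the flow of fig. 4, filtering by user and application source, amounts to two membership checks per microblog. A minimal sketch follows; the field names `user_id` and `source_app` and both function names are hypothetical, and the junk user database and blacklist are represented as in-memory sets purely for illustration.

```python
def passes_source_filter(post, junk_users, app_blacklist):
    """True if the post's author is not a junk user and its source is not blacklisted."""
    return (post["user_id"] not in junk_users
            and post["source_app"] not in app_blacklist)


def filter_stream(stream, junk_users, app_blacklist):
    """Lazily yield only the posts that survive the user and source filter."""
    for post in stream:
        if passes_source_filter(post, junk_users, app_blacklist):
            yield post
```

Because `filter_stream` is a generator, the later stages (content filtering, online clustering, cluster filtering, purification) can consume the surviving microblogs one at a time as the stream arrives.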

While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-strategy media data stream filtering method oriented to event detection is characterized by comprising the following contents:

in the online identification stage, aiming at media data flow, media data filtering is carried out through a junk user database and an application source blacklist list, non-event media data is filtered through media content and context characteristics, online clustering is carried out on the media data, event clusters are identified, and the media data in the event clusters are purified;

in the online identification stage, for a media data stream, firstly, media data filtering is carried out through a junk user database and an application source blacklist; secondly, performing secondary classification on the media data by using the media content and the context characteristics, and filtering non-event media data; performing cluster analysis on media data with similar topics, extracting cluster characteristics, and identifying event clusters, wherein the extracted cluster characteristics at least comprise cluster extraction time and cluster topics; based on the principle of consistent themes, cleaning the media data in the event cluster and purifying the media data;

filtering non-event media data, comprising the following: firstly, carrying out online clustering processing on a media data stream through unsupervised machine learning to obtain a media cluster, and extracting cluster characteristics; then, model training is carried out by using supervised machine learning, and non-event media clusters are filtered through the trained models;

the cluster features comprise topic features, social features and temporal features, wherein the topic features are obtained from the mean and variance of the cosine similarity between the media data and the cluster center; the social features are obtained by counting the proportion of forwarding, commenting, replying and mentioning in each media cluster; the temporal features are obtained by counting the occurrence frequency of high-frequency words in the media cluster and ordering the frequencies by time to generate a frequency histogram;

the temporal features include the following two types of features: 1) the expected deviation of high-frequency words, wherein the difference between the occurrence frequency at the current moment and the expected frequency of each high-frequency word in the cluster is calculated and divided by the quantity of media data per hour in the cluster, the expected frequency being calculated as the average of the frequency over the historical time period, and weights are assigned to the high-frequency words according to their frequency information in the cluster to obtain the weighted expected deviation of high-frequency words for the cluster; 2) the degree of fit between the histogram distribution of the high-frequency words and an exponential function, wherein, based on the characteristic that hot words in the social network follow an exponential distribution, the least squares method is used to fit the exponential distribution function corresponding to the distribution histogram of the high-frequency words, and the R² statistic is calculated to measure the degree of fit.

2. The event detection-oriented multi-strategy media data stream filtering method as claimed in claim 1, wherein in the off-line stage, media data filtering based on users and sources is adopted: the personal social relationships and published media data of users are collected, user behavior features and media data content features are extracted, junk users are identified through supervised machine learning, and a junk user database and an application source blacklist are constructed off-line; and it is judged whether a media user in the media data stream exists in the junk user database, or whether the media data come from the application source blacklist, and if so the media data are directly filtered.

3. The event detection-oriented multi-strategy media data stream filtering method as claimed in claim 2, wherein the user behavior characteristics include user reputation, forwarding rate and liveness, the user reputation is obtained according to the number of user fans, the number of objects of interest to the user, the number of fan users and the number of objects of interest to the fan users, the forwarding rate is obtained according to the proportion of the forwarded media data in the plurality of pieces of media data published by the user, and the liveness is obtained according to the number of days spanned by the media data published by the user and the number of user registration days.

4. The event detection-oriented multi-strategy media data stream filtering method as claimed in claim 2, wherein the media data content features comprise short link features, label features, blog length features, blog repetition features, blog character diversity features and forwarded comment rates; the short link features are obtained according to the proportion of media data containing a URL among a plurality of media data published by the user; the label features are obtained according to the proportion of media data containing hot topic labels among a plurality of media data published by the user; the blog length features are obtained according to the average length and length variance of a plurality of media data published by the user; the blog repetition features are obtained according to the average value of the cosine similarity between a plurality of media data published by the user; the blog character diversity features are obtained by counting a plurality of media data published by the user, according to the number of non-repeated characters, the number of occurrences of each non-repeated character and the total character ratio of the media data; and the forwarded comment rates are obtained by counting the proportion of the sum of the forward and comment numbers in a plurality of media data published by the user.

5. The event detection-oriented multi-strategy media data stream filtering method as claimed in claim 1, wherein purifying the media data comprises the following: performing word segmentation and stop-word removal on the media data, and selecting, according to the word frequency in the cluster, words whose frequency is larger than a given threshold as the cluster centroid; calculating word weights according to the term frequency-inverse document frequency of the blog posts, and summing the weights of the centroid words appearing in a single piece of media data to obtain the similarity between that media data and the cluster centroid; and removing from the cluster the media data whose similarity is below a specified threshold.

6. An event detection-oriented multi-policy media data stream filtering device, which is implemented based on the event detection-oriented multi-policy media data stream filtering method of claim 1, and comprises: an off-line training module, a filtering module, a clustering module and a purifying module, wherein,