CN107273496B

CN107273496B - A detection method for regional emergencies in Weibo network

Info

Publication number: CN107273496B
Application number: CN201710455550.6A
Authority: CN
Inventors: 仲兆满; 管燕; 李存华
Original assignee: Jiangsu Ocean University
Current assignee: Jiangsu Jinge Network Technology Co ltd; Jiangsu Ocean University
Priority date: 2017-06-15
Filing date: 2017-06-15
Publication date: 2020-07-28
Anticipated expiration: 2037-06-15
Also published as: CN107273496A

Abstract

The invention discloses a method for detecting regional emergencies in a microblog network, comprising the steps of: (1) collecting regional microblogs from the microblog network to obtain a microblog set PLMB, and preprocessing it to obtain a microblog set LMB; (2) extracting burst words from the microblog set LMB to obtain a burst word set EW; and (3) clustering the burst words in EW to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters. The method computes the burst value of a word from four types of indicators (word frequency, word-associated users, word distribution regions, and word social behaviors), which makes more reasonable use of the burst characteristics of microblog words and is better suited to detecting regional emergencies in a microblog network.

Description

A detection method for regional emergencies in Weibo network

Technical Field

The invention relates to information mining technology, and in particular to a method for detecting regional emergencies in a microblog network.

Background Art

As a real-time, highly interactive social medium, Weibo gives users a platform for freely publishing content and exchanging information, and it has become the medium of choice for breaking news, voicing opinions, and sharing experiences. Many real-world events are first revealed on Weibo and only later reported by traditional mainstream media, for example the 2013 Boston bombing and the death of Margaret Thatcher. Event detection for microblogs has recently become a research hotspot in the field of event detection.

Because much microblog content carries regional information, including the locations mentioned in a post, the registered location of the posting user, and the geotags attached to a post, localized event detection for microblogs has become an emerging research direction. This kind of detection rests on a basic assumption: when no event is occurring in a region, users rarely discuss such events, but once one occurs there is a large volume of discussion, for example about a local fire, explosion, flood, traffic accident, pollution incident, or disease outbreak. This differs greatly from global event detection in social media. Global event detection ignores regional characteristics and works on the entire information stream of the medium, which not only requires a large amount of analysis but may also overlook hot events in a local region. Existing event detection methods are therefore difficult to apply directly to regional event detection.

In the proceedings of the 19th International World Wide Web Conference, published in the United States in 2010, the paper "Earthquake shakes Twitter users: real-time event detection by social sensors" by Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo models each Twitter user as a node in a wireless sensor network. A user publishing an earthquake-related post is abstracted as a node publishing the information it has collected, and temporal and spatial models of the posts, followed by filtering, are then used to confirm whether an earthquake has occurred. However, this method requires some query inputs to be designed manually, so it is difficult to apply to the detection of unconventional emergencies.

In the journal Modern Library and Information Technology (现代图书情报技术), published in China in 2016, the paper "Detection and Analysis of Weibo Events Based on Geographic Coordinates" by Li Jinhua (李进华) and An Zhongjie (安仲杰) builds microblog features from five indicators: the number of posts, the number of reposts, the number of comments, user activity, and movement intensity. When detecting microblog emergencies, this method does not consider the characteristics of microblog-style social media comprehensively (it omits, for example, the frequency of burst words and regional burstiness), and it gives no concrete calculation method (such as formal formulas) for the indicators.

In the proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, published in the United States in 2016, the paper "GeoBurst: Real-Time Local Event Detection in Geo-Tagged Tweet Streams" by Zhang Chao, Zhou Guangyu, Yuan Quan, Zhuang Honglei, Zheng Yu, Kaplan Lance, Wang Shaowen, and Han Jiawei first identifies some important microblogs within the query window as pivots, and then obtains emergencies by comparing them with historical data in the spatial and temporal dimensions. This method starts from the text of individual microblogs. Because microblogs are short and their language is informal, it is difficult to extract effective features directly from short single posts.

Summary of the Invention

The technical problem to be solved by the invention is to address the deficiencies of the prior art by providing a new method for detecting regional emergencies in a microblog network. The method makes more reasonable use of the burst characteristics of microblog words and is better suited to detecting regional emergencies in a microblog network.

This technical problem is solved by the following technical solution. The invention provides a method for detecting regional emergencies in a microblog network, whose specific steps are as follows:

A. Collect regional microblogs from the microblog network to obtain the microblog set PLMB, and preprocess the microblogs to obtain the microblog set LMB;

B. Extract burst words from the microblog set LMB to obtain the burst word set EW;

C. Cluster the burst words in EW to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters.

In step A, collecting regional microblogs from the microblog network and preprocessing them to obtain the microblog set LMB preferably uses the following specific steps:

A1. Use a collection tool to obtain the set of localized microblogs PLMB = {plmb₁, plmb₂, …, plmb_m}, where plmb_i (1 ≤ i ≤ m) is an individual regional microblog and m is the number of regional microblogs;

A2. Preprocess the microblog set PLMB by removing link URLs and emoticons from the microblogs and discarding microblogs shorter than 5 characters, to obtain the preprocessed microblog set LMB = {lmb₁, lmb₂, …, lmb_n}, where lmb_i (1 ≤ i ≤ n) is an individual regional microblog.
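The preprocessing of step A2 can be illustrated with a minimal Python sketch. The regular expressions are assumptions made for illustration: emoticons are assumed to appear either as bracketed codes such as `[泪]` or as Unicode emoji in a few common ranges, and length is counted in characters. The names `clean_text` and `preprocess` are hypothetical and do not come from the patent.

```python
import re

# Assumption: links are plain http(s) URLs made of ASCII characters.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%#:~_+-]+")
# Assumption: platform emoticons are short bracketed codes such as [泪] or [doge].
BRACKET_EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")
# Assumption: Unicode emoji fall in these ranges (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

MIN_LENGTH = 5  # step A2: discard microblogs shorter than 5 characters


def clean_text(text: str) -> str:
    """Remove link URLs and emoticons, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = BRACKET_EMOTICON_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split())


def preprocess(plmb: list[str]) -> list[str]:
    """Map the raw set PLMB to the cleaned set LMB."""
    cleaned = (clean_text(post) for post in plmb)
    return [post for post in cleaned if len(post) >= MIN_LENGTH]
```

Note that the bracket pattern would also strip legitimate short bracketed text, so a real system would more likely match against a known emoticon list.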

In step B, extracting burst words from the microblog set LMB to obtain the burst word set EW preferably uses the following specific steps:

B1. Segment each microblog lmb_i (1 ≤ i ≤ n) in LMB into words, where n is the number of microblogs; remove stop words and keep nouns, verbs, place names, person names, and proper nouns, to obtain the final word set LMBW = {w₁, w₂, …, w_r}, assuming there are r words;

B2. Calculate the frequency burstiness of word w_i (1 ≤ i ≤ r). Assuming the current emergency detection time point is k, and taking the historical data of the previous p time points as a reference, the frequency burstiness F(w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the frequency with which word w_i occurs at time point k; the description of the denominator term is truncated in this text.
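Since the formula itself is not reproduced here, the following is only a minimal sketch under an explicit assumption: the denominator is taken to be the mean of the word's frequency over the previous p time points, plus a hypothetical smoothing constant that avoids division by zero. The names `word_frequency` and `burst_ratio` are illustrative and do not come from the patent.

```python
def word_frequency(tokenized_posts: list[list[str]], word: str) -> int:
    """Count occurrences of `word` across the posts of one time point."""
    return sum(tokens.count(word) for tokens in tokenized_posts)


def burst_ratio(current: float, history: list[float], smoothing: float = 1.0) -> float:
    """Ratio of the value at time point k to a baseline built from history.

    Assumption (not taken from the patent text): the baseline is the mean
    over the p previous time points, and `smoothing` is added so that words
    with no history do not cause a division by zero.
    """
    if not history:
        raise ValueError("history must cover at least one previous time point")
    baseline = sum(history) / len(history)
    return current / (baseline + smoothing)


# A word seen about twice a day over the last week that suddenly appears 30 times.
today = word_frequency([["fire", "smoke"], ["fire"]] * 15, "fire")
score = burst_ratio(today, [2, 4, 3, 1, 0, 2, 2])
```

Under these assumptions a value well above 1 indicates that the word occurs far more often at time point k than in the reference period.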

B3. Calculate the associated-user burstiness of word w_i (1 ≤ i ≤ r). Assuming the current emergency detection time point is k, and taking the historical data of the previous p time points as a reference, the associated-user burstiness U(u|w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the number of distinct users who mention word w_i at time point k; the description of the denominator term is truncated in this text.

B4. Calculate the regional burstiness of word w_i (1 ≤ i ≤ r). The regional distribution burstiness GT(gt|w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the number of distinct geotags attached to mentions of word w_i at time point k; the description of the denominator term is truncated in this text.

B5. Calculate the social-behavior burstiness of word w_i (1 ≤ i ≤ r). The social-behavior burstiness SB(sb|w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the sum of the numbers of reposts, comments, and reads of the microblogs that mention word w_i at time point k; the description of the denominator term is truncated in this text.
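The numerators described in steps B3 to B5 are simple aggregates over the microblogs that mention a word within one time point. The sketch below shows one way to compute them. The `Post` record and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one preprocessed, segmented microblog."""
    user_id: str
    geotag: str
    tokens: tuple[str, ...]
    reposts: int = 0
    comments: int = 0
    reads: int = 0


def raw_indicators(posts: list[Post], word: str) -> dict[str, int]:
    """Numerators of steps B3-B5 for `word` over the posts of one time point."""
    mentioning = [p for p in posts if word in p.tokens]
    return {
        # B3: number of distinct users who mention the word
        "users": len({p.user_id for p in mentioning}),
        # B4: number of distinct geotags attached to those mentions
        "geotags": len({p.geotag for p in mentioning}),
        # B5: sum of reposts, comments, and reads of the mentioning posts
        "social": sum(p.reposts + p.comments + p.reads for p in mentioning),
    }
```

Each count would then presumably be compared against the corresponding counts from the reference time points to obtain the burstiness value.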

B6. Combine the four burstiness values from steps B2, B3, B4, and B5 to obtain the burst value of word w_i at time point k: BurstyScore(w_i) = α*F(w_i) + β*U(u|w_i) + χ*GT(gt|w_i) + δ*SB(sb|w_i), where α, β, χ, and δ are adjustment coefficients that set the weights of the four types of indicators, with α + β + χ + δ = 1, α ≥ 0, β ≥ 0, χ ≥ 0, δ ≥ 0;
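The weighted combination of step B6 is straightforward to express in code. In this sketch the function name `bursty_score` is hypothetical, and the default weights of 0.25 each are the values reported for the comparative experiment in this document, not a requirement of the method.

```python
def bursty_score(
    f: float,
    u: float,
    gt: float,
    sb: float,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """BurstyScore(w_i) = alpha*F + beta*U + chi*GT + delta*SB."""
    alpha, beta, chi, delta = weights
    # Enforce the constraints of step B6: non-negative weights that sum to 1.
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * f + beta * u + chi * gt + delta * sb
```

Setting a weight to 0 switches the corresponding indicator off, which is a simple way to study how much each indicator contributes.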

B7. After the burst value of each word has been calculated, use the interquartile range to select n burst words, which form the burst word set EW. The interquartile range is calculated as IQS(EW) = Q₃(EW) - Q₁(EW). A word is treated as a burst word when its burst value exceeds a threshold, which is calculated as maxima(EW) = Q₃(EW) + 1.5 × IQS(EW).
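The threshold rule of step B7 can be sketched with the Python standard library. Two points are assumptions: the quartiles are taken over the burst values of all candidate words, and they are computed with the "inclusive" interpolation method of `statistics.quantiles`, since the text does not say how Q₁ and Q₃ are estimated. The function names are hypothetical.

```python
import statistics


def iqr_threshold(values: list[float]) -> float:
    """Return Q3 + 1.5 * (Q3 - Q1) for the given burst values."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqs = q3 - q1  # interquartile range, written IQS in the text
    return q3 + 1.5 * iqs


def select_burst_words(scores: dict[str, float]) -> set[str]:
    """Keep the words whose burst value exceeds the threshold."""
    threshold = iqr_threshold(list(scores.values()))
    return {word for word, score in scores.items() if score > threshold}
```

This is the usual upper-fence outlier rule, so only words whose burst value is unusually high relative to the other candidates are kept.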

In step C, clustering the burst words in EW to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q} preferably uses the following specific steps:

C1. Based on the burst word set EW obtained in step B, construct the burst word association network EWN = (V, E), where V is the burst word set EW and E represents the association strength between burst words. The association strength of burst words ew_i and ew_j is the number of times the two words co-occur in the same microblog post;
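The association network of step C1 amounts to counting, for every pair of burst words, the posts in which both appear. A minimal sketch follows. The edge representation (a `Counter` keyed by alphabetically ordered word pairs) and the name `build_ewn` are choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_ewn(tokenized_posts: list[list[str]], burst_words: set[str]) -> Counter:
    """Edge weights of EWN: number of posts in which two burst words co-occur."""
    edges: Counter = Counter()
    for tokens in tokenized_posts:
        # Each word is counted once per post; sorting gives a canonical pair order.
        present = sorted(set(tokens) & burst_words)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges
```

Pairs that never co-occur simply have no entry, which `Counter` reports as a weight of 0.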

C2. After the burst word association network EWN has been constructed, cluster EWN with the open-source CLUTO toolkit to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters.

Compared with the prior art, the invention proposes a comprehensive set of indicators for event detection based on the characteristics of the microblog network. It calculates the burst value of a word from four types of indicators (word frequency, word-associated users, word distribution regions, and word social behaviors), which makes more reasonable use of the burst characteristics of microblog words and is better suited to detecting regional emergencies in a microblog network. It also gives concrete calculation methods, which gives it considerable practical value.

Description of the Drawings

Fig. 1 is a flow chart of the method for detecting regional emergencies in a microblog network according to the invention;

Fig. 2 is a flow chart of step 101 in Fig. 1, in which regional microblogs are collected from the microblog network to obtain the microblog set PLMB and preprocessed to obtain the microblog set LMB;

Fig. 3 is a flow chart of step 102 in Fig. 1, in which burst words are extracted from the microblog set LMB to obtain the burst word set EW;

Fig. 4 is a flow chart of step 103 in Fig. 1, in which the burst words in EW are clustered to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}.

Detailed Description

The implementation of the invention is described in further detail below with reference to the drawings and specific embodiments.

Referring to Fig. 1, a method for detecting regional emergencies in a microblog network comprises the following steps:

Step 101: Collect regional microblogs from the microblog network to obtain the microblog set PLMB, and preprocess the microblogs to obtain the microblog set LMB. Referring to Fig. 2, the specific steps are as follows:

Step 201: Use a collection tool to obtain the set of localized microblogs PLMB = {plmb₁, plmb₂, …, plmb_m}, where plmb_i (1 ≤ i ≤ m) is an individual regional microblog. After developer permission has been obtained from Weibo, the different interfaces of its API can be called to retrieve the stream of microblogs posted around a given location. Calling the location service interface returns the microblog content, number of reposts, number of comments, number of likes, user information, check-in location, and so on.

Step 202: Preprocess the microblog set PLMB by removing link URLs and emoticons from the microblogs and discarding microblogs shorter than 5 characters, to obtain the preprocessed microblog set LMB = {lmb₁, lmb₂, …, lmb_n}, where lmb_i (1 ≤ i ≤ n) is an individual regional microblog. Although the collected regional microblogs have already been selected in a targeted way from the mass of microblogs, they still contain some noise, which needs to be filtered out to reduce the complexity of later computation.

Step 102: Extract burst words from the microblog set LMB to obtain the burst word set EW. Referring to Fig. 3, the specific steps are as follows:

Step 301: Segment each microblog lmb_i (1 ≤ i ≤ n) in LMB into words, remove stop words, and keep nouns, verbs, place names, person names, and proper nouns, to obtain the final word set LMBW = {w₁, w₂, …, w_r}, assuming there are r words. Because some verbs carry no substantive meaning, such as "举行" (hold), "进行" (carry out), "开展" (launch), and "会" (will), these stop verbs are further removed;
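The filtering in step 301 can be sketched as a function over tokens that have already been segmented and part-of-speech tagged. The segmenter itself is not shown. The tag names are an assumption: they follow the ICTCLAS-style convention used by common Chinese segmenters (`n` noun, `v` verb, `ns` place name, `nr` person name, `nz` other proper noun), and the name `filter_tokens` is hypothetical.

```python
# Assumption: ICTCLAS-style part-of-speech tags.
KEPT_POS = {"n", "v", "ns", "nr", "nz"}
# Stop verbs given as examples in step 301.
STOP_VERBS = {"举行", "进行", "开展", "会"}


def filter_tokens(tagged_tokens: list[tuple[str, str]], stop_words: set[str]) -> list[str]:
    """Keep nouns, verbs, place names, person names, and proper nouns."""
    kept = []
    for word, pos in tagged_tokens:
        if word in stop_words:
            continue  # ordinary stop word
        if pos not in KEPT_POS:
            continue  # part of speech that step 301 does not keep
        if pos == "v" and word in STOP_VERBS:
            continue  # verb without substantive meaning
        kept.append(word)
    return kept
```

Applying this to every microblog in LMB and collecting the distinct results yields the candidate word set LMBW.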

Step 302: Calculate the frequency burstiness of word w_i (1 ≤ i ≤ r). Assuming the current emergency detection time point is k, and taking the historical data of the previous p time points as a reference, the frequency burstiness F(w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the frequency with which word w_i occurs at time point k; the description of the denominator term is truncated in this text.

The larger F(w_i) is, the more strongly the frequency of word w_i is rising at the current time point k, and the more likely w_i is a burst word;

Step 303: Calculate the associated-user burstiness of word w_i (1 ≤ i ≤ r). Assuming the current emergency detection time point is k, and taking the historical data of the previous p time points as a reference, the associated-user burstiness U(u|w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the number of distinct users who mention word w_i at time point k; the description of the denominator term is truncated in this text.

The larger U(u|w_i) is, the more strongly the number of users mentioning word w_i is rising at time point k, and the more likely w_i is a burst word;

Step 304: Calculate the regional burstiness of word w_i (1 ≤ i ≤ r). The regional distribution burstiness GT(w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the number of distinct geotags attached to mentions of word w_i at time point k; the description of the denominator term is truncated in this text.

The larger GT(w_i) is, the more strongly the number of geotags associated with word w_i is rising at time point k, and the more likely w_i is a burst word;

Step 305: Calculate the social-behavior burstiness of word w_i (1 ≤ i ≤ r). The social-behavior burstiness SB(w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where, as in step B5, the numerator is the sum of the numbers of reposts, comments, and reads of the microblogs that mention word w_i at time point k.

The larger SB(w_i) is, the more strongly the number of social behaviors associated with word w_i is rising at time point k, and the more likely w_i is a burst word;

Step 306: Combine the four burstiness values above to obtain the burst value of word w_i at time point k: BurstyScore(w_i) = α*F(w_i) + β*U(u|w_i) + χ*GT(gt|w_i) + δ*SB(sb|w_i), where α, β, χ, and δ are adjustment coefficients that set the weights of the four types of indicators, with α + β + χ + δ = 1, α ≥ 0, β ≥ 0, χ ≥ 0, δ ≥ 0. The larger BurstyScore(w_i) is, the greater the burstiness of word w_i at time point k, and the more likely w_i is a burst word;

Step 307: After the burst value of each word has been calculated, use the interquartile range to select n burst words, which form the burst word set EW. The interquartile range is calculated as IQS(EW) = Q₃(EW) - Q₁(EW). A word is treated as a burst word when its burst value exceeds a threshold, which is calculated as maxima(EW) = Q₃(EW) + 1.5 × IQS(EW).

Step 103: Cluster the burst words in EW to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}. Referring to Fig. 4, the specific steps are as follows:

Step 401: Based on the burst word set EW, construct the burst word association network EWN = (V, E), where V is the burst word set EW and E represents the association strength between burst words. The association strength of burst words ew_i and ew_j is the number of times the two words co-occur in the same microblog post;

Step 402: After the burst word association network EWN has been constructed, cluster EWN with the open-source CLUTO toolkit to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters. CLUTO provides three classes of clustering algorithms, which can cluster either directly in the feature space of the objects or in their similarity space. These algorithms are partitional, agglomerative, and graph-partitioning based. In practice, agglomerative hierarchical clustering is used most often, so the invention adopts the agglomerative hierarchical clustering method.
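To make the idea of agglomerative clustering over the association network concrete, here is a small pure-Python sketch. It is not CLUTO and does not reproduce CLUTO's algorithms: it is a plain average-link agglomerative procedure that uses raw co-occurrence counts as similarity, swapped in purely for illustration. The function name and the dictionary representation of the edges are hypothetical.

```python
def agglomerative_clusters(
    words: list[str], cooccur: dict[tuple[str, str], int], q: int
) -> list[list[str]]:
    """Greedily merge the two most strongly associated clusters until q remain."""
    if q < 1:
        raise ValueError("q must be at least 1")
    clusters = [[w] for w in words]  # start with one cluster per burst word

    def strength(a: str, b: str) -> int:
        # Edges may be stored in either order.
        return cooccur.get((a, b), 0) + cooccur.get((b, a), 0)

    def average_link(c1: list[str], c2: list[str]) -> float:
        total = sum(strength(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > q:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = max(pairs, key=lambda p: average_link(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters
```

With four burst words and q = 2, words that frequently co-occur end up in the same cluster, each cluster standing for one candidate emergency.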

Comparative example: Three different methods for detecting regional emergencies in a microblog network are used to compare the effectiveness of regional emergency detection. The three methods are as follows:

(1) Method 1, HBED: selects the hashtags contained in microblogs and represents each hashtag as a vector, with word weights calculated by TF-IDF; when calculating the popularity of a cluster, it considers the change in the number of microblogs the cluster contains.

(2) Method 2, GeoBurst: first identifies some important microblogs within the query window as pivots, then obtains emergencies by comparing them with historical data in the temporal and spatial dimensions. Emergencies are ranked by the temporal and spatial burstiness of the words in each word cluster. The four main parameters are set as follows: kernel bandwidth h = 0.01, restart probability α = 0.2, random-walk similarity threshold δ = 0.02, and the parameter balancing temporal and spatial burstiness η = 0.5.

(3) Method 3, LocTBED: the method proposed by the invention, whose main contribution is the proposed calculation of word burstiness. Clustering uses bagglo, the agglomerative clustering method provided by CLUTO, with the number of clusters set to 10 and the similarity function set to the cosine function Cos. When calculating word burst values, the historical reference period for a word is set to one week (7 days), and the adjustment parameters for combining the four types of indicators are α = β = χ = δ = 0.25.

The invention takes a real social medium, Sina Weibo, as an example and collects geotagged microblogs from two cities: Beijing and Lianyungang, Jiangsu Province. For Beijing, data were collected from December 1 to December 30, 2016 (one month of data), yielding 346,863 geotagged microblogs. For Lianyungang, data were collected from May 1 to October 31, 2016 (half a year of data), yielding 63,744 geotagged microblogs. The effectiveness of each event detection method is verified on a daily basis, that is, by detecting the regional emergencies of a specified day.

Because the regional emergencies occurring in each city on each day are unknown, precision P@n is adopted as the evaluation metric, following current mainstream research methods. For the top-k emergencies detected each day, human judges decide whether each detected event is a regional emergency. Since the number of top-k events is small, the manual evaluation workload is modest.
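For reference, P@n is the fraction of the top n ranked results that are judged correct. The sketch below computes it from a ranked list of manual judgments and averages it over days. The averaging over days is an assumption about how the daily results are aggregated, and the function names are hypothetical.

```python
def precision_at_n(judgments: list[bool], n: int) -> float:
    """Fraction of the top-n detected events judged to be regional emergencies."""
    if n <= 0 or n > len(judgments):
        raise ValueError("n must be between 1 and the number of judged events")
    return sum(judgments[:n]) / n


def mean_precision_at_n(daily_judgments: list[list[bool]], n: int) -> float:
    """Average P@n over several days (assumed aggregation)."""
    return sum(precision_at_n(day, n) for day in daily_judgments) / len(daily_judgments)
```

For example, if the five top-ranked events of a day are judged correct, correct, incorrect, correct, incorrect, then P@1 is 1.0 and P@5 is 0.6.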

The results obtained by the 3 methods on the 5 evaluation metrics are shown in Table 1.

Table 1. Detection results of the 3 methods on the 5 evaluation metrics

Methods    P@1   P@2   P@3   P@4   P@5   Average
HBED       0.20  0.30  0.20  0.30  0.24  0.24
GeoBurst   0.80  0.70  0.80  0.75  0.72  0.72
LocTBED    0.80  0.80  0.87  0.80  0.76  0.76

Comparing the 3 methods, the proposed method LocTBED achieves the best results, with an average of 0.76 over the 5 evaluation metrics. GeoBurst comes next, with an average of 0.72 over the 5 evaluation metrics. Although the values obtained by these two methods are close, the ranking of emergencies in their detection results differs considerably. When calculating the popularity of an emergency cluster, LocTBED takes into account the number of regional words the cluster contains, which is of significant help in detecting regional emergencies.

HBED performs poorly, mainly because few of the collected geotagged microblogs carry hashtags, and those hashtags mostly concern wide-area events, so the method is not suitable for detecting regional events.

The method of the invention is not limited to the examples described in the detailed description; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of technical innovation of the invention.

Claims

1. A method for detecting regional emergencies in a microblog network, characterized in that its specific steps are as follows:

A. Collect regional microblogs from the microblog network to obtain the microblog set PLMB, and preprocess the microblogs to obtain the microblog set LMB;

B. Extract burst words from the microblog set LMB to obtain the burst word set EW;

C. Cluster the burst words in EW to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters;

The specific steps of step B are as follows:

B1. Segment each microblog lmb_i (1 ≤ i ≤ n) in LMB into words, where n is the number of microblogs; remove stop words and keep nouns, verbs, place names, person names, and proper nouns, to obtain the final word set LMBW = {w₁, w₂, …, w_r}, assuming there are r words;

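B2. Calculate the frequency burstiness of word w_i (1 ≤ i ≤ r). Assuming the current emergency detection time point is k, and taking the historical data of the previous p time points as a reference, the frequency burstiness F(w_i) of word w_i at time point k is defined as:

[formula not reproduced in this text]

where the numerator is the frequency with which word w_i occurs at time point k; the description of the denominator term is truncated in this text;

B3. Calculate the associated-user burstiness U(u|w_i) of word w_i (1 ≤ i ≤ r) at time point k:

[formula not reproduced in this text]

where the numerator is the number of distinct users who mention word w_i at time point k;

B4. Calculate the regional burstiness GT(gt|w_i) of word w_i (1 ≤ i ≤ r) at time point k:

[formula not reproduced in this text]

where the numerator is the number of distinct geotags attached to mentions of word w_i at time point k;

B5. Calculate the social-behavior burstiness SB(sb|w_i) of word w_i (1 ≤ i ≤ r) at time point k:

[formula not reproduced in this text]

where the numerator is the sum of the numbers of reposts, comments, and reads of the microblogs that mention word w_i at time point k;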

B6. Combine the four burstiness values from steps B2, B3, B4, and B5 to obtain the burst value of word w_i at time point k: BurstyScore(w_i) = α*F(w_i) + β*U(u|w_i) + χ*GT(gt|w_i) + δ*SB(sb|w_i), where α, β, χ, and δ are adjustment coefficients that set the weights of the four types of indicators, with α + β + χ + δ = 1, α ≥ 0, β ≥ 0, χ ≥ 0, δ ≥ 0;

B7. After the burst value of each word has been calculated, use the interquartile range to select n burst words, which form the burst word set EW; the interquartile range is calculated as IQS(EW) = Q₃(EW) - Q₁(EW); a word is treated as a burst word when its burst value exceeds a threshold, which is calculated as maxima(EW) = Q₃(EW) + 1.5 × IQS(EW).

2. The method for detecting regional emergencies in a microblog network according to claim 1, characterized in that the specific steps of step A are as follows:

A1. Use a collection tool to obtain the set of localized microblogs PLMB = {plmb₁, plmb₂, …, plmb_m}, where plmb_i (1 ≤ i ≤ m) is an individual regional microblog and m is the number of regional microblogs;

A2. Preprocess the microblog set PLMB by removing link URLs and emoticons from the microblogs and discarding microblogs shorter than 5 characters, to obtain the preprocessed microblog set LMB = {lmb₁, lmb₂, …, lmb_n}, where lmb_i (1 ≤ i ≤ n) is an individual regional microblog.

3. The method for detecting regional emergencies in a microblog network according to claim 1, characterized in that the specific steps of step C are as follows:

C1. Based on the burst word set EW obtained in step B, construct the burst word association network EWN = (V, E), where V is the burst word set EW and E represents the association strength between burst words; the association strength of burst words ew_i and ew_j is the number of times the two words co-occur in the same microblog post;

C2. After the burst word association network EWN has been constructed, cluster EWN with the open-source CLUTO toolkit to obtain the emergency word clusters EWC = {ewc₁, ewc₂, …, ewc_q}, assuming there are q word clusters.