CN112597309A

CN112597309A - Detection system for identifying microblog data stream of sudden event in real time

Info

Publication number: CN112597309A
Application number: CN202011566168.0A
Authority: CN
Inventors: 庄旭; 尹可鑫; 甘翼; 袁鑫; 丛迅超; 李贵
Original assignee: Southwest Electronic Technology Institute No 10 Institute of Cetc
Current assignee: CETC 10 Research Institute; Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-02

Abstract

The invention discloses a detection system for identifying emergencies in a microblog data stream in real time, which can quickly and accurately detect and identify an emergency without any prior knowledge about the event. The invention is realized by the following technical scheme: text data are crawled in real time with a crawler tool; the entity extraction module extracts named entities of various types, and the trend identification module produces hot word lists for different regions; the entity filtering module filters out entities that are not trending; the similarity calculation module builds a co-occurrence matrix within a window, calculates entity similarity, and constructs an entity relationship graph; the similarity filtering module removes edges with small similarity values from the entity relationship graph; the entity clustering module applies a community discovery algorithm to the entity relationship graph to obtain the corresponding cluster set; the cluster linking module continuously tracks the events in each event window; the cluster grading module grades the linked clusters according to the number of hot words they contain; and the data storage module stores the cluster grading information.

Description

Detection system for identifying microblog data stream of sudden event in real time

Technical Field

The invention belongs to the technical field of emergency detection and identification, and particularly relates to a detection system for identifying an emergency microblog data stream in real time.

Background

With the rapid development of internet technology and of social network services, news sites, forums, microblogs, and social platforms delivered through smartphone applications have become important channels through which people spread and acquire information. In recent years microblogs in particular have flourished and are favored by users for their immediacy and convenience. People can publish and obtain information about an emergency in the real world at the earliest moment. For example, the officially verified Sina Weibo account of China's National Health Commission has become a primary channel through which many people in China follow the COVID-19 epidemic in real time.

In recent years, the public data provided by internet platforms such as Twitter, Facebook, and Sina Weibo have received continuous attention from industry and academia as a basis for detecting and identifying real-world emergencies. On these highly interactive social platforms people respond to real-world emergencies in real time, so their posts can serve as an effective indicator of trending social events. Understanding how an event emerges and develops on social media helps local governments and related organizations make decisions and act quickly.

The data obtained from social platforms are streaming data: fast, massive, out of order, and in need of rapid response. These information resources are also heterogeneous, scattered, and highly repetitive; they lack a uniform formal representation, form isolated information islands, and are difficult to integrate and use. How to meet the processing requirements of streaming data is therefore an active research topic. Typically, events are first extracted by a stream processing system, and the events and topics found in the stream are then analyzed, predicted, and presented, so that the information of interest can be obtained conveniently and the related application requirements are met.

Automatically detecting and classifying events from streaming data is of high reference value to public safety organizations, health and epidemic prevention organizations, and other bodies that must respond to such events. Event detection and identification based on social platform data streams is still in the exploratory phase and faces many challenges. First, social platforms typically restrict the length of posts, so only a small amount of text is available for analysis. Second, informal, irregular, and abbreviated words are common in social platform data streams. Finally, social platforms often carry malicious content such as advertisements, pornography, viruses, and phishing.

The invention aims to detect and identify emergencies in real time from a microblog data stream. Event detection and identification here also covers event evolution: the development of an event is explored by continuously tracking it with historical event information. Although there are many research results and some effective solutions for real-time event detection and identification, most existing methods only detect and identify global events or events at the scale of a large region such as a country (for example, large-scale natural disasters and armed conflicts), and do not detect and identify small-scale events (such as local epidemics and forest fires). In addition, many methods require the number of events, the types of events, and similar information to be set manually, which usually demands prior knowledge drawn from a large amount of material as well as manually labeled data. The method of the invention can generate a word cloud description of an emergency without any prior knowledge or manual labeling.

Disclosure of Invention

To solve the above problems, and in view of the shortcomings of existing research on large-scale microblog message streams and the complexity of the structure and content of microblog event stream data, the invention provides a detection system that can quickly and accurately detect and identify emergencies in a microblog data stream without any prior knowledge about the event.

To achieve this purpose, the invention adopts the following technical scheme: a detection system for identifying emergencies in a microblog data stream in real time, comprising an entity extraction module, an entity filtering module to which a trend identification module is connected, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module, and a data storage module, connected in series in sequence to form a full-flow system from the raw microblog data stream to event detection, identification, and storage, characterized in that:

the entity extraction module is based on a RoBERTA-wwm-large-ext model, trains on an NER data set issued by a CLUE academic organization, and is used for extracting various types of named entities;

text data are crawled in real time, using crawler technology, from the officially verified microblogs of provinces, cities, and counties and from various big-V (verified influencer) accounts, and the crawled data are cleaned; the cleaned data are input into the entity extraction module, which extracts the named entities they contain;

the trend identification module takes the microblog as a data source of the network public sentiment of the emergency, extracts named entities and geographical regions in microblog data, stores the named entities and the geographical regions in a mode of entity, region and count, and calculates by using a region-entity binary group to obtain a hot word list related to different regions;

the entity filtering module continuously maintains the regional hot word lists, updates them periodically, and uses them to filter out entities that are not trending;

the similarity calculation module calculates the word frequency of the entities remaining after entity filtering, builds a co-occurrence matrix within a determined window, calculates the similarity between entities from the word frequency counts and the co-occurrence matrix, which represents the connections between entities, and constructs an entity relationship graph whose edges carry the entity similarity values; the similarity filtering module removes the edges of the entity relationship graph whose similarity is smaller than a threshold S;

the entity clustering module uses the Louvain algorithm to calculate the modularity of regions in the graph and uses a resolution R to adjust the granularity of the communities in the graph, obtaining the corresponding cluster set C_T;

the cluster linking module treats the linking of the cluster set C_{T-1} of the previous time window with the cluster set C_T of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links;

the clustering grading module grades the clustering results after clustering linkage according to the number of hot words contained in the clustering results; and finally, the data storage module correspondingly stores the information such as cluster link, cluster grading and the like.

Compared with the prior art, the invention has the following beneficial effects:

The invention connects an entity extraction module, an entity filtering module, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module, and a data storage module in series to form a full-flow system from the raw microblog data stream to event detection, identification, and storage. The system can generate a word cloud description of an emergency without any prior knowledge about the event or manual labeling, and can detect and identify the emergency quickly and accurately.

The invention crawls text data in real time from officially verified microblogs and various big-V accounts using crawler technology, cleans the crawled data, removes Chinese stop words, and stores the geographic location indicated by the text, so that bursty topics can be detected earlier.

The entity extraction module adopted by the invention is based on the RoBERTA-wwm-large-ext model and is trained on the NER data set released by the CLUE organization. On the Chinese fine-grained named entity recognition task it achieves recognition results clearly superior to those of BERT and Bi-LSTM + CRF.

The trend identification module adopted by the invention extracts and counts region-entity pairs: for a given microblog post it extracts the useful entities and geographic regions, stores them as <entity, region, count> records, and calculates from the region-entity pairs a hot word list for each region.

The entity filtering module adopted by the invention continuously maintains the regional hot word lists, updates them periodically, and uses them to filter out entities that are not trending.

The similarity calculation module adopted by the invention calculates the word frequency of the entities remaining after entity filtering, builds a co-occurrence matrix within a determined window, calculates the similarity from the word frequency counts and the co-occurrence matrix, which represents the connections between words, and constructs an entity relationship graph whose edges carry the entity similarity values.

The similarity filtering module removes from the entity relationship graph the edges between entity nodes whose similarity is smaller than the threshold S; the filtered entity relationship graph can then be clustered with a community discovery method from graph theory.

The entity clustering module adopted by the invention uses the Louvain algorithm to calculate the modularity of regions in the graph and uses the resolution R to adjust the granularity of the communities in the graph. A larger R yields smaller communities, and a smaller R yields larger communities. Compared with ordinary algorithms based on modularity and modularity gain, Louvain is fast and clusters particularly well on graphs with few nodes and many edges. With the Louvain algorithm, application scenarios with data volumes of up to millions of records per minute can be handled.

The cluster linking module adopted by the invention treats the clustering results of two consecutive time windows as a maximum matching problem on a bipartite graph and uses the KM (Kuhn-Munkres) algorithm, which handles maximum matching of weighted bipartite graphs, to track events continuously.

The invention thus converts event detection and identification into the community discovery and bipartite matching problems of graph theory, simplifies and optimizes many complicated calculations, and maintains high accuracy and reliability while guaranteeing real-time performance, which makes it well suited to applications that must detect emergencies from massive data in real time.

Drawings

FIG. 1 is a schematic diagram of an organization architecture of a detection system for identifying a microblog data stream of a sudden event in real time according to the present invention;

FIG. 2 is a data processing flow diagram of the system of FIG. 1.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.

Detailed Description

Refer to FIG. 1 and FIG. 2. In the preferred embodiment described below, a detection system for identifying emergencies in a microblog data stream in real time comprises an entity extraction module, an entity filtering module to which a trend identification module is connected, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module, and a data storage module, connected in series in sequence to form a full-flow system from the raw microblog data stream to event detection, identification, and storage. The entity extraction module is based on the RoBERTA-wwm-large-ext model, is trained on the NER data set released by the CLUE academic organization, and extracts named entities of various types. Text data are crawled in real time, using crawler technology, from the officially verified microblogs of provinces, cities, and counties and from various big-V accounts, and the crawled data are cleaned. The cleaned data are input into the entity extraction module, which extracts the named entities they contain in real time.

The trend identification module takes microblogs as the data source for online public opinion about emergencies, extracts the named entities and geographic regions in the microblog data, stores them as <entity, region, count> records, and calculates from the region-entity pairs a hot word list for each region. The entity filtering module continuously maintains the regional hot word lists, updates them periodically, and uses them to filter out entities that are not trending. The similarity calculation module calculates the word frequency of the entities remaining after entity filtering, builds a co-occurrence matrix within a determined window, calculates the similarity from the word frequency counts and the co-occurrence matrix, which represents the connections between words, and constructs an entity relationship graph whose edges carry the entity similarity values. The similarity filtering module removes the edges of the entity relationship graph whose similarity is smaller than a threshold S. The entity clustering module uses the Louvain algorithm to calculate the modularity of regions in the entity relationship graph and uses a resolution R to adjust the granularity of the communities in the graph, obtaining the corresponding cluster set C_T. The cluster linking module treats the linking of the cluster set C_{T-1} of the previous time window with the cluster set C_T of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links. The cluster grading module grades the linked clusters according to the number of hot words they contain. Finally, the data storage module stores the cluster link and cluster grading information.

The trend identification module extracts named entities with the entity extraction model, builds the regional hot word list by combining them with the geographic location information obtained in the data cleaning stage, and obtains an evaluation model for scoring regional hot words. The model is defined in terms of the actual number of occurrences of an entity e in a region d and of E(d, e), the expected number of occurrences of that entity in the next time window, as shown in formula (1):

Each entity or hot word is stored in memory together with its previous expected score for subsequent use, where N_s denotes the count within a shorter time window, N_l denotes the count within a longer time window, d denotes a region, and e denotes a named entity.
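The counting of region-entity pairs over a shorter and a longer window can be sketched as follows. This is a minimal illustration only: formula (1) is not used here, and the expected count (the long-window rate scaled to the short window) and the `ratio` threshold are hypothetical stand-ins chosen for the example, as are the function names.

```python
from collections import Counter


def count_pairs(records):
    """Count (region, entity) pairs; each record is (region, [entities])."""
    counts = Counter()
    for region, entities in records:
        for entity in set(entities):  # count an entity once per post
            counts[(region, entity)] += 1
    return counts


def hot_words(short_counts, long_counts, short_len, long_len, ratio=2.0):
    """Return {region: [entity, ...]} for pairs that are unusually frequent.

    ASSUMPTION: the expected short-window count is the long-window count
    N_l scaled by short_len / long_len, and a pair is "hot" when its
    short-window count N_s exceeds `ratio` times that expectation.
    This rule is a stand-in, not the scoring formula of the source.
    """
    hot = {}
    for (region, entity), n_s in short_counts.items():
        n_l = long_counts.get((region, entity), 0)
        expected = n_l * short_len / long_len
        if n_s > ratio * expected:
            hot.setdefault(region, []).append(entity)
    return hot


# Example: a 10-minute window compared against a 60-minute window.
long_records = [("Chengdu", ["fire"])] * 8 + [("Chengdu", ["weather"])] * 12
short_records = [("Chengdu", ["fire"])] * 6 + [("Chengdu", ["weather"])] * 1
hot = hot_words(count_pairs(short_records), count_pairs(long_records), 10, 60)
```

In this example "fire" is flagged for Chengdu because most of its mentions fall in the short window, while "weather" is mentioned steadily and is not flagged.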

The similarity calculation module judges the similarity between different named entities and constructs an entity relationship graph whose edges carry the entity similarity values. The cosine similarity of entities X and Y is calculated with formula (2):

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{k} x_k y_k}{\sqrt{\sum_{k} x_k^2}\,\sqrt{\sum_{k} y_k^2}} \tag{2}$$

where x_k and y_k are the components of the entity vectors of X and Y.

The similarity filtering module removes edges with small values from the entity relationship graph: if the similarity between two named entities X and Y is smaller than the threshold S, the edge connecting them in the entity relationship graph is deleted.
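The threshold filtering step can be sketched in a few lines. The edge dictionary layout and the function name are assumptions made for this illustration; the source does not prescribe a data structure for the graph.

```python
def filter_edges(edges, threshold):
    """Keep only edges whose similarity is at least `threshold` (S).

    edges: {(entity_x, entity_y): similarity}, an assumed representation
    of the entity relationship graph. Edges with similarity < S are dropped.
    """
    return {pair: sim for pair, sim in edges.items() if sim >= threshold}


edges = {("fire", "forest"): 0.9, ("fire", "concert"): 0.1}
kept = filter_edges(edges, threshold=0.3)
```

Only the edge between "fire" and "forest" survives, so the later clustering step sees a sparser graph.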

The entity clustering module calculates the modularity and closeness of the communities in the graph with the Louvain community discovery algorithm and adjusts the granularity of the communities with the resolution R. The modularity is calculated from the weight A_ij of the connection between nodes i and j, the sum k_i = Σ_j A_ij of the weights of all edges connected to node i, and the sum of the weights of all connections in the network, m = ½ Σ_ij A_ij, using formula (3):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{3}$$

where m is the sum of the weights of the network connections, k_i is the sum of the weights of all edges connected to node i, k_j is the sum of the weights of all edges connected to node j, δ(c_i, c_j) indicates whether nodes i and j are in the same community (1 if they are, 0 otherwise), and c_i and c_j are the community numbers of nodes i and j.
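As a check on the definitions above, the modularity of a given partition can be computed directly. The sketch below assumes an undirected weighted graph without self-loops, stored as an edge dictionary; the function name and the example graph are illustrative only, and the resolution R is left out.

```python
def modularity(edges, community):
    """Modularity Q of a partition.

    edges: {(i, j): weight}, each undirected edge listed once, no self-loops.
    community: {node: community label}.
    """
    degree = {node: 0.0 for node in community}  # k_i
    for (i, j), w in edges.items():
        degree[i] += w
        degree[j] += w
    two_m = sum(degree.values())  # 2m
    # Sum of A_ij over ordered pairs (i, j) inside the same community.
    inside = sum(2 * w for (i, j), w in edges.items()
                 if community[i] == community[j])
    # Sum of k_i per community.
    total = {}
    for node, label in community.items():
        total[label] = total.get(label, 0.0) + degree[node]
    return inside / two_m - sum(t * t for t in total.values()) / two_m ** 2


# Two triangles joined by a single edge (c, d).
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1, ("c", "d"): 1}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)    # 5/14, about 0.357
q_merged = modularity(edges, merged)  # 0.0
```

Splitting the graph into its two triangles scores higher than putting every node in one community, which is the behaviour the modularity measure is meant to reward.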

The entity clustering module calculates the modularity and closeness of regions in the graph with the Louvain community discovery algorithm in two stages.

First stage: according to the modularity and closeness evaluation model, each node is assigned a community number and labeled, similar nodes are grouped into one class, and the change in community modularity ΔQ is calculated with formula (4):

where Σ_in is the sum of the weights of the connections inside the community and Σ_tot is the sum of the weights of all edges connected to the community.

Second stage: the entity clustering module re-initializes a new graph in which each class becomes a single node and repeats the first-stage evaluation of modularity and closeness, which completes one iteration; the termination condition is determined with reference to the evaluation model.
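The second stage, in which each class is collapsed into a single node, can be sketched as follows. The adjacency-dictionary layout and the function name are assumptions; edges inside a community become a self-loop on the new node, which is the usual convention and is assumed here.

```python
def aggregate(adj, community):
    """Collapse each community into one node.

    adj: {node: {neighbour: weight}}, symmetric (an undirected graph).
    community: {node: community label}.
    Returns the adjacency of the new graph, keyed by community label.
    An internal edge of weight w contributes 2w to the self-loop entry,
    because it is visited once from each endpoint.
    """
    new_adj = {}
    for node, neighbours in adj.items():
        row = new_adj.setdefault(community[node], {})
        for neighbour, w in neighbours.items():
            label = community[neighbour]
            row[label] = row.get(label, 0.0) + w
    return new_adj


# Two triangles joined by the edge (c, d), already split into two classes.
adj = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
coarse = aggregate(adj, community)
```

The coarse graph has two nodes, each with a self-loop of weight 6 and one connecting edge of weight 1, so node degrees (7 per community) are preserved for the next round.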

The cluster linking module treats the linking of the cluster set C_{T-1} of the previous time window with the cluster set C_T of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links. The cluster grading module grades the linked clusters according to the number of hot words they contain. Finally, the data storage module stores the cluster link and cluster grading information.

The data crawled with crawler technology contain publishing-region information, or at least allow the geographic region covered by the microblog content to be inferred from the locality of the media account. The crawled data are cleaned by filtering and deduplication: filtering deletes low-quality or sensitive content from the text, and deduplication deletes similar or repeated microblog posts, which limits the contribution of any single user to a trend.
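The cleaning step can be sketched under simplifying assumptions: filtering is reduced to a blocklist of terms, and deduplication only removes posts that are identical after whitespace and case normalization. Detecting merely similar posts, which the text also mentions, is not attempted here. All names are hypothetical.

```python
import re


def clean_posts(posts, blocked_terms):
    """Filter and deduplicate posts, keeping the original order.

    posts: list of post texts.
    blocked_terms: substrings marking low-quality or sensitive content
    (an assumed, simplified stand-in for the filtering rules).
    """
    seen = set()
    kept = []
    for text in posts:
        if any(term in text for term in blocked_terms):
            continue  # filtering: drop unwanted content
        key = re.sub(r"\s+", "", text).lower()
        if key in seen:
            continue  # deduplication: drop repeated posts
        seen.add(key)
        kept.append(text)
    return kept


posts = ["Fire in  Liangshan", "fire in liangshan",
         "Buy cheap phones now", "Flood warning"]
cleaned = clean_posts(posts, blocked_terms=["cheap phones"])
```

The repeated fire report and the advertisement are removed, leaving one post per distinct message.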

The entity extraction module is based on the RoBERTA-wwm-large-ext model and is trained on the NER data set released by the CLUE organization. On the Chinese fine-grained named entity recognition task it achieves recognition results clearly superior to those of BERT and Bi-LSTM + CRF.

The trend identification module extracts named entities with the entity extraction model, builds the regional hot word list by combining them with the geographic location information obtained in the data cleaning stage, and obtains an evaluation model for scoring regional hot words. The model is defined in terms of the actual number of occurrences of an entity e in a region d and of E(d, e), the expected number of occurrences of that entity in the next time window, as shown in formula (1):

Each entity or hot word with the highest expected score is stored in memory for later use, where N_s denotes the count within a shorter time window, N_l denotes the count within a longer time window, d denotes the region, and e denotes a named entity.

The entity filtering module uses the hot word list of each region to filter out entities with low popularity. The similarity calculation module then builds an entity co-occurrence matrix from the word frequencies within a determined time window and judges the similarity between different named entities. The similarity is calculated as follows. Given three microblog posts, "iPhone unveiled at the Apple launch event #AppleLaunchEvent", "Cook launches a new iPhone #AppleLaunchEvent", and "Cook shows a new iPhone", the entity vectors of iPhone and #AppleLaunchEvent are iPhone = (1, 1, 1) and #AppleLaunchEvent = (1, 1, 0), so their cosine similarity is cos(iPhone, #AppleLaunchEvent) = 2 / (√3 · √2) ≈ 0.81649. If the similarity between two named entities is smaller than the threshold S, the corresponding entry of the entity co-occurrence matrix is deleted, which finally yields an entity relationship graph stored in the form of a co-occurrence matrix.
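The worked example can be reproduced with a short script. Each post is represented by the set of entities it contains, which is an assumption about how the entity vectors are built; the hashtag label is an English rendering of the one in the example, and the function names are illustrative.

```python
import math


def occurrence_vector(entity, posts):
    """1 where the post contains the entity, else 0 (assumed construction)."""
    return [1 if entity in post else 0 for post in posts]


def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Entities extracted from the three example posts.
posts = [
    {"iPhone", "#AppleLaunchEvent"},
    {"Cook", "iPhone", "#AppleLaunchEvent"},
    {"Cook", "iPhone"},
]
iphone = occurrence_vector("iPhone", posts)            # [1, 1, 1]
event = occurrence_vector("#AppleLaunchEvent", posts)  # [1, 1, 0]
similarity = cosine(iphone, event)                     # 2 / sqrt(6), about 0.8165
```

The result agrees with the value 0.81649 quoted in the example up to rounding.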

The similarity calculation module judges the similarity between different named entities X and Y with formula (2):

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{k} x_k y_k}{\sqrt{\sum_{k} x_k^2}\,\sqrt{\sum_{k} y_k^2}} \tag{2}$$

where x_k and y_k are the components of the entity vectors of X and Y. If the similarity between the two named entities X and Y is smaller than the threshold S, the corresponding entry of the entity co-occurrence matrix is deleted, which finally yields an entity relationship graph stored in the form of a co-occurrence matrix.

The Louvain algorithm is a community discovery algorithm from graph theory whose core principle is to calculate the modularity and closeness of the communities in a graph. On the one hand the graph must be split into parts; on the other hand the modularity of each part (that is, the quality of the partition) must be measured. Compared with ordinary algorithms based on modularity and modularity gain, the Louvain algorithm is fast and clusters particularly well on graphs with few nodes and many edges. With the Louvain algorithm, application scenarios with data volumes of up to millions of records per minute can be handled.

Based on the modularity and closeness of regions in the graph calculated with the Louvain community discovery algorithm, the entity clustering module adjusts the granularity of the communities in the graph with the resolution R. A larger R yields smaller communities, and a smaller R yields larger communities. The modularity is calculated from the weight A_ij of the connection between nodes i and j, the sum k_i = Σ_j A_ij of the weights of all edges connected to node i, and the sum of the weights of all connections in the network, m = ½ Σ_ij A_ij, using formula (3):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{3}$$

where m is the sum of the weights of the network connections, k_i is the sum of the weights of all edges connected to node i, k_j is the sum of the weights of all edges connected to node j, δ(c_i, c_j) indicates whether nodes i and j are in the same community (1 if they are, 0 otherwise), and c_i and c_j are the community numbers of nodes i and j. Note that the modularity Q takes values within [-1, 1]. When nodes i and j are not connected by an edge, A_ij = 0 while the term k_i k_j / 2m may still be greater than 0, which means that adding to a community a node that is not connected to its members has a negative effect on Q.

Specifically, the entity clustering module calculates the modularity and closeness of regions in the graph with the Louvain community discovery algorithm in two stages.

The first stage performs classification. First, each node is assigned its own community number, so that a network of N nodes initially has N communities. Then, for each node i and each of its neighbors j, node i is tentatively moved into the community of j and the resulting change in modularity is evaluated. If ΔQ for the move is positive, the move is accepted; otherwise the original assignment is kept. The change in community modularity ΔQ is calculated with formula (4):
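The first stage can be sketched as a local-moving loop. The gain used below is the standard simplification of the modularity change for placing an isolated node i in a community C, ΔQ = k_i,in / m − Σ_tot · k_i / (2m²), where k_i,in (a symbol introduced here) is the total weight of edges between i and nodes of C and Σ_tot is the sum of the degrees of the nodes in C. The resolution R is omitted, the graph is assumed to have no self-loops, and all names are hypothetical.

```python
def modularity_gain(k_i_in, k_i, sigma_tot, m):
    """Gain in Q from placing isolated node i into community C.

    k_i_in: total weight of edges between i and nodes of C (assumed symbol)
    k_i: total weight of edges incident to i
    sigma_tot: sum of degrees of the nodes currently in C (without i)
    m: total edge weight of the graph
    """
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


def local_moving(adj):
    """One Louvain first stage on adj = {node: {neighbour: weight}}."""
    m = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    community = {n: n for n in adj}  # every node starts in its own community
    sigma_tot = dict(degree)
    improved = True
    while improved:
        improved = False
        for node in adj:
            old = community[node]
            sigma_tot[old] -= degree[node]  # take the node out of its community
            k_in = {}  # weight from node to each neighbouring community
            for nbr, w in adj[node].items():
                k_in[community[nbr]] = k_in.get(community[nbr], 0.0) + w
            best = old
            best_gain = modularity_gain(k_in.get(old, 0.0), degree[node],
                                        sigma_tot[old], m)
            for com, k in k_in.items():
                gain = modularity_gain(k, degree[node], sigma_tot[com], m)
                if gain > best_gain:  # accept only strict improvements
                    best, best_gain = com, gain
            sigma_tot[best] += degree[node]
            community[node] = best
            if best != old:
                improved = True
    return community


# Two triangles joined by the edge (c, d).
adj = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "a": 1, "d": 1},
    "d": {"e": 1, "f": 1, "c": 1}, "e": {"d": 1, "f": 1}, "f": {"e": 1, "d": 1},
}
result = local_moving(adj)
```

On this small graph the loop settles on the two triangles as communities; community labels are simply the names of representative nodes.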

The cluster linking module treats the clustering results of two consecutive time windows as a maximum matching problem on a bipartite graph and uses the KM algorithm, which handles maximum matching of weighted bipartite graphs, to track events continuously.

A bipartite graph (also called a bigraph) is a graph whose vertices can be divided into two disjoint sets U and V such that no two vertices within the same set are adjacent (share an edge). Weighted matching on a bipartite graph seeks a matching, that is, a set of edges with no shared vertices, for which the sum of the edge weights is maximal or minimal.

The KM algorithm finds the optimal matching of a weighted bipartite graph. It can be summarized in four steps: (1) initialize the feasible vertex labels; (2) search for a perfect matching with the Hungarian algorithm; (3) if no perfect matching is found, modify the feasible vertex labels; (4) repeat steps 2 and 3 until a perfect matching of the equality subgraph is found. A feasible vertex labeling assigns a label value L(node) to every node of the original graph. The label values of the nodes in set X can be recorded in an array Lx(x) and those of the nodes in set Y in an array Ly(y), and every edge (x, y) of the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y).
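A compact way to see the matching step in action is the potential-based form of the assignment algorithm, in which the row and column potentials play the role of the vertex labels. The sketch below is one common O(n²m) formulation, not necessarily the variant used by the invention. It assumes there are at most as many previous clusters (rows) as current clusters (columns), and it uses Jaccard overlap of entity sets as the edge weight, which is a hypothetical choice since the text does not specify the weights.

```python
def max_weight_matching(weight):
    """Maximum-weight assignment for an n x m matrix with n <= m.

    Returns `match`, where match[i] is the column assigned to row i.
    Weights are negated so that the minimizing form of the algorithm applies.
    """
    n, m = len(weight), len(weight[0])
    assert n <= m, "transpose the matrix if there are more rows than columns"
    inf = float("inf")
    cost = [[-w for w in row] for row in weight]
    u = [0.0] * (n + 1)   # row potentials (vertex labels)
    v = [0.0] * (m + 1)   # column potentials (vertex labels)
    p = [0] * (m + 1)     # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # back-pointers along the alternating path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):  # adjust the labels
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the edges along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def jaccard(a, b):
    """Overlap of two entity sets (assumed link weight)."""
    return len(a & b) / len(a | b) if a | b else 0.0


previous = [{"fire", "forest", "Liangshan"}, {"flood", "Chongqing"}]
current = [{"flood", "Chongqing", "rescue"},
           {"fire", "forest", "evacuation"}, {"concert"}]
weights = [[jaccard(a, b) for b in current] for a in previous]
match = max_weight_matching(weights)
# Keep only links with a non-zero overlap.
links = [(i, j) for i, j in enumerate(match) if weights[i][j] > 0]
```

Here the fire cluster of the previous window is linked to the fire cluster of the current window and likewise for the flood cluster, while the unrelated concert cluster is left as a new event.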

The basic principle of the Hungarian algorithm is the augmenting path: a path that starts at an unmatched vertex, follows an alternating path whose edges alternate between non-matching and matching edges, and ends at another unmatched vertex. Both end vertices are unmatched and the first and last edges are non-matching edges, so the path contains exactly one more non-matching edge than matching edges. Swapping the matching and non-matching edges along the augmenting path therefore produces one more matching edge, which is the sense in which the path is augmenting. The core of the Hungarian algorithm is thus to keep finding augmenting paths and swapping their edges until none remains.
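The augmenting-path idea can be illustrated on an unweighted bipartite graph with a short depth-first search. This is a minimal sketch with hypothetical names; left vertices are indexed 0..n-1 and `adj[u]` lists the right vertices adjacent to left vertex `u`.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum matching by repeatedly searching for augmenting paths.

    Returns (size, match_right), where match_right[v] is the left vertex
    matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        # Look for an augmenting path that starts at left vertex u.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u  # swap the edges along the path
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


size, match_right = max_bipartite_matching([[0, 1], [0], [1, 2]], n_right=3)
```

When left vertex 1 finds its only neighbour already taken, the search re-routes left vertex 0 to another neighbour, which is exactly the edge swap along an augmenting path.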

The cluster linking module uses the KM algorithm to find the optimal matching of the weighted bipartite graph and thereby the cluster links. It initializes the feasible vertex labels, searches for a perfect matching with the Hungarian algorithm, and, whenever no perfect matching is found, modifies the feasible vertex labels and searches again until a perfect matching of the equality subgraph is found. The feasible vertex labeling gives every node of the original graph a label value L(node); the array Lx(x) records the label values of the nodes in set X and the array Ly(y) those of the nodes in set Y, and every edge (x, y) of the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y). The Hungarian algorithm starts from an unmatched vertex, follows an alternating path, and ends at another unmatched vertex. The cluster grading module grades the linked clusters according to the number of hot words they contain: the higher the grade, the more urgent the event. The data storage module stores the cluster link and cluster grading information.
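The grading rule, which counts the hot words contained in each cluster, can be sketched as follows. Representing clusters and hot words as sets of strings is an assumption, and the function name is illustrative.

```python
def grade_clusters(clusters, hot_words):
    """Return (grade, cluster) pairs, highest grade first.

    The grade of a cluster is the number of hot words it contains.
    """
    graded = [(len(cluster & hot_words), cluster) for cluster in clusters]
    return sorted(graded, key=lambda pair: pair[0], reverse=True)


clusters = [{"fire", "forest", "evacuation"},
            {"concert", "tickets"},
            {"flood", "rescue"}]
hot_words = {"fire", "forest", "flood"}
graded = grade_clusters(clusters, hot_words)
```

The fire cluster contains two hot words and is ranked first, the flood cluster follows with one, and the concert cluster with none comes last.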

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A detection system for identifying the microblog data stream of a sudden event in real time, comprising: an entity extraction module, a trend identification module, an entity filtering module, a similarity calculation module, a similarity filtering module, an entity clustering module, a clustering link module, a clustering grading module and a data storage module connected in series in sequence, which together form a complete pipeline from the original microblog data stream to event detection, identification and storage, characterized in that: the entity extraction module is based on the RoBERTA-wwm-large-ext model, is trained on the NER data set released by the CLUE academic organization, and is used for extracting various types of named entities; text data is crawled in real time, using crawler technology, from the verified official microblogs of provinces, cities and counties and from various big-V accounts, and the crawled data is cleaned; the cleaned data is input to the entity extraction module, which extracts the named entities contained in the data in real time; the trend identification module takes microblogs as the data source of online public opinion on sudden events, extracts the named entities and geographical regions in the microblog data, stores them in the form (entity, region, count), and uses the region-entity pairs to calculate the hot word lists related to different regions; the entity filtering module continuously maintains the regional hot word list, updates it periodically, and uses it to filter out entities without heat; the similarity calculation module counts the frequency (Frequency) of the entities remaining after entity filtering, simultaneously builds an entity co-occurrence matrix (Co-occurrences) within a given window, calculates the entity similarity from the frequency counts and the co-occurrence matrix, and constructs an entity relationship graph (Graph) with the entity similarity values as edges; the similarity filtering module filters out the edges of the entity relationship graph whose similarity is smaller than a threshold S; the entity clustering module uses the Louvain algorithm to calculate the modularity of the communities (Community) in the entity relationship graph and uses a resolution R (resolution) to adjust the granularity of the communities in the graph, obtaining the corresponding cluster set C_T; the clustering link module treats the matching between the cluster set C_{T-1} of the previous time window and the cluster set C_T of the current time window as a bipartite matching (Bipartite Matching) problem, and continuously tracks the clusters and events in each event window to find the cluster links; the clustering grading module grades the linked clustering results according to the number of hot words they contain; and finally, the data storage module stores the corresponding information such as cluster links and cluster grades.

2. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the trend identification module extracts named entities with the entity extraction model, builds the regional hot word list in combination with the geographical location information obtained in the data cleaning stage, and obtains a modularity and closeness evaluation model for scoring the regional hot words, in which the count represents the number of occurrences of an entity e in a region d, and E(d, e) represents the expected value of the number of occurrences of that entity in the next time window, as shown in formula (1):

the entities or hot words with the highest expected scores are stored in memory for subsequent use,

where N_s denotes the count within a short time window, N_l denotes the count within a longer time window, d denotes a region, and e denotes a named entity.
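The bookkeeping behind the regional hot word list can be sketched as follows. Formula (1) is not reproduced in this text, so the scoring function is passed in by the caller rather than fixed; the function names, the top_k parameter and the fallback used when an entity is absent from the long window are all assumptions made for illustration.

```python
from collections import Counter


def count_region_entities(posts):
    """posts: iterable of (region, [entities]); counts each (region, entity) pair."""
    counts = Counter()
    for region, entities in posts:
        for entity in entities:
            counts[(region, entity)] += 1
    return counts


def hot_words(short_counts, long_counts, score, top_k=3):
    """Per-region top_k entities ranked by score(N_s, N_l).

    score is caller-supplied because formula (1) is not available here.
    """
    per_region = {}
    for (region, entity), n_s in short_counts.items():
        # Assumption: an entity unseen in the long window falls back to N_l = N_s.
        n_l = long_counts.get((region, entity), n_s)
        per_region.setdefault(region, []).append((score(n_s, n_l), entity))
    return {
        region: [entity for _, entity in sorted(items, reverse=True)[:top_k]]
        for region, items in per_region.items()
    }
```

A caller could, for example, pass a simple ratio such as lambda n_s, n_l: n_s / n_l as a stand-in score; that ratio is a placeholder and should not be read as formula (1).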

3. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the similarity calculation module judges the similarity between different named entities by calculating the cosine similarity of entities X and Y with the similarity calculation formula (2) shown below:

4. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 3, wherein: the similarity filtering module filters the similarities between entities, and if the similarity between two named entities X and Y is smaller than a threshold S, the edge connecting the two entities in the entity relationship graph is deleted.
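The similarity calculation and threshold filtering can be sketched together. Formula (2) is not reproduced in this text; the sketch assumes the common cosine form for occurrence counts, co(X, Y) / sqrt(freq(X) · freq(Y)), and the function name build_entity_graph is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def build_entity_graph(posts, threshold):
    """posts: list of entity collections, one per post in the time window.

    Returns {(x, y): similarity} for the edges that survive filtering.
    """
    freq = Counter()  # how many posts mention each entity
    cooc = Counter()  # how many posts mention both entities of a pair
    for entities in posts:
        unique = sorted(set(entities))
        freq.update(unique)
        cooc.update(combinations(unique, 2))
    graph = {}
    for (x, y), c in cooc.items():
        # Assumed cosine form; formula (2) itself is not available here.
        sim = c / sqrt(freq[x] * freq[y])
        if sim >= threshold:  # edges with similarity below S are deleted
            graph[(x, y)] = sim
    return graph
```

Sorting the entities of each post before pairing them keeps every edge key in a single canonical order, so (x, y) and (y, x) are never counted separately.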

5. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the entity clustering module uses the community discovery algorithm Louvain to calculate the regional modularity and closeness in the graph and uses the resolution R to adjust the granularity of the communities in the graph, wherein the modularity is expressed in terms of the weight A_ij of the edge between nodes i and j, the sum k_i = ∑_j A_ij of the weights of all edges connected to node i, and the sum m of the weights of all edges in the network,

the modularity being calculated with the modularity calculation formula shown in formula (3):

where m denotes the sum of the weights of the network edges, k_i denotes the sum of the weights of all edges connected to node i, k_j denotes the sum of the weights of all edges connected to node j, δ(c_i, c_j) indicates whether nodes i and j are in the same community (1 if they are, 0 otherwise), and c_i and c_j denote the community numbers of nodes i and j.
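With these symbols, the standard Louvain modularity is Q = (1/2m) ∑_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j). Formula (3) is not reproduced in this text, so the sketch below assumes it matches this standard definition; the optional resolution factor multiplying the k_i k_j / (2m) term follows common library practice and is likewise an assumption, as is the function name.

```python
def modularity(edges, community, resolution=1.0):
    """Modularity of a partition of an undirected weighted graph without self-loops.

    edges: {(i, j): weight}, each undirected edge listed once.
    community: {node: community id}.
    """
    degree = {}  # k_i: sum of the weights of the edges at node i
    m = 0.0      # total edge weight of the network
    for (i, j), w in edges.items():
        degree[i] = degree.get(i, 0.0) + w
        degree[j] = degree.get(j, 0.0) + w
        m += w
    # Sum of A_ij over ordered same-community pairs (each edge counts twice).
    inside = sum(2 * w for (i, j), w in edges.items() if community[i] == community[j])
    # Sum of k_i * k_j over ordered same-community pairs equals the sum of the
    # squared community degree totals.
    comm_degree = {}
    for node, k in degree.items():
        c = community[node]
        comm_degree[c] = comm_degree.get(c, 0.0) + k
    expected = sum(k * k for k in comm_degree.values()) / (2 * m)
    return (inside - resolution * expected) / (2 * m)
```

For two disconnected edges a-b and c-d of weight 1, placing each edge in its own community gives Q = 0.5, while merging all four nodes into one community gives Q = 0.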

6. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 4, wherein: the calculation of the regional modularity and closeness in the graph by the entity clustering module with the community discovery algorithm Louvain is divided into two stages:

7. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 5, wherein, in the second stage: the entity clustering module re-initializes a new graph in which all nodes of the same community are merged into a single node, and repeats the first-stage evaluation of modularity and closeness, thereby completing one iteration; the termination condition is determined with reference to the evaluation model.
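The rebuilding of the graph in the second stage can be sketched as below, following the usual Louvain aggregation step: edges between two communities are summed into one edge, and edges inside a community become a self-loop on the merged node. The edge-dictionary representation and the function name are assumptions for illustration.

```python
def aggregate_graph(edges, community):
    """Collapse each community into a single node and sum the merged edge weights.

    edges: {(i, j): weight}, each undirected edge listed once.
    community: {node: community id} produced by the first stage.
    """
    new_edges = {}
    for (i, j), w in edges.items():
        ci, cj = community[i], community[j]
        key = (ci, cj) if ci <= cj else (cj, ci)  # canonical order for undirected edges
        new_edges[key] = new_edges.get(key, 0.0) + w
    return new_edges
```

The first stage would then be run again on the returned graph, with each merged node starting in its own community.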

8. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the clustering link module uses the KM algorithm to find the optimal matching of the weighted bipartite graph and thereby finds the cluster links; it initializes the feasible vertex labels and searches for a perfect matching with the Hungarian algorithm, and if no perfect matching is found, it modifies the feasible vertex labels and continues searching until a perfect matching of the equality subgraph is found, wherein a feasible vertex labeling means that every node in the original graph is given a label value by a function L(node); the label values of the nodes in set X are recorded in array Lx(x), and the label values of the nodes in set Y are recorded in array Ly(y).

9. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 8, wherein: the clustering link module uses the feasible vertex labeling, in which a function L(node) gives every node in the original graph a label value, array Lx(x) records the label values of the nodes in set X and array Ly(y) records the label values of the nodes in set Y; subject to the condition that every edge (x, y) in the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y), the Hungarian algorithm starts from an unmatched vertex, follows an alternating path, and ends at another unmatched vertex, so that augmenting paths are repeatedly found and the matching is repeatedly exchanged along them.

10. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 8, wherein: the Hungarian algorithm starts from an unmatched vertex, follows an alternating path, and ends at another unmatched vertex, the path containing one more non-matching edge than matching edges; the algorithm repeatedly finds such augmenting paths and exchanges the matching and non-matching edges along them.