CN112597309A - Detection system for identifying microblog data stream of sudden event in real time - Google Patents

Detection system for identifying microblog data stream of sudden event in real time

Info

Publication number
CN112597309A
Authority
CN
China
Prior art keywords
entity
module
clustering
similarity
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011566168.0A
Other languages
Chinese (zh)
Inventor
庄旭
尹可鑫
甘翼
袁鑫
丛迅超
李贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 10 Research Institute
Southwest Electronic Technology Institute No 10 Institute of Cetc
Original Assignee
Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Electronic Technology Institute No 10 Institute of Cetc
Priority to CN202011566168.0A
Publication of CN112597309A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Computing Systems (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a detection system for identifying emergencies in a microblog data stream in real time, which can quickly and accurately detect and identify an emergency without any prior knowledge about the event. The technical scheme is as follows: a crawler tool crawls text data in real time; the entity extraction module extracts named entities of various types, and the trend identification module produces hot word lists for different regions; the entity filtering module filters out entities that are not trending; the similarity calculation module builds a co-occurrence matrix within the window, calculates the similarity between entities and constructs an entity relationship graph; the similarity filtering module removes low-valued edges from the entity relationship graph; the entity clustering module applies a community discovery algorithm to the entity relationship graph to obtain the corresponding cluster set; the cluster linking module continuously tracks the events in the event window; the cluster grading module grades the linked clustering results by the number of hot words they contain; and the data storage module stores the cluster grading information.

Description

Detection system for identifying microblog data stream of sudden event in real time
Technical Field
The invention belongs to the technical field of emergency detection and identification, and particularly relates to a detection system for identifying an emergency microblog data stream in real time.
Background
With the rapid development of internet technology and of social network services, news sites, forums, microblogs and social platforms carried by smartphone applications have become important platforms for people to spread and acquire information. In recent years microblogs in particular have flourished and are favored by users for their immediacy and convenience. People can publish and obtain information about a real-world emergency as soon as it happens. For example, the officially verified Sina Weibo account of China's National Health Commission has become a primary channel through which many Chinese people follow the real-time COVID-19 epidemic situation.
In recent years, public data from internet platforms such as Twitter, Facebook and Sina Weibo has received continuous attention from industry and academia as a means of detecting and identifying real-world emergencies. Through these highly interactive social platforms people respond to real-world emergencies in real time, so the platforms can serve as an effective indicator of hot social events. Understanding how an event emerges and develops on social media helps local governments and related organizations support decision-making and act quickly.
The data obtained from social platforms is streaming data: it arrives fast, in massive volume and out of order, and it requires a fast response. These information resources are also heterogeneous, scattered and highly repetitive; they lack a uniform formal representation, form isolated information islands, and are difficult to integrate and use. How to meet the processing requirements of streaming data is therefore a hot research topic. Typically, events are extracted by a stream processing system, and predictive analysis and representation of events and topics are then performed on the events that appear in the stream, so that the questions of interest can be answered conveniently and effectively and related application requirements can be met.
Automatically detecting and classifying events from streaming data is of high reference value to organizations that must respond accordingly, such as public safety agencies and health and epidemic prevention agencies. Event detection and identification based on social platform data streams still faces many challenges and remains at the exploratory stage. First, social platforms typically limit the length of posts, so only a small amount of text is available for analysis. Second, informal, irregular and abbreviated words are common in social platform data streams. Finally, social platforms often carry malicious content such as advertisements, pornography, viruses and phishing.
The invention aims to detect and identify emergencies from a microblog data stream in real time. Event detection and identification here also covers event evolution: the evolution of an event is explored by continuously tracking it through historical event information. Although there are many research results and some effective solutions for real-time event detection and identification, most existing methods only detect and identify global events or events at the scale of a region such as a country (for example, large-scale natural disasters or armed conflicts), and do not detect and identify small-scale events (such as a local epidemic or a forest fire). In addition, many methods require information such as the number and types of events to be set manually, which in turn requires prior knowledge drawn from massive material and manually labeled data. The method of the invention can generate a word cloud description of an emergency without any prior knowledge or manual labeling.
Disclosure of Invention
To solve the above problems, and in view of the shortcomings of existing research on large-scale microblog message streams and the complexity of the structure and content of microblog event stream data, the invention provides a detection system that can quickly and accurately detect and identify emergencies in a microblog data stream without any prior knowledge about the event.
To achieve this purpose, the invention adopts the following technical scheme: a detection system for identifying emergencies in a microblog data stream in real time comprises an entity extraction module, an entity filtering module with a trend identification module attached alongside it, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module and a data storage module, connected in series in that order to form a full-process system from the raw microblog data stream to event detection, identification and storage, characterized in that:
the entity extraction module is based on the RoBERTa-wwm-large-ext model, is trained on the NER dataset released by the CLUE academic organization, and is used to extract named entities of various types;
text data is crawled in real time, using crawler technology, from the verified official microblogs of provinces, cities and counties and from various big V (verified, influential) accounts, and the crawled data is cleaned; the cleaned data is input into the entity extraction module, which extracts the named entities it contains;
the trend identification module takes microblogs as the data source for online public opinion about emergencies, extracts the named entities and geographical regions in the microblog data, stores them as <entity, region, count> tuples, and uses the region-entity pairs to compute hot word lists for different regions;
the entity filtering module continuously maintains and periodically updates the regional hot word lists, and uses them to filter out entities that are not trending;
the similarity calculation module calculates the frequency of the entities remaining after filtering, builds a co-occurrence matrix within a given window, calculates the similarity between entities from the frequency counts and the co-occurrence matrix representing the connections between entities, and constructs an entity relationship graph whose edge weights are the entity similarity values; the similarity filtering module removes edges whose similarity is smaller than a threshold S from the entity relationship graph;
the entity clustering module uses the Louvain algorithm to calculate the modularity of regions in the graph, and uses the resolution R to adjust the granularity of the communities in the graph, obtaining the corresponding cluster set $C_T$;
the cluster linking module treats the matching between the cluster set $C_{T-1}$ of the previous time window and the cluster set $C_T$ of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links;
the cluster grading module grades the linked clustering results by the number of hot words they contain; and finally, the data storage module stores the cluster link information, the cluster grading information and the like.
Compared with the prior art, the invention has the following beneficial effects:
The invention connects in series an entity extraction module, an entity filtering module, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module and a data storage module, forming a full-process system from the raw microblog data stream to event detection, identification and storage. It can generate a word cloud description of an emergency without any prior knowledge about the event or manual labeling, and can detect and identify emergencies quickly and accurately.
The invention uses crawler technology to crawl text data in real time from officially verified microblogs and various big V accounts, cleans the crawled data, removes Chinese stop words, and stores the geographic location indicated by the text, so that a sudden topic can be detected earlier.
The entity extraction module is based on the RoBERTa-wwm-large-ext model and is trained on the NER dataset released by the CLUE organization. On the Chinese fine-grained named entity recognition task it achieves recognition results clearly better than BERT and Bi-LSTM + CRF.
The trend identification module extracts and counts region-entity pairs: for given microblog data it extracts the useful entities and geographic regions, stores them as <entity, region, count> tuples, and uses the region-entity pairs to compute the hot word lists for different regions.
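The region-entity counting described above can be illustrated with a minimal Python sketch. It assumes that the extracted entities arrive as (entity, region) pairs and ranks hot words by raw count only; the source scores hot words with formula (1), which is not reproduced here. The function name `build_hot_word_lists` and the parameter `top_k` are hypothetical.

```python
from collections import Counter, defaultdict


def build_hot_word_lists(records, top_k=2):
    """Count (entity, region) pairs and keep the top_k entities per region.

    Ranking by raw count is an illustrative assumption, not the scoring
    model of the source.
    """
    counts = Counter(records)  # (entity, region) -> count
    per_region = defaultdict(list)
    for (entity, region), count in counts.items():
        per_region[region].append((entity, count))
    hot_words = {}
    for region, items in per_region.items():
        # Sort by descending count, then alphabetically for a stable order.
        ranked = sorted(items, key=lambda item: (-item[1], item[0]))
        hot_words[region] = [entity for entity, _ in ranked[:top_k]]
    return hot_words


records = [
    ("earthquake", "Sichuan"), ("earthquake", "Sichuan"),
    ("rescue", "Sichuan"), ("concert", "Sichuan"),
    ("rescue", "Sichuan"), ("typhoon", "Fujian"),
]
hot_words = build_hot_word_lists(records)
```

Here `Counter(records)` plays the role of the <entity, region, count> store, and the per-region lists are what the entity filtering module would consult.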
The entity filtering module continuously maintains and periodically updates the regional hot word lists, and uses them to filter out entities that are not trending.
The similarity calculation module calculates the frequency of the entities remaining after filtering, builds a co-occurrence matrix within a given window, calculates the similarity from the frequency counts and the co-occurrence matrix representing the connections between words, and constructs an entity relationship graph whose edge weights are the entity similarity values.
The similarity filtering module removes from the entity relationship graph the edges between entity nodes whose similarity is smaller than the threshold S; the filtered graph can then be clustered with community discovery methods from graph theory.
The entity clustering module uses the Louvain algorithm to calculate the modularity of regions in the graph and uses the resolution R to adjust the granularity of the communities. A larger R yields smaller communities and a smaller R yields larger communities. Compared with common algorithms based on modularity and modularity gain, Louvain is fast and clusters particularly well on graphs with few nodes and many edges. With the Louvain algorithm, the system can cope with application scenarios in which the data volume reaches millions of items per minute.
The cluster linking module treats the clustering results of two consecutive time windows as a maximum matching problem on a bipartite graph and applies the Kuhn-Munkres (KM) algorithm, which handles maximum-weight matching on weighted bipartite graphs, thereby achieving continuous tracking of events.
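As a hedged illustration of the cluster linking step, the sketch below solves the same maximum-weight bipartite matching problem by brute force over permutations rather than with the KM algorithm, which is only practical for a handful of clusters. The edge weight (Jaccard overlap of entity sets) and the names `link_clusters` and `min_overlap` are assumptions made for the example; the source does not state how the edges are weighted.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard overlap of two entity sets (assumed edge weight)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_clusters(prev, curr, min_overlap=0.1):
    """Match clusters of the previous window to clusters of the current one.

    Exhaustive search over assignments; a stand-in for the KM algorithm.
    Returns (prev_index, curr_index, weight) triples.
    """
    swap = len(prev) > len(curr)
    small, large = (curr, prev) if swap else (prev, curr)
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j, jaccard(small[i], large[j])) for i, j in enumerate(perm)]
        total = sum(weight for _, _, weight in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    links = [(j, i, w) if swap else (i, j, w) for i, j, w in best_pairs]
    # Drop matches that share too little to count as the same event.
    return [(p, c, w) for p, c, w in links if w >= min_overlap]


prev_clusters = [{"earthquake", "Sichuan", "rescue"}, {"iphone", "#apple event"}]
curr_clusters = [
    {"iphone", "#apple event", "cook"},
    {"earthquake", "Sichuan", "aftershock"},
    {"concert"},
]
links = link_clusters(prev_clusters, curr_clusters)
```

Unmatched clusters in the current window (here the "concert" cluster) would be treated as new events.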
The invention converts the event detection and identification problem into community discovery and bipartite matching problems in graph theory, which simplifies and optimizes many complicated calculations, maintains high accuracy and reliability while guaranteeing real-time performance, and makes the system well suited to application scenarios that need to detect emergencies from massive data in real time.
Drawings
FIG. 1 is a schematic diagram of an organization architecture of a detection system for identifying a microblog data stream of a sudden event in real time according to the present invention;
FIG. 2 is a data processing flow diagram of the system of FIG. 1.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
Detailed Description
Refer to FIG. 1 and FIG. 2. In the preferred embodiment described below, a detection system for identifying emergencies in a microblog data stream in real time comprises an entity extraction module, an entity filtering module with a trend identification module attached alongside it, a similarity calculation module, a similarity filtering module, a cluster linking module, a cluster grading module and a data storage module, connected in series in that order to form a full-process system from the raw microblog data stream to event detection, identification and storage. The entity extraction module is based on the RoBERTa-wwm-large-ext model, is trained on the NER dataset released by the CLUE academic organization, and is used to extract named entities of various types. Text data is crawled in real time, using crawler technology, from the verified official microblogs of provinces, cities and counties and from various big V accounts, and the crawled data is cleaned. The cleaned data is input into the entity extraction module, which extracts the named entities it contains in real time.
The trend identification module takes microblogs as the data source for online public opinion about emergencies, extracts the named entities and geographical regions in the microblog data, stores them as <entity, region, count> tuples, and uses the region-entity pairs to compute hot word lists for different regions. The entity filtering module continuously maintains and periodically updates the regional hot word lists, and uses them to filter out entities that are not trending. The similarity calculation module calculates the frequency of the entities remaining after filtering, builds a co-occurrence matrix within a given window, calculates the similarity from the frequency counts and the co-occurrence matrix representing the connections between words, and constructs an entity relationship graph whose edge weights are the entity similarity values. The similarity filtering module removes edges whose similarity is smaller than a threshold S from the entity relationship graph. The entity clustering module uses the Louvain algorithm to calculate the modularity of regions in the entity relationship graph, and uses the resolution R to adjust the granularity of the communities in the graph, obtaining the corresponding cluster set $C_T$. The cluster linking module treats the matching between the cluster set $C_{T-1}$ of the previous time window and the cluster set $C_T$ of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links. The cluster grading module grades the linked clustering results by the number of hot words they contain. Finally, the data storage module stores the cluster link information, the cluster grading information and the like.
The trend identification module extracts named entities with the entity extraction model, builds the regional hot word lists by combining them with the geographic location information obtained in the data cleaning stage, and derives an evaluation model for scoring regional hot words. In this model the observed count is the actual number of occurrences of an entity e in a region d, and E(d, e) is the expected number of occurrences of that entity in the next time window, as shown in formula (1):
[Formula (1): equation image not recoverable from the extracted text]
Each entity or hot word whose expected score ranks near the top is stored in memory for later use, where $N_s$ denotes the count within a shorter time window, $N_l$ the count within a longer time window, d a region and e a named entity.
The entity filtering module continuously maintains and periodically updates the regional hot word lists, and uses them to filter out entities that are not trending.
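A minimal sketch of this filtering step follows, assuming the regional hot word lists are held as a mapping from region name to a set of entities. The function name `filter_entities` and the data layout are hypothetical.

```python
def filter_entities(entities, region, hot_words_by_region):
    """Keep only the entities that appear in the region's hot word list."""
    hot = hot_words_by_region.get(region, set())
    return [entity for entity in entities if entity in hot]


hot_words_by_region = {"Sichuan": {"earthquake", "rescue"}}
kept = filter_entities(["earthquake", "concert", "rescue"], "Sichuan", hot_words_by_region)
unknown_region = filter_entities(["earthquake"], "Fujian", hot_words_by_region)
```

In this sketch a region without a hot word list filters out every entity; whether the real system behaves that way is not stated in the source.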
The similarity calculation module judges the similarity between different named entities and constructs an entity relationship graph whose edge weights are the entity similarity values. The cosine similarity of entities X and Y is calculated with formula (2):
$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{n} x_n y_n}{\sqrt{\sum_{n} x_n^2} \, \sqrt{\sum_{n} y_n^2}} \tag{2}$$
where $x_n$ and $y_n$ are the components of the occurrence vectors of X and Y. The similarity filtering module removes low-valued edges from the entity relationship graph: if the similarity between two named entities X and Y is smaller than the threshold S, the edge connecting them in the entity relationship graph is deleted.
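The graph construction and edge filtering can be sketched as follows. The example assumes that each post is one co-occurrence unit and that an entity's vector records in which posts it appears; the source only says the matrix is built "within a given window", so this is one possible reading. The name `build_entity_graph` and the threshold value 0.6 are illustrative.

```python
from itertools import combinations
from math import sqrt


def build_entity_graph(posts, threshold):
    """Return {(x, y): cosine similarity} for entity pairs at or above threshold.

    posts is a list of sets of entities; each post is treated as one
    co-occurrence unit (an assumption of this sketch).
    """
    entities = sorted(set().union(*posts)) if posts else []
    vectors = {e: [1 if e in post else 0 for post in posts] for e in entities}
    edges = {}
    for x, y in combinations(entities, 2):
        dot = sum(a * b for a, b in zip(vectors[x], vectors[y]))
        norm = sqrt(sum(a * a for a in vectors[x])) * sqrt(sum(b * b for b in vectors[y]))
        similarity = dot / norm if norm else 0.0
        if similarity >= threshold:  # edges below the threshold S are dropped
            edges[(x, y)] = similarity
    return edges


posts = [
    {"iphone", "#apple event"},
    {"iphone", "#apple event", "cook"},
    {"iphone", "cook"},
]
edges = build_entity_graph(posts, threshold=0.6)
```

With these posts the pair ("#apple event", "cook") has similarity 0.5 and is removed, while both pairs involving "iphone" are kept.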
The entity clustering module calculates the modularity (closeness) of the communities in the graph with the community discovery algorithm Louvain and adjusts the granularity of the communities with the resolution R. The modularity is computed from the weight $A_{ij}$ of the connection between nodes i and j, the sum $k_i = \sum_j A_{ij}$ of the weights of all edges connected to node i, and the sum of the weights of the entire network
$$m = \frac{1}{2} \sum_{i,j} A_{ij}$$
using the modularity formula (3):
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{3}$$
where m is the sum of the weights of the network connections, $k_i$ is the sum of the weights of all edges connected to node i, $k_j$ is the sum of the weights of all edges connected to node j, $\delta(c_i, c_j)$ indicates whether nodes i and j are in the same community (1 if they are, 0 otherwise), and $c_i$, $c_j$ are the community numbers of nodes i and j.
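To make the modularity computation concrete, here is a small pure-Python sketch of the standard weighted modularity, evaluated on two triangles joined by a single edge. It assumes the graph is given as a symmetric adjacency dictionary; the helper names `from_edges` and `modularity` are hypothetical, and the resolution R is left out.

```python
def from_edges(edges):
    """Build a symmetric adjacency dict node -> {neighbor: weight}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def modularity(adj, community):
    """Q = 1/(2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())  # equals 2m
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                q += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return q / two_m


# Two triangles (0,1,2) and (3,4,5) connected by the edge 2-3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0)]
adj = from_edges(edges)
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(adj, community)  # 5/14 for this partition
```

The double loop is quadratic in the number of nodes and is meant for reading, not for the data volumes the source targets.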
The entity clustering module computes the modularity of regions in the graph with the community discovery algorithm Louvain in two stages.
First stage: according to the modularity evaluation model, each node is assigned a community number and marked with an identifier, similar nodes are grouped into one class, and the change in community modularity ΔQ is calculated with formula (4):
$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right] \tag{4}$$
where $\Sigma_{in}$ is the sum of the weights of the connections within the community (each internal edge counted once in each direction), $\Sigma_{tot}$ is the sum of the weights of all edges connected to nodes of the community, and $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes of the community.
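The modularity change for moving an isolated node into a community can be checked numerically. The sketch below uses the common form of the Louvain gain in which the internal weight of the community counts each edge once in each direction and the links from the node to the community enter with a factor of two; the equation image in the source could not be inspected, so this convention is an assumption. The function name `delta_q` is hypothetical.

```python
def delta_q(sigma_in, sigma_tot, k_i, k_i_in, m):
    """Gain in modularity when isolated node i joins a community.

    sigma_in : internal weight of the community, each edge counted twice
    sigma_tot: sum of the degrees of the community's nodes
    k_i      : degree of node i
    k_i_in   : weight of the edges between node i and the community
    m        : total edge weight of the graph
    """
    after = (sigma_in + 2.0 * k_i_in) / (2.0 * m) - ((sigma_tot + k_i) / (2.0 * m)) ** 2
    before = (sigma_in / (2.0 * m) - (sigma_tot / (2.0 * m)) ** 2
              - (k_i / (2.0 * m)) ** 2)
    return after - before


# Two triangles (0,1,2) and (3,4,5) joined by edge 2-3, so m = 7.
# Node 2 (degree 3) joins the community {0, 1}: one internal edge
# (sigma_in = 2), degrees 2 + 2 (sigma_tot = 4), two links from node 2.
gain = delta_q(sigma_in=2.0, sigma_tot=4.0, k_i=3.0, k_i_in=2.0, m=7.0)
simplified = 2.0 / 7.0 - 4.0 * 3.0 / (2.0 * 7.0 ** 2)  # k_i_in/m - sigma_tot*k_i/(2m^2)
```

The bracketed expression simplifies algebraically to `k_i_in / m - sigma_tot * k_i / (2 * m**2)`, which is what an implementation would normally evaluate.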
Second stage: the entity clustering module re-initializes a new graph in which each class from the first stage becomes a single node, and repeats the first-stage evaluation of modularity on it; this completes one iteration. The termination condition is determined by the evaluation model.
The cluster linking module treats the matching between the cluster set $C_{T-1}$ of the previous time window and the cluster set $C_T$ of the current time window as a bipartite matching problem, continuously tracks the clusters and events in each event window, and finds the cluster links. The cluster grading module grades the linked clustering results by the number of hot words they contain. Finally, the data storage module stores the cluster link information, the cluster grading information and the like.
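The grading rule, ranking clusters by how many hot words they contain, can be sketched in a few lines. The function name `grade_clusters` and the output layout are assumptions for illustration.

```python
def grade_clusters(clusters, hot_words):
    """Rank clusters by the number of hot words each one contains."""
    graded = [(len(set(cluster) & hot_words), sorted(cluster)) for cluster in clusters]
    # Highest hot word count first; sorted() is stable for equal counts.
    return sorted(graded, key=lambda item: item[0], reverse=True)


hot_words = {"earthquake", "Sichuan", "rescue", "iphone"}
clusters = [
    {"iphone", "cook"},
    {"earthquake", "Sichuan", "rescue", "aftershock"},
]
ranking = grade_clusters(clusters, hot_words)
```

Each entry of `ranking` pairs the hot word count with the cluster's entities, which is the kind of record the data storage module would persist.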
The data crawled with crawler technology contains the publishing region, or the geographic region covered by a microblog's content can at least be inferred from the geographic locality of the publishing media. The crawled data is cleaned by filtering and deduplication: filtering deletes low-quality or sensitive content in the text, and deduplication deletes similar or repeated microblog posts, which limits the contribution of any single user to a trend.
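A minimal sketch of the cleaning step is shown below. It assumes a simple substring blocklist for low-quality content and removes only exact duplicates after whitespace and case normalization; detecting merely similar posts, which the source also mentions, would need an additional similarity measure. The names `clean_posts` and `blocked_terms` are hypothetical.

```python
import re


def clean_posts(posts, blocked_terms):
    """Drop posts containing blocked terms and remove repeated posts."""
    seen = set()
    cleaned = []
    for text in posts:
        # Collapse whitespace and lowercase so trivial variants compare equal.
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        if not normalized or any(term in normalized for term in blocked_terms):
            continue  # filtering: empty, low-quality or sensitive content
        if normalized in seen:
            continue  # deduplication: repeated post
        seen.add(normalized)
        cleaned.append(text.strip())
    return cleaned


posts = [
    "Earthquake in Sichuan",
    "earthquake  in sichuan ",
    "Buy cheap followers now",
    "Typhoon nears Fujian",
]
cleaned = clean_posts(posts, blocked_terms={"buy cheap"})
```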
The entity extraction module is based on the RoBERTa-wwm-large-ext model and is trained on the NER dataset released by the CLUE organization. On the Chinese fine-grained named entity recognition task it achieves recognition results clearly better than BERT and Bi-LSTM + CRF.
The trend identification module extracts named entities with the entity extraction model, builds the regional hot word lists by combining them with the geographic location information obtained in the data cleaning stage, and derives an evaluation model for scoring regional hot words. As shown in formula (1), the observed count is the actual number of occurrences of an entity e in a region d, and E(d, e) is the expected number of occurrences of that entity in the next time window:
[Formula (1): equation image not recoverable from the extracted text]
Each entity or hot word whose expected score ranks near the top is stored in memory for later use, where $N_s$ denotes the count within a shorter time window, $N_l$ the count within a longer time window, d a region and e a named entity.
The entity filtering module uses the hot word list of each region to filter out entities with low popularity. The similarity calculation module then builds an entity co-occurrence matrix from the word frequencies within a given time window and judges the similarity between different named entities. The similarity is calculated as follows. Given three microblog posts, "iphone is released at the #apple launch event", "Cook unveils the new iphone #apple launch event" and "Cook shows off the new iphone", the occurrence vectors of the entities iphone and #apple launch event are iphone = (1, 1, 1) and #apple launch event = (1, 1, 0), so their cosine similarity is cos(iphone, #apple launch event) = 2 / (√3 · √2) ≈ 0.81649. If the similarity between two named entities is smaller than the threshold S, the corresponding entry of the entity co-occurrence matrix is deleted, which finally yields an entity relationship graph stored in the form of the co-occurrence matrix.
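The arithmetic of this worked example can be reproduced with a few lines of Python. The vectors are the ones given in the text; the function name `cosine` is an illustrative choice.

```python
from math import sqrt


def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


iphone = (1, 1, 1)        # appears in all three posts
apple_event = (1, 1, 0)   # appears in the first two posts only
similarity = cosine(iphone, apple_event)  # 2 / (sqrt(3) * sqrt(2)) = 2 / sqrt(6)
```

The value agrees with the 0.81649 quoted in the text up to truncation of the last digit.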
The similarity calculation module judges the similarity between different named entities X and Y using formula (2):
$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{n} x_n y_n}{\sqrt{\sum_{n} x_n^2} \, \sqrt{\sum_{n} y_n^2}} \tag{2}$$
where $x_n$ and $y_n$ are the components of the occurrence vectors of X and Y. If the similarity between the two named entities X and Y is smaller than the threshold S, the corresponding entry of the entity co-occurrence matrix is deleted, which finally yields an entity relationship graph stored in the form of the co-occurrence matrix.
The Louvain algorithm is a community discovery algorithm from graph theory whose core principle is to calculate the modularity (closeness) of the communities in a graph. On the one hand the graph must be split into different parts, and on the other hand the modularity of each part (that is, the quality of the split) must be measured. Compared with common algorithms based on modularity and modularity gain, Louvain is fast and clusters particularly well on graphs with few nodes and many edges. With the Louvain algorithm, the system can cope with application scenarios in which the data volume reaches millions of items per minute.
Based on the community discovery algorithm Louvain, the entity clustering module calculates the modularity and closeness of regions in the graph and uses the resolution R to adjust the granularity of the communities in the graph. A larger R yields smaller communities, and a smaller R yields larger communities. The modularity is computed from the weight $A_{ij}$ of the connection between nodes $i$ and $j$, the sum $k_i=\sum_j A_{ij}$ of the weights of all edges connected to node $i$, and the sum of the weights of all connections in the network,
$$m = \frac{1}{2}\sum_{ij} A_{ij}.$$
The modularity is calculated using the modularity formula shown in formula (3):
$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) \qquad (3)$$
where $m$ represents the sum of the weights of the network connections, $k_i$ represents the sum of the weights of all edges connected to node $i$, $k_j$ represents the sum of the weights of all edges connected to node $j$, $\delta(c_i, c_j)$ indicates whether nodes $i$ and $j$ are in the same community (1 if they are, 0 otherwise), and $c_i$, $c_j$ are the community numbers of nodes $i$ and $j$. Note that the value of the modularity $Q$ lies in the range $[-1, 1]$. When nodes $i$ and $j$ are not connected by an edge, $A_{ij} = 0$, while the remaining term $k_i k_j / (2m)$ may still be greater than 0; this means that placing a node in a community containing nodes it is not connected to has a negative effect on $Q$.
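A minimal sketch of the modularity computation, assuming formula (3) is the standard Louvain modularity Q = 1/(2m) Σ_ij [A_ij − k_i·k_j/(2m)] δ(c_i, c_j). The function name `modularity` and the example graph (two triangles joined by a single edge) are illustrative assumptions.

```python
def modularity(A, community):
    """Modularity Q of a partition, given a symmetric weight matrix A."""
    n = len(A)
    k = [sum(row) for row in A]      # k_i = sum_j A_ij
    two_m = sum(k)                   # 2m = sum_ij A_ij
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q_two = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle: 5/14
q_one = modularity(A, [0, 0, 0, 0, 0, 0])   # everything in one community: 0
print(q_two, q_one)
```

Splitting the graph along the single bridging edge scores higher than lumping all nodes together, which is the behavior the modularity measure is meant to reward.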
Specifically, the entity clustering module's Louvain-based calculation of the modularity and closeness of regions in the graph is divided into two stages:
The first stage performs classification. First, each node is assigned its own community number, so that the network initially has N communities. Then, for each node i and each of its neighbors j, node i is tentatively moved into the community of j and the resulting change in modularity is evaluated. If the change ΔQ is positive, the move is accepted; otherwise the original assignment is kept. The community modularity change ΔQ is calculated according to formula (4):
Figure BDA0002861780160000081
where $\Sigma_{in}$ is the sum of the weights of the connections within the community, and $\Sigma_{tot}$ is the sum of the weights of all edges connected to the community.
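The image of formula (4) is not reproduced in this text, so the following sketch assumes the standard Louvain gain for moving an isolated node i into a community. In that case the Σ_in terms cancel and the gain simplifies to ΔQ = k_{i,in}/m − Σ_tot·k_i/(2m²), where k_{i,in} (a symbol not defined in the text above) is the total weight of edges from node i into the community. The function name `delta_q` and the example graph are illustrative.

```python
def delta_q(A, community, i, target):
    """Modularity gain from moving node i into community `target`.

    Assumes node i is currently alone in its community and has no self-loop.
    """
    n = len(A)
    m = sum(sum(row) for row in A) / 2.0             # total edge weight
    k_i = sum(A[i])                                  # weighted degree of node i
    members = [j for j in range(n) if community[j] == target and j != i]
    k_i_in = sum(A[i][j] for j in members)           # weight from i into target
    sigma_tot = sum(sum(A[j]) for j in members)      # Sigma_tot of target
    return k_i_in / m - sigma_tot * k_i / (2.0 * m * m)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Node 2 starts alone in community 2; nodes 0 and 1 form community 0.
gain = delta_q(A, [0, 0, 2, 1, 1, 1], i=2, target=0)
print(gain)   # 8/49, positive, so the move would be accepted
```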
The second stage: the entity clustering module re-initializes a new graph in which each class (community) becomes a single node, and then repeats the first-stage evaluation of modularity and closeness. This completes one iteration; the termination condition refers to the evaluation model.
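The graph rebuilding in the second stage can be sketched as follows, under the common convention that weights inside a community become a self-loop on the merged node. The function name `aggregate` and the example are assumptions made for illustration.

```python
def aggregate(A, community):
    """Collapse each community into one node, summing the edge weights."""
    labels = sorted(set(community))
    index = {c: pos for pos, c in enumerate(labels)}
    size = len(labels)
    B = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(A):
        for j, w in enumerate(row):
            # Intra-community weights accumulate on the diagonal (self-loops).
            B[index[community[i]]][index[community[j]]] += w
    return B


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
B = aggregate(A, [0, 0, 0, 1, 1, 1])
print(B)   # [[6.0, 1.0], [1.0, 6.0]]
```

The total weight of the matrix is preserved, so modularity values computed on the collapsed graph remain comparable with those of the original graph.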
The clustering link module treats the clustering results of two consecutive time windows as a bipartite graph maximum matching problem, and uses the KM algorithm, which handles maximum-weight matching of weighted bipartite graphs, to track events continuously.
A bipartite graph, also known as a bigraph, is a graph whose vertices can be divided into two disjoint sets U and V such that no two vertices in the same set are adjacent (share an edge). Weighted matching of a bipartite graph seeks a matching set whose edge weights sum to a maximum or a minimum.
The KM (Kuhn-Munkres) algorithm finds the best matching of a weighted bipartite graph. It can be summarized in four steps: (1) initialize a feasible vertex labeling; (2) search for a perfect matching using the Hungarian algorithm; (3) if no perfect matching is found, modify the feasible vertex labeling; (4) repeat steps (2) and (3) until a perfect matching of the equality subgraph is found. A feasible vertex labeling assigns every node in the original graph a label value through a function L(node). The labels of nodes in set X can be recorded in an array Lx(x), and the labels of nodes in set Y in an array Ly(y). Every edge (x, y) in the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y).
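The four steps above can be sketched as follows. This is a simple O(n⁴) version, not the patent's implementation. It assumes a square weight matrix (unequal cluster sets would need zero-weight padding), and the function name and example weights are hypothetical.

```python
def km_max_weight_matching(weight):
    """Maximum-weight perfect matching on a square weight matrix (KM algorithm)."""
    n = len(weight)
    lx = [max(row) for row in weight]   # Lx(x): heaviest edge leaving x
    ly = [0.0] * n                      # Ly(y): starts at zero
    match_y = [-1] * n                  # match_y[y] = x currently matched to y
    eps = 1e-9

    def dfs(x, vis_x, vis_y, slack):
        """Hungarian search for an augmenting path in the equality subgraph."""
        vis_x[x] = True
        for y in range(n):
            if vis_y[y]:
                continue
            gap = lx[x] + ly[y] - weight[x][y]
            if gap < eps:               # tight edge: Lx(x) + Ly(y) == weight
                vis_y[y] = True
                if match_y[y] == -1 or dfs(match_y[y], vis_x, vis_y, slack):
                    match_y[y] = x
                    return True
            elif gap < slack[y]:
                slack[y] = gap
        return False

    for x in range(n):
        while True:
            vis_x, vis_y = [False] * n, [False] * n
            slack = [float("inf")] * n
            if dfs(x, vis_x, vis_y, slack):
                break
            # No augmenting path: shift labels so that a new edge becomes tight.
            d = min(slack[y] for y in range(n) if not vis_y[y])
            for i in range(n):
                if vis_x[i]:
                    lx[i] -= d
                if vis_y[i]:
                    ly[i] += d

    pairs = sorted((match_y[y], y) for y in range(n))
    total = sum(weight[x][y] for x, y in pairs)
    return pairs, total


# Hypothetical similarities between clusters of window T-1 (rows) and T (columns).
pairs, total = km_max_weight_matching([[3, 5, 4], [2, 2, 1], [4, 6, 1]])
print(pairs, total)   # [(0, 2), (1, 0), (2, 1)] 12
```

Each returned pair (x, y) links cluster x of the previous window to cluster y of the current window.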
The basic principle of the Hungarian algorithm is to start from an unmatched point and follow an alternating path that ends at another unmatched point. Because both endpoints are unmatched, the first and last edges of the path are non-matching edges, so the path contains exactly one more non-matching edge than matching edges; such a path is called an augmenting path. Swapping the matching and non-matching edges along the augmenting path yields one more matching edge, which is the meaning of "augmenting". The core of the Hungarian algorithm is therefore to keep finding augmenting paths and swapping the matching along them.
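A minimal sketch of the augmenting-path search for an unweighted bipartite graph. The adjacency-list layout and the function name are assumptions made for the example.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum matching via augmenting paths (Hungarian algorithm).

    adj[u] lists the right-side vertices adjacent to left-side vertex u.
    """
    match_right = [-1] * n_right        # match_right[v] = left vertex matched to v

    def augment(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                # v is free, or its current partner can be re-matched elsewhere.
                if match_right[v] == -1 or augment(match_right[v], seen):
                    match_right[v] = u  # swap edges along the augmenting path
                    return True
        return False

    size = 0
    for u in range(len(adj)):
        if augment(u, [False] * n_right):
            size += 1
    return size, match_right


# Hypothetical graph: left vertex 0 -> {0, 1}, 1 -> {0}, 2 -> {1, 2}.
size, match_right = max_bipartite_matching([[0, 1], [0], [1, 2]], 3)
print(size, match_right)   # 3 [1, 0, 2]
```

When left vertex 1 arrives, right vertex 0 is already taken by left vertex 0; the search re-routes vertex 0 to right vertex 1, which is exactly the edge swap along an augmenting path.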
The clustering link module uses the KM algorithm to search for the best matching of the weighted bipartite graph and thereby find the cluster links. It initializes a feasible vertex labeling and searches for a perfect matching with the Hungarian algorithm; if no perfect matching is found, it modifies the feasible vertex labeling and searches again, until a perfect matching of the equality subgraph is found. The feasible vertex labeling gives every node in the original graph a label value L(node); an array Lx(x) records the labels of the nodes in set X and an array Ly(y) records the labels of the nodes in set Y, and every edge (x, y) in the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y). The Hungarian algorithm starts from an unmatched point, follows an alternating path and ends at another unmatched point. The clustering ranking module ranks the linked clustering results according to the number of hot words they contain; the higher the rank, the more urgent the event. The data storage module stores the cluster links, cluster ranks and related information accordingly.
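The ranking step can be sketched as counting the hot words in each cluster and sorting in descending order. The function name, the list-of-entities representation of a cluster, and the sample data are assumptions made for illustration.

```python
def rank_clusters(clusters, hot_words):
    """Sort clusters by the number of hot words they contain, highest first."""
    scored = [(sum(1 for entity in c if entity in hot_words), c) for c in clusters]
    # sorted() is stable, so clusters with equal counts keep their input order.
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical clusters and regional hot word list.
clusters = [
    ["iphone", "weather"],
    ["earthquake", "Sichuan", "rescue"],
    ["movie"],
]
hot_words = {"iphone", "earthquake", "Sichuan", "rescue"}
ranking = rank_clusters(clusters, hot_words)
print(ranking[0])   # (3, ['earthquake', 'Sichuan', 'rescue'])
```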
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A detection system for identifying a microblog data stream of a sudden event in real time, comprising: an entity extraction module, followed in series by a trend identification module, an entity filtering module, a similarity calculation module, a similarity filtering module, a clustering link module, a clustering module and a data storage module, which together form a complete pipeline from the raw microblog data stream to event detection, identification and storage, characterized in that: the entity extraction module is based on the RoBERTa-wwm-large-ext model, is trained on the NER data set released by the CLUE academic organization, and is used to extract named entities of various types; text data are crawled in real time, using crawler technology, from official microblogs certified at the province, city and county levels and from various "big V" accounts, and the crawled data are cleaned; the cleaned data are input into the entity extraction module, which extracts the named entities they contain in real time; the trend identification module takes microblogs as the data source of online public opinion on sudden events, extracts the named entities and geographic regions in the microblog data, stores them in the form (entity, region, count), and uses region-entity pairs to compute hot word lists for different regions; the entity filtering module continuously maintains the regional hot word list, updates it periodically, and uses it to filter out entities that are not trending; the similarity calculation module calculates the word frequency (Frequency) of the entities remaining after entity filtering, builds an entity co-occurrence matrix (co-occurrences) within a fixed window, calculates entity similarity from the word frequency counts and the co-occurrence matrix, and constructs an entity relationship graph (Graph) with the entity similarity values as edges; the similarity filtering module filters out the edges in the entity relationship graph whose similarity is smaller than a threshold S; the entity clustering module uses the Louvain algorithm to calculate the modularity of the communities in the entity relationship graph, and uses the resolution R (resolution) to adjust the granularity of the communities (Community) in the graph, obtaining the corresponding cluster set $C_T$; the clustering link module treats the matching between the cluster set $C_{T-1}$ of the previous time window and the cluster set $C_T$ of the current time window as a bipartite matching problem (Bipartite Matching), and continuously tracks the clusters and events in each time window to find the cluster links; the clustering grading module grades the linked clustering results according to the number of hot words they contain; and finally, the data storage module stores the cluster links, cluster grades and related information accordingly.
2. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the trend identification module extracts named entities according to the entity extraction model, builds a regional hot word list by combining them with the geographic location information obtained in the data cleaning stage, and obtains a modularity and closeness evaluation model for scoring the regional hot words, wherein the modularity and closeness evaluation model reflects the occurrence frequency of an entity e in a region d, and E(d, e) represents the expected value of the number of occurrences of that entity in the next time window, as shown in formula (1):
Figure FDA0002861780150000011
each entity or hot word with the highest expected score is stored in memory for subsequent use,
where $N_s$ denotes the count within a short time window, $N_l$ denotes the count within a longer time window, d denotes a region, and e denotes a named entity.
3. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the similarity calculation module judges the similarity between different named entities and calculates the cosine similarity of the entities X and Y by adopting a similarity calculation formula (2) shown as follows:
Figure FDA0002861780150000021
4. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 3, wherein: the similarity filtering module filters the similarities between entities, and if the similarity between the two named entities X and Y is smaller than a threshold S, the edge connecting the two entities in the entity relationship graph is deleted.
5. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the entity clustering module, based on the community discovery algorithm Louvain, calculates the modularity and closeness of regions in the graph and uses the resolution R to adjust the granularity of the communities in the graph, wherein the modularity is computed from the weight $A_{ij}$ of the connection between nodes $i$ and $j$, the sum $k_i=\sum_j A_{ij}$ of the weights of all edges connected to node $i$, and the sum of the weights of all connections in the network,
$$m = \frac{1}{2}\sum_{ij} A_{ij},$$
and the modularity is calculated using the modularity formula shown in formula (3):
$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) \qquad (3)$$
where $m$ represents the sum of the weights of the network connections, $k_i$ represents the sum of the weights of all edges connected to node $i$, $k_j$ represents the sum of the weights of all edges connected to node $j$, $\delta(c_i, c_j)$ indicates whether nodes $i$ and $j$ are in the same community (1 if they are, 0 otherwise), and $c_i$, $c_j$ are the community numbers of nodes $i$ and $j$.
6. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 4, wherein: the entity clustering module's Louvain-based calculation of the modularity and closeness of regions in the graph is divided into two stages:
the first stage: according to the evaluation model of modularity and closeness, each node is assigned a community number, similar nodes are grouped into one class, each node is given an identifier, and the community modularity change ΔQ is calculated according to formula (4):
Figure FDA0002861780150000024
where $\Sigma_{in}$ is the sum of the weights of the connections within the community, and $\Sigma_{tot}$ is the sum of the weights of all edges connected to the community.
7. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 5, wherein, in the second stage: the entity clustering module re-initializes a new graph in which each class becomes a single node, repeats the first-stage evaluation of modularity and closeness to complete one iteration, and the termination condition refers to the evaluation model.
8. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 1, wherein: the clustering link module uses the KM algorithm to search for the best matching of the weighted bipartite graph, initializes a feasible vertex labeling, finds the cluster links, and modifies the feasible vertex labeling if no perfect matching is found; it continues searching for a perfect matching with the Hungarian algorithm until a perfect matching of the equality subgraph is found, wherein the feasible vertex labeling assigns every node in the original graph a label value through a function L(node); the labels of the nodes in set X can be recorded in an array Lx(x), and the labels of the nodes in set Y can be recorded in an array Ly(y).
9. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 8, wherein: the clustering link module uses the feasible vertex labeling of every node in the original graph, with the label value of a node given by a function L(node), records the labels of the nodes in set X in an array Lx(x) and the labels of the nodes in set Y in an array Ly(y), and, given that every edge (x, y) in the original graph satisfies Lx(x) + Ly(y) ≥ weight(x, y), uses the Hungarian algorithm to start from an unmatched point, follow an alternating path and end at another unmatched point, repeatedly finding augmenting paths and swapping the matching along them.
10. The detection system for identifying the microblog data stream of the sudden event in real time according to claim 8, wherein: the Hungarian algorithm starts from an unmatched point, follows an alternating path and ends at another unmatched point, the path having one more non-matching edge than matching edges; it repeatedly finds such augmenting paths and swaps the matching and non-matching edges along them.
CN202011566168.0A 2020-12-25 2020-12-25 Detection system for identifying microblog data stream of sudden event in real time Pending CN112597309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011566168.0A CN112597309A (en) 2020-12-25 2020-12-25 Detection system for identifying microblog data stream of sudden event in real time

Publications (1)

Publication Number Publication Date
CN112597309A true CN112597309A (en) 2021-04-02

Family

ID=75202284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011566168.0A Pending CN112597309A (en) 2020-12-25 2020-12-25 Detection system for identifying microblog data stream of sudden event in real time

Country Status (1)

Country Link
CN (1) CN112597309A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1860735A (en) * 2003-08-15 2006-11-08 捷讯研究英国有限公司 Device and method for preserving service quality levels during hand-over in radio commmunication system
CN101005416A (en) * 2006-01-18 2007-07-25 中国科学院计算技术研究所 Cross department flow coodinate method based service rule
US20110257505A1 (en) * 2010-04-20 2011-10-20 Suri Jasjit S Atheromatic?: imaging based symptomatic classification and cardiovascular stroke index estimation
CN102402612A (en) * 2011-12-20 2012-04-04 广州中长康达信息技术有限公司 Video semantic gateway
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN107147770A (en) * 2016-03-01 2017-09-08 阿里巴巴集团控股有限公司 A kind of facility information collection method, apparatus and system
CN106940732A (en) * 2016-05-30 2017-07-11 国家计算机网络与信息安全管理中心 A kind of doubtful waterborne troops towards microblogging finds method
CN105956197A (en) * 2016-06-15 2016-09-21 杭州量知数据科技有限公司 Social media graph representation model-based social risk event extraction method
CN109670051A (en) * 2018-12-14 2019-04-23 北京百度网讯科技有限公司 Knowledge mapping method for digging, device, equipment and storage medium
CN110442726A (en) * 2019-08-15 2019-11-12 电子科技大学 Social media short text on-line talking method based on physical constraints

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MIZUKI OKA: "Anomaly Detection Using Layered Networks Based on Eigen Co-occurrence Matrix", 《RECENT ADVANCES IN INTRUSION DETECTION》 *
孙莉等: "基于微博文本和元数据的话题检测", 《计算机应用与软件》 *
王生生等: "基于深度学习和复杂空间关系特征的多尺度遥感图像检索", 《东北师大学报(自然科学版)》 *
陈兴蜀等: "基于ICE-LDA模型的中英文跨语言话题发现研究", 《工程科学与技术》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076742A (en) * 2021-04-27 2021-07-06 东南大学 Entity disambiguation method based on ontology feature vocabulary in power grid monitoring field
CN114970491A (en) * 2022-08-02 2022-08-30 深圳市城市公共安全技术研究院有限公司 Text connectivity judgment method and device, electronic equipment and storage medium
CN114970491B (en) * 2022-08-02 2022-10-04 深圳市城市公共安全技术研究院有限公司 Text connectivity judgment method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210402