CN112257429B - Microblog emergency detection method based on BERT-BTM network - Google Patents

Info

Publication number
CN112257429B
Authority
CN
China
Prior art keywords
word
event
bert
emergency
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011109749.1A
Other languages
Chinese (zh)
Other versions
CN112257429A (en
Inventor
韩忠明
黄楚蓉
段大高
张翙
李俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University
Priority to CN202011109749.1A
Publication of CN112257429A
Application granted
Publication of CN112257429B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a microblog emergency detection method based on a BERT-BTM network. The method comprises: reading a microblog data set and processing it to obtain an original data set; vectorizing the original data set to obtain a vectorized word vector set, and then processing that set by calling a pre-trained BERT model to obtain a BERT word vector set; constructing a BERT-BTM model and processing the original data set through it; and constructing a BERT-BTM network and dividing it to complete emergency detection. The method addresses the sparsity of short-text data and the unresolved word ambiguity in existing microblog emergency detection methods, and improves emergency detection efficiency.

Description

Microblog emergency detection method based on BERT-BTM network
Technical Field
The invention relates to the field of text detection, in particular to a microblog-oriented emergency recognition method.
Background
Public emergencies touch many areas of modern life, including society, politics, the economy and culture, and span issues such as medical care, education, law and entertainment. Detecting emergency events not only raises public awareness but also supports related applications such as public opinion mining, emerging topic detection and topic thread tracking. It is therefore worthwhile to design a more accurate and effective emergency detection method for social network platforms such as microblogs.
Current microblog emergency detection faces several pressing problems. On the one hand, traditional methods suffer from sparse short-text features and cannot resolve word ambiguity. On the other hand, after obtaining the event topics of documents with a topic model, researchers usually apply clustering algorithms such as K-means, which require many iterations and a pre-specified number of clusters, are inefficient, and cannot complete emergency detection quickly.
Disclosure of Invention
The invention aims to provide a microblog emergency detection method based on a BERT-BTM network, so as to solve two problems of existing microblog emergency detection methods: short-text data are sparse, and word ambiguity cannot be resolved. The method comprises the following steps:
S1, reading a microblog data set, performing word segmentation on the microblog data set, and then removing stop words to obtain an original data set;
S2, vectorizing the original data set to obtain a vectorized word vector set, and then processing the vectorized word vector set by calling a pre-trained BERT model to obtain a BERT word vector set;
the BERT word vector set is the set of word vectors corresponding to the words in each microblog;
S3, constructing a BERT-BTM model according to the Dirichlet prior parameter $\alpha$ and the prior parameter $\beta_i$ fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
S4, constructing a BERT-BTM network according to the emergency word set and the co-occurrence relations among the words in the emergency word set, and then dividing the BERT-BTM network to complete emergency detection.
Preferably, step S3 includes:
S3.1, constructing the BERT-BTM model: obtaining the event distribution $\theta$ of the microblog data set according to the Dirichlet prior $\alpha$, and obtaining an event $z$ from the event distribution $\theta$;
obtaining the event-word distribution $\phi$ corresponding to the event $z$ according to the prior parameter $\beta_i$ fused with the BERT word vector set;
obtaining the 2 different words $w_i$, $w_j$ of a word pair according to the event $z$ and the event-word distribution $\phi$;
S3.2, processing the original data set with the BERT-BTM model to form word pairs;
S3.3, feeding the input data into the BERT-BTM model to obtain the output data;
the input data include the number of events, the number of iterations, $\alpha$, $\beta_i$, the word pair set, and the dictionary size;
the output data include the emergency distribution;
the number of events is the number of events $z$ in the microblog data set;
the word pair set is the set of word pairs in the original data set;
the dictionary size is the number of distinct words in the original data set.
Preferably, step S3.2 specifically includes:
S3.2.1, obtaining the event distribution of the microblog data set: $\theta \sim \mathrm{Dir}(\alpha)$;
S3.2.2, obtaining the word distribution $\phi_z$ of event $z$: $\phi_z \sim \mathrm{Dir}(\beta_i)$;
S3.2.3, obtaining the probability distribution of the word pairs and of the word pair set.
Preferably, in S3.2.3 the probability distribution of the word pairs and of the word pair set is obtained as follows:
(a) obtaining the event $z$: $z \sim \mathrm{Multi}(\theta)$;
(b) obtaining the words $w_i$, $w_j$: $w_i, w_j \sim \mathrm{Multi}(\phi_z)$;
(c) obtaining the word pair $b$ from the words $w_i$, $w_j$: $b = (w_i, w_j)$.
Preferably, step S3.3 specifically includes:
S3.3.1, randomly assigning an event (topic) to the word pair $b$;
S3.3.2, performing $N$ iterations, processing each word pair $b$ of the word pair set $B$ in every iteration;
S3.3.3, calculating the event distribution $p(z)$ and the event-word distribution $p(w \mid z)$ of the original data set:
$$p(z) = \frac{n_z + \alpha}{n_b + T\alpha}, \qquad p(w_i \mid z) = \frac{n_{w_i \mid z} + \beta_i}{\sum_{w} n_{w \mid z} + M\beta_i}$$
In the two formulas above, $n_z$ is the number of times event $z$ is assigned to a word pair $b$; $n_b$ is the number of word pairs in the original data set; $T$ is the number of events; $n_{w_i \mid z}$ is the number of times word $w_i$ is assigned to event $z$; $n_{w \mid z}$ is the number of times word $w$ is assigned to event $z$; $M$ is the dictionary size;
S3.3.4, obtaining the word pair-event distribution $p(z \mid b)$ from $p(z)$ and $p(w \mid z)$:
$$p(z \mid b) = \frac{p(z)\,p(w_i \mid z)\,p(w_j \mid z)}{\sum_{z'} p(z')\,p(w_i \mid z')\,p(w_j \mid z')}$$
where $p(w_i \mid z)$ is the probability of word $w_i$ under event $z$, and $p(w_j \mid z)$ is the probability of word $w_j$ under event $z$;
S3.3.5, calculating the document-word pair distribution $p(b \mid d)$ in the original data set:
$$p(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}$$
where $n_d(b)$ is the frequency with which word pair $b$ appears in document $d$;
the documents $d$ come from the same original data set;
S3.3.6, calculating the document-event distribution $p(z \mid d)$ from the word pair-event distribution $p(z \mid b)$ and the document-word pair distribution $p(b \mid d)$:
$$p(z \mid d) = \sum_{b} p(z \mid b)\,p(b \mid d)$$
where $p(z \mid b)$ is the word pair-event distribution, $p(b \mid d)$ is the document-word pair distribution, and $p(z \mid d)$ is the document-event distribution.
Preferably, the emergency distribution includes the document-event distribution and the event-word distribution;
from the emergency distribution, the emergency word set to which the current document corresponds is identified through the document-event distribution, and the words of that emergency word set are obtained through the event-word distribution.
Preferably, the BERT-BTM network is constructed as follows:
the BERT-BTM network is represented as a file in the NET data format;
the words in the emergency word set serve as the nodes of the network;
the co-occurrence relations among the words in the emergency word set serve as the edges between the network nodes.
Preferably, the BERT-BTM network is divided as follows: the GN algorithm repeatedly removes the edge with the highest edge betweenness, thereby dividing the BERT-BTM network;
the GN algorithm proceeds as follows:
calculating the edge betweenness of every edge in the BERT-BTM network to be mined; finding the edge with the largest edge betweenness in the network and deleting it; recalculating the edge betweenness of all remaining edges; repeating these steps until all edges have been deleted;
taking each emergency word community (emergency word set) as a cluster center, the n corresponding microblog events are clustered to obtain the final emergency clusters.
Preferably, the clustering method is single-pass clustering: the similarity S between a microblog event and an emergency word set is calculated, and when S is greater than a threshold, the microblog event is assigned to the emergency cluster of that word set.
Preferably, the similarity S is calculated as follows:
let the two word sets be $C$ and $H$, and introduce the function $R_{(C,H)}$ to express the similarity of word set $C$ relative to $H$;
the similarity of word set $H$ relative to $C$ is $R_{(H,C)}$;
the similarity of $C$ and $H$ is denoted $S_{(C,H)}$;
when the similarity $S_{(C,H)}$ between $H$ and $C$ is greater than the threshold, $H$ and $C$ are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, which completes the detection of the emergency event.
The invention has the following beneficial effects: it addresses the sparsity of short-text data and the unresolved word ambiguity in existing microblog emergency detection methods, and improves emergency detection efficiency. The technical scheme identifies microblog emergency events more accurately, which helps the relevant departments track subsequent developments in time and contain the escalation of an event.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described here show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for detecting microblog emergency events based on a BERT-BTM network;
FIG. 2 is a diagram of the BERT-BTM model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a microblog emergency detection method based on a BERT-BTM network, which specifically comprises the following steps:
Step 1: read the microblog data set. The collected data set is as follows:
[First day of the Huawei chip supply cutoff: the future of China's 300-billion-dollar chip market] On September 15, suppliers including TSMC, Qualcomm, Samsung, SK Hynix and Micron will no longer supply chips to Huawei.
[Terrifying! Drunk man falls from overpass and is caught on a driver's vehicle roof] On September 13, a drunk man in a certain city climbed over the outside of an overpass, and the situation was extremely critical. A van driver noticed something was wrong, drove the van beneath the man, and caught him at the moment he fell.
[First day of the Huawei chip supply cutoff] On the 15th, the US ban took effect; Ni Guangnan said that Huawei will not be left without chips to use.
[Drunk man falls from overpass and is caught by a driver's vehicle] A man was hanging from the outside of a five-meter-high overpass in a certain city; at the critical moment, a kind-hearted citizen stopped a minibus beneath him and caught the falling man.
[Beijing's golden-pink sunset glow] On the evening of the 15th, the Beijing sky turned golden pink in the light of the setting sun! A sky like this is truly lovely! Summer is also nearly at its end!
Noise such as HTML tags and special characters is filtered out of the text data set by regular-expression matching to obtain a cleaned text sequence. The cleaned sequence is then segmented with a word segmentation tool (here the open-source ICTCLAS word segmentation system) to obtain a segmented sequence. Stop words are then removed from the microblog data set according to a stop-word list, and the processed data are stored as the original data set.
Original data set:
Huawei/chip/supply cutoff/first day/China/chip/billion dollars/future/TSMC/Qualcomm/Samsung/Hynix/Micron/no longer/supply/chip/to/Huawei
Terrifying/drunk/man/overpass/fall/driver/roof/catch/city/drunk/man/climb/overpass/outside/situation/very/critical/van/driver/notice/abnormal/vehicle/drive/man/below/man/fall/moment/catch/hold
Huawei/chip/supply cutoff/first day/United States/ban/take effect/Ni Guangnan/say/Huawei/not/without chips/usable
drunk/man/overpass/fall/driver/roof/catch/city/man/hang/five meters/overpass/critical/moment/kind-hearted/citizen/stop/minibus/catch/fall/man
Beijing/appear/golden/pink/evening/Beijing/sky/sunset/light/appear/golden/color/sky/near/summer/end
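Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the ICTCLAS segmenter is not called, `segment` is a hypothetical whitespace-based stand-in for it, and the regular expression and the stop-word list are illustrative choices.

```python
import re

# Hypothetical stop-word list; the patent does not reproduce its own list.
STOP_WORDS = {"the", "a", "and", "to", "of", "is", "at", "by"}

# Matches HTML tags first, then any character that is neither a word
# character nor whitespace (punctuation and other special characters).
TAG_OR_SYMBOL = re.compile(r"<[^>]+>|[^\w\s]")


def clean(text):
    """Replace HTML tags and special characters with spaces."""
    return TAG_OR_SYMBOL.sub(" ", text)


def segment(text):
    """Stand-in for a word segmenter such as ICTCLAS: split on whitespace."""
    return text.lower().split()


def preprocess(microblogs):
    """Return one token list per microblog: cleaned, segmented, stop words removed."""
    return [
        [w for w in segment(clean(m)) if w not in STOP_WORDS]
        for m in microblogs
    ]


docs = preprocess(["<b>Huawei</b> chip supply cutoff: the first day!"])
print(docs)
```

For Chinese text, `segment` would be replaced by a real segmenter; the cleaning and stop-word steps stay the same.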
Step 2: vectorize the original data set to obtain a vectorized word vector set, and then process the vectorized word vector set by calling a pre-trained BERT model to obtain the BERT word vector set. The pre-trained BERT model is called from the client through an API.
The BERT word vector set is the set of word vectors corresponding to the words in each microblog event.
Step 3: construct the BERT-BTM model and process the original data set through it to obtain the emergency word set:
S3.1, the BERT-BTM topic model shown in FIG. 2 is proposed. The event distribution $\theta$ of the microblog data set is obtained according to the Dirichlet prior $\alpha$, and an event $z$ is obtained from the event distribution $\theta$; the event-word distribution $\phi$ corresponding to the event $z$ is obtained according to the prior parameter $\beta_i$ fused with the BERT word vector set; the 2 different words $w_i$, $w_j$ that form a word pair are obtained from the event $z$ and the event-word distribution $\phi$. The event-word distributions $\phi$ number $k$, the number of input events; the event $z$ and the words $w_i$, $w_j$ make up the word pair set.
S3.2, processing the original data set with the BERT-BTM model, specifically:
S3.2.1, obtaining the event distribution of the microblog data set: $\theta \sim \mathrm{Dir}(\alpha)$;
S3.2.2, obtaining the word distribution $\phi_z$ of event $z$: $\phi_z \sim \mathrm{Dir}(\beta_i)$;
S3.2.3, obtaining the word pairs:
(a) obtaining the event $z$: $z \sim \mathrm{Multi}(\theta)$;
(b) obtaining the words $w_i$, $w_j$: $w_i, w_j \sim \mathrm{Multi}(\phi_z)$;
(c) obtaining the word pair $b$ from the words $w_i$, $w_j$: $b = (w_i, w_j)$.
Calculating the probability of the word pair $b$:
$$P(b) = \sum_{z} P(z)\,P(w_i \mid z)\,P(w_j \mid z) = \sum_{z} \theta_z\,\phi_{i \mid z}\,\phi_{j \mid z}$$
where $P(b)$ is the probability of word pair $b$, $P(z) = \theta_z$ is the probability of event $z$, $P(w_i \mid z) = \phi_{i \mid z}$ is the probability of word $w_i$ under event $z$, and $P(w_j \mid z) = \phi_{j \mid z}$ is the probability of word $w_j$ under event $z$.
Calculating the probability of the word pair set $B$:
$$P(B) = \prod_{(i,j)} \sum_{z} \theta_z\,\phi_{i \mid z}\,\phi_{j \mid z}$$
where $P(B)$ is the probability of the word pair set $B$, $\theta_z$ is the probability of event $z$, $\phi_{i \mid z}$ is the probability of word $w_i$ under event $z$, and $\phi_{j \mid z}$ is the probability of word $w_j$ under event $z$.
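The word pair construction and the two probabilities of S3.2 can be sketched as follows. The sketch assumes that every pair of token positions in one short text forms a word pair and that pairs of identical words are skipped; the function names and the values of `theta` and `phi` are illustrative, not taken from the patent.

```python
from collections import Counter
from itertools import combinations


def extract_word_pairs(tokens):
    """Unordered pairs of two different words co-occurring in one short text."""
    pairs = []
    for a, b in combinations(tokens, 2):
        if a != b:  # a word pair consists of 2 different words
            pairs.append(tuple(sorted((a, b))))
    return pairs


def pair_probability(pair, theta, phi):
    """P(b) = sum over events z of theta[z] * phi[z][w_i] * phi[z][w_j]."""
    w_i, w_j = pair
    return sum(t * p[w_i] * p[w_j] for t, p in zip(theta, phi))


def pair_set_probability(pairs, theta, phi):
    """P(B) = product of P(b) over all word pairs b in the set."""
    prob = 1.0
    for pair in pairs:
        prob *= pair_probability(pair, theta, phi)
    return prob


tokens = ["Huawei", "chip", "supply", "chip"]
pairs = extract_word_pairs(tokens)
pair_counts = Counter(pairs)

# Illustrative parameters for two events (each phi row sums to 1).
theta = [0.5, 0.5]
phi = [
    {"Huawei": 0.5, "chip": 0.4, "supply": 0.1},
    {"Huawei": 0.1, "chip": 0.1, "supply": 0.8},
]
p_b = pair_probability(("Huawei", "chip"), theta, phi)
p_B = pair_set_probability(pairs, theta, phi)
print(pair_counts, p_b, p_B)
```

Because the pairs are built over the whole short text rather than over single words, each text contributes many more co-occurrence observations than it has words, which is how the pair-based model eases short-text sparsity.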
S3.3, the parameters $\theta$ and $\phi$ of the BERT-BTM model are inferred by Gibbs sampling. Gibbs sampling is an efficient Markov chain Monte Carlo (MCMC) sampling method that samples from a joint distribution by drawing each variable in turn from its conditional distribution. The BERT-BTM model infers the document-event distribution as follows:
Input data: the number of events, the number of iterations, $\alpha$, $\beta_i$, the word pair set, and the dictionary size;
the number of events is the number of events $z$ in the microblog data set;
the word pair set is the set of word pairs in the original data set;
Output data: the document-event distribution. The specific steps are:
S3.3.1, randomly assigning an event (topic) to the word pair $b$;
S3.3.2, performing $N$ iterations, processing each word pair $b$ of the word pair set $B$ in every iteration;
calculating the conditional probability distribution of the word pair $b = (w_i, w_j)$:
$$P(z \mid z_{-b}, B, \alpha, \beta_i) \propto (n_z + \alpha)\,\frac{(n_{w_i \mid z} + \beta_i)\,(n_{w_j \mid z} + \beta_i)}{\left(\sum_{w} n_{w \mid z} + M\beta_i\right)^{2}}$$
where $z$ is the event assignment of the word pair $b$, and $z_{-b}$ denotes the event assignments of all word pairs in the set $B$ except $b$; $n_z$ is the number of times event $z$ is assigned to a word pair; $n_{w_i \mid z}$ is the number of times word $w_i$ is assigned to event $z$; $n_{w \mid z}$ is the number of times word $w$ is assigned to event $z$; $M$ is the dictionary size, i.e. the number of distinct words in the original data set.
Updating the counts $n_z$, $n_{w_i \mid z}$ and $n_{w_j \mid z}$.
S3.3.3, calculating the event distribution $p(z)$ and the event-word distribution $p(w \mid z)$ of the original data set:
$$p(z) = \frac{n_z + \alpha}{n_b + T\alpha}, \qquad p(w_i \mid z) = \frac{n_{w_i \mid z} + \beta_i}{\sum_{w} n_{w \mid z} + M\beta_i}$$
In the two formulas above, $n_z$ is the number of times event $z$ is assigned to a word pair $b$; $n_b$ is the number of word pairs in the original data set; $T$ is the number of events; $n_{w_i \mid z}$ is the number of times word $w_i$ is assigned to event $z$; $n_{w \mid z}$ is the number of times word $w$ is assigned to event $z$; $M$ is the dictionary size.
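Steps S3.3.1 to S3.3.3 can be sketched as a collapsed Gibbs sampler. This is a minimal sketch that assumes the standard BTM update and uses a single scalar `beta` in place of the BERT-fused prior, whose exact construction is not given in this text; `gibbs_btm` and its parameter names are hypothetical.

```python
import random


def gibbs_btm(pairs, vocab, T, n_iter, alpha, beta, seed=0):
    """Return (p_z, p_w_z) estimated from event assignments of word pairs."""
    rng = random.Random(seed)
    M = len(vocab)                                       # dictionary size
    n_z = [0] * T                                        # pairs per event
    n_wz = [dict.fromkeys(vocab, 0) for _ in range(T)]   # word counts per event
    assign = []

    # S3.3.1: randomly assign an event to every word pair.
    for w_i, w_j in pairs:
        z = rng.randrange(T)
        assign.append(z)
        n_z[z] += 1
        n_wz[z][w_i] += 1
        n_wz[z][w_j] += 1

    # S3.3.2: N sweeps over all word pairs.
    for _ in range(n_iter):
        for idx, (w_i, w_j) in enumerate(pairs):
            z = assign[idx]
            # Remove the current assignment of this pair from the counts.
            n_z[z] -= 1
            n_wz[z][w_i] -= 1
            n_wz[z][w_j] -= 1
            weights = []
            for k in range(T):
                # Each pair contributes two words, so sum_w n_{w|k} = 2 * n_k.
                total = 2 * n_z[k] + M * beta
                weights.append(
                    (n_z[k] + alpha)
                    * (n_wz[k][w_i] + beta)
                    * (n_wz[k][w_j] + beta)
                    / (total * total)
                )
            z = rng.choices(range(T), weights=weights)[0]
            # Update the counts with the newly sampled event.
            assign[idx] = z
            n_z[z] += 1
            n_wz[z][w_i] += 1
            n_wz[z][w_j] += 1

    # S3.3.3: smoothed estimates of p(z) and p(w|z).
    n_b = len(pairs)
    p_z = [(n_z[k] + alpha) / (n_b + T * alpha) for k in range(T)]
    p_w_z = [
        {w: (n_wz[k][w] + beta) / (2 * n_z[k] + M * beta) for w in vocab}
        for k in range(T)
    ]
    return p_z, p_w_z


pairs = [("Huawei", "chip"), ("chip", "supply"),
         ("drunk", "overpass"), ("driver", "overpass")]
vocab = sorted({w for pair in pairs for w in pair})
p_z, p_w_z = gibbs_btm(pairs, vocab, T=2, n_iter=50, alpha=1.0, beta=0.01)
print(p_z)
```

A per-word prior could replace the scalar `beta` by looking up a word-specific value in each numerator and summing those values in the denominator; that variant is not shown because the source does not define it here.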
S3.3.4, calculating the word pair-event distribution $p(z \mid b)$ from $p(z)$ and $p(w \mid z)$:
$$p(z \mid b) = \frac{p(z)\,p(w_i \mid z)\,p(w_j \mid z)}{\sum_{z'} p(z')\,p(w_i \mid z')\,p(w_j \mid z')}$$
where $p(w_i \mid z)$ is the probability of word $w_i$ under event $z$, and $p(w_j \mid z)$ is the probability of word $w_j$ under event $z$;
S3.3.5, calculating the document-word pair distribution $p(b \mid d)$ in the original data set:
$$p(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}$$
where $n_d(b)$ is the frequency with which word pair $b$ appears in document $d$;
the documents $d$ come from the same original data set;
S3.3.6, calculating the document-event distribution $p(z \mid d)$ from the word pair-event distribution $p(z \mid b)$ and the document-word pair distribution $p(b \mid d)$:
$$p(z \mid d) = \sum_{b} p(z \mid b)\,p(b \mid d)$$
where $p(z \mid b)$ is the word pair-event distribution, $p(b \mid d)$ is the document-word pair distribution, and $p(z \mid d)$ is the document-event distribution.
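Steps S3.3.4 to S3.3.6 can be sketched as follows: each word pair of a document is scored against every event by Bayes' rule, and the scores are averaged with the pair frequencies as weights. The function name and the toy distributions are illustrative assumptions; word pairs are formed as all pairs of different words in the document.

```python
from collections import Counter
from itertools import combinations


def doc_event_distribution(tokens, p_z, p_w_z):
    """p(z|d) = sum over word pairs b of p(z|b) * p(b|d)."""
    # n_d(b): frequency of each word pair in the document.
    n_d = Counter(
        tuple(sorted((a, b))) for a, b in combinations(tokens, 2) if a != b
    )
    n_total = sum(n_d.values())
    T = len(p_z)
    p_z_d = [0.0] * T
    for (w_i, w_j), n in n_d.items():
        # S3.3.4: p(z|b) is proportional to p(z) * p(w_i|z) * p(w_j|z).
        joint = [p_z[k] * p_w_z[k][w_i] * p_w_z[k][w_j] for k in range(T)]
        norm = sum(joint)
        if norm == 0.0:
            continue
        # S3.3.5 and S3.3.6: weight p(z|b) by p(b|d) = n_d(b) / sum_b n_d(b).
        for k in range(T):
            p_z_d[k] += (joint[k] / norm) * (n / n_total)
    return p_z_d


# Illustrative two-event model.
p_z = [0.5, 0.5]
p_w_z = [
    {"Huawei": 0.45, "chip": 0.45, "drunk": 0.05, "overpass": 0.05},
    {"Huawei": 0.05, "chip": 0.05, "drunk": 0.45, "overpass": 0.45},
]
dist = doc_event_distribution(["Huawei", "chip", "chip"], p_z, p_w_z)
print(dist)
```

A document with fewer than two different words has no word pairs, and the sketch then returns an all-zero vector rather than guessing an event.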
The emergency distribution includes the document-event distribution and the event-word distribution.
The word vector set is mapped to an event vector set to obtain the emergency distribution. From the emergency distribution, the emergency word set to which the current document corresponds is identified through the document-event distribution, and the words of that emergency word set are obtained through the event-word distribution. Examples are given in Tables 1 and 2:
TABLE 1
TABLE 2
The optimal number of topics, k=3, is determined by perplexity, and the document-event distribution, the event distribution and the event-word distribution are then obtained (for each event only the 3 words with the largest proportion are kept).
Step 4: construct the BERT-BTM network according to the emergency word set and the co-occurrence relations among the words in the emergency word set, and then divide the BERT-BTM network to complete emergency detection.
The BERT-BTM network construction method is specifically as follows.
The BERT-BTM network is constructed by taking the words of the emergency word set obtained from the BERT-BTM model as the nodes of the network and the co-occurrence relations between those words as its edges. The network is represented as a NET file, a data format commonly used for complex networks, which defines all nodes and edges of the network. The NET file has two parts, Vertices and Edges: Vertices describes the nodes of the BERT-BTM network, and Edges describes the edges between those nodes. Assuming {A, B, C} is the emergency word set obtained from the microblog data set, its NET representation has the structure shown in Tables 3 and 4.
Table 3
Vertices
Node ID    Node label
1          A
2          B
3          C
Table 4
Edges
Start node ID    End node ID
1                2
1                3
2                3
The emergency word sets obtained from the microblog data set are merged into a node set VerticeSet and an edge set EdgesSet, and the two sets are written in turn to a NET file to obtain the BERT-BTM network, as shown in Tables 5 and 6.
Table 5
Vertices
Node ID    Node label
1          Huawei
2          Chip
3          Drunk
n          Fall
Table 6
Edges
Start node ID    End node ID
1                2
1                5
1                13
n                9
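The construction of the node set, the edge set and the NET text can be sketched as follows. The sketch assumes that two words co-occur when they belong to the same emergency word set and that the file uses the Pajek-style layout with `*Vertices` and `*Edges` section headers; `build_network` and `to_net` are hypothetical helper names.

```python
from itertools import combinations


def build_network(word_sets):
    """Nodes are words; edges join words that co-occur in the same word set."""
    node_id = {}
    edges = set()
    for words in word_sets:
        for w in words:
            # Assign IDs 1, 2, 3, ... in order of first appearance.
            node_id.setdefault(w, len(node_id) + 1)
        for u, v in combinations(sorted(set(words), key=node_id.get), 2):
            edges.add((node_id[u], node_id[v]))
    return node_id, sorted(edges)


def to_net(node_id, edges):
    """Serialize the node set and the edge set as NET text."""
    lines = [f"*Vertices {len(node_id)}"]
    for word, i in sorted(node_id.items(), key=lambda kv: kv[1]):
        lines.append(f'{i} "{word}"')
    lines.append("*Edges")
    for u, v in edges:
        lines.append(f"{u} {v}")
    return "\n".join(lines) + "\n"


node_id, edges = build_network([["A", "B", "C"]])
net_text = to_net(node_id, edges)
print(net_text)
```

For the word set {A, B, C} this produces the three nodes and three edges listed in Tables 3 and 4.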
The network is divided with the GN (Girvan-Newman) algorithm to discover emergency events. The specific method is as follows:
When performing the event detection task, the GN algorithm partitions the network by repeatedly removing the edge with the highest edge betweenness. The GN algorithm proceeds as follows:
calculating the edge betweenness of every edge in the BERT-BTM network to be mined; finding the edge with the largest edge betweenness in the network and deleting it; recalculating the edge betweenness of all remaining edges; repeating these steps until all edges have been deleted;
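The GN procedure above can be sketched in pure Python as follows. The sketch computes edge betweenness for an unweighted graph with a breadth-first (Brandes-style) accumulation. As an assumption for illustration it stops once a requested number of communities is reached instead of deleting every edge; all function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph {node: set(nodes)}."""
    bet = {}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = tuple(sorted((v, w)))
                bet[edge] = bet.get(edge, 0.0) + c
                delta[v] += c
    # Every shortest path was counted once from each of its two endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman(adj, n_communities):
    """Delete highest-betweenness edges until n_communities components exist."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while len(components(adj)) < n_communities and any(adj.values()):
        bet = edge_betweenness(adj)               # recomputed after each removal
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the bridge 3-4; the bridge is removed first.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
communities = girvan_newman(graph, 2)
print(communities)
```

Recomputing the betweenness after every removal is what the procedure prescribes and is also its main cost, since each recomputation visits the whole remaining network.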
After the emergency word communities are obtained with the GN algorithm, the words in each community (an emergency word set) serve as a cluster center, and the n corresponding microblog events are clustered to find the microblog cluster belonging to each emergency event. Clustering uses the single-pass method: the similarity S between a microblog event and an emergency word set is calculated, and when S is greater than a threshold, the microblog is considered to describe that emergency event.
Let $C$ and $H$ be two word sets, $C = \{c_1, c_2, c_3, \ldots, c_t\}$ and $H = \{h_1, h_2, h_3, \ldots, h_m\}$. To calculate the similarity of the two word sets, a function $R_{(C,H)}$ is introduced to express the similarity of word set $C$ relative to $H$, as follows:
Further, the similarity $S_{(C,H)}$ of $C$ and $H$ is defined as:
When the similarity $S_{(C,H)}$ between $H$ and $C$ is greater than a given threshold, $H$ and $C$ are considered similar, and microblog texts whose similarity exceeds the threshold are assigned to the same microblog emergency cluster, which completes the detection of the microblog emergency event. The results are exemplified as follows:
The cluster of each microblog is given by a label from 1 to 3 (label 1 corresponds to the first cluster, and so on):
microblogs 1 and 3 describe event 1; microblogs 2 and 4 describe event 2; microblog 5 describes event 3, as shown in Table 7.
Table 7
Microblog number    Cluster label
1                   1
2                   2
3                   1
4                   2
5                   3
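The single-pass assignment can be sketched as follows. The formulas for R and S are not reproduced in this text, so the sketch assumes that R of C relative to H is the share of the words in C that also occur in H, and that S is the mean of the two directed scores. The event word sets, the token lists and the threshold are illustrative values chosen so that the labels follow the pattern of Table 7.

```python
def r(c, h):
    """Assumed R(C,H): share of the words in C that also occur in H."""
    c, h = set(c), set(h)
    return len(c & h) / len(c) if c else 0.0


def similarity(c, h):
    """Assumed S(C,H): mean of the two directed scores R(C,H) and R(H,C)."""
    return (r(c, h) + r(h, c)) / 2.0


def assign_clusters(microblogs, event_word_sets, threshold):
    """Label each microblog with its most similar event, or None below threshold."""
    labels = []
    for tokens in microblogs:
        scores = [similarity(tokens, ws) for ws in event_word_sets]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best + 1 if scores[best] > threshold else None)
    return labels


# Hypothetical event word sets (cluster centers).
event_word_sets = [
    {"Huawei", "chip", "supply"},
    {"drunk", "man", "overpass"},
    {"Beijing", "sky", "golden"},
]
# Shortened token lists for the five example microblogs.
microblogs = [
    ["Huawei", "chip", "supply", "first", "day"],
    ["drunk", "man", "overpass", "driver"],
    ["Huawei", "chip", "ban"],
    ["man", "overpass", "minibus"],
    ["Beijing", "sky", "golden", "sunset"],
]
labels = assign_clusters(microblogs, event_word_sets, threshold=0.3)
print(labels)
```

A microblog whose best score does not exceed the threshold is left unlabeled in this sketch; how such microblogs are handled is not stated in the text.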
The emergency event described by each cluster is represented by several feature words, as shown in Table 8:
TABLE 8
The invention has the following beneficial effects: it addresses the sparsity of short-text data and the unresolved word ambiguity in existing microblog emergency detection methods, and improves emergency detection efficiency. The technical scheme identifies microblog emergency events more accurately, which helps the relevant departments track subsequent developments in time and contain the escalation of an event.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. The method for detecting the microblog emergency event based on the BERT-BTM network is characterized by comprising the following steps of:
s1, reading a microblog data set, performing word segmentation on the microblog data set, and then removing stop words to obtain an original data set;
s2, carrying out vectorization on the original data set to obtain a vectorized word vector set, and then, carrying out processing on the vectorized word vector set by calling a pre-training BERT model to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
s3, according to the dirichlet priori parameter alpha and the priori parameter beta fused with the BERT word vector set i Constructing a BERT-BTM model, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, constructing a BERT-BTM network according to the emergency word set and the co-occurrence relation among words in the emergency word set, and then dividing the BERT-BTM network to finish emergency detection;
the step S3 includes:
S3.1, constructing the BERT-BTM model: calculating the event distribution $\theta$ of the microblog data set according to the Dirichlet prior parameter $\alpha$, and calculating the event $z$ corresponding to the event distribution $\theta$ according to the event distribution $\theta$;
calculating the event-word distribution $\phi$ corresponding to the event $z$ according to the prior parameter $\beta_i$ fused with the BERT word vector set;
calculating the two different words $w_i$, $w_j$ of a word pair according to the event $z$ and the event-word distribution $\phi$;
S3.2, processing the original data set with the BERT-BTM model to form word pairs;
S3.3, inputting input data into the BERT-BTM model to obtain output data;
the input data includes the number of events, the number of iterations, $\alpha$, $\beta_i$, the word pair set, and the dictionary size;
the output data is the emergency distribution;
the number of events is the number of events $z$ in the microblog data set;
the word pair set is the set of word pairs in the original data set;
the dictionary size is the number of distinct words in the original data set;
the step S3.3 specifically includes:
S3.3.1, randomly assigning a topic to each word pair $b$;
S3.3.2, performing $N$ iterations, processing each word pair $b$ of the word pair set $B$ in every iteration;
S3.3.3, calculating the event distribution $p(z)$ and the event-word distribution $p(w \mid z)$:
$$p(z) = \frac{n_z + \alpha}{n_b + T\alpha}, \qquad p(w_i \mid z) = \frac{n_{w_i \mid z} + \beta_i}{\sum_{w} n_{w \mid z} + M\beta_i}$$
in the two formulas above, $n_z$ denotes the number of times event $z$ is assigned to a word pair $b$; $n_b$ denotes the number of word pairs in the original data set; $T$ denotes the number of events; $n_{w_i \mid z}$ denotes the number of times the word $w_i$ is assigned to event $z$; $n_{w \mid z}$ denotes the number of times a word $w$ is assigned to event $z$; $M$ denotes the dictionary size;
S3.3.4, obtaining the word pair-event distribution $p(z \mid b)$ from $p(z)$ and $p(w \mid z)$:
$$p(z \mid b) = \frac{p(z)\, p(w_i \mid z)\, p(w_j \mid z)}{\sum_{z'} p(z')\, p(w_i \mid z')\, p(w_j \mid z')}$$
where $p(w_i \mid z)$ is the probability of the word $w_i$ under event $z$, and $p(w_j \mid z)$ is the probability of the word $w_j$ under event $z$;
S3.3.5, calculating the document-word pair distribution $p(b \mid d)$:
$$p(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}$$
where $n_d(b)$ is the frequency with which the word pair $b$ appears in document $d$;
the document $d$ and the original data set are the same data set;
S3.3.6, calculating the document-event distribution $P(z \mid d)$ from the word pair-event distribution $P(z \mid b)$ and the document-word pair distribution $P(b \mid d)$:
$$P(z \mid d) = \sum_{b} P(z \mid b)\, P(b \mid d)$$
where $P(z \mid b)$ is the word pair-event distribution, $P(b \mid d)$ is the document-word pair distribution, and $P(z \mid d)$ is the document-event distribution.
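As an illustration only, the estimates of steps S3.3.3 to S3.3.6 can be sketched in Python. The sketch assumes the standard smoothed-count estimators of the biterm topic model, replaces the BERT-fused prior $\beta_i$ with a single scalar `beta`, and takes the event assignments of the word pairs as already sampled. All function and parameter names are hypothetical and are not taken from the patent.

```python
from collections import Counter
from itertools import combinations


def estimate_distributions(assignments, alpha, beta, num_events, vocab_size):
    """assignments: list of ((w_i, w_j), z), one entry per word pair b."""
    n_b = len(assignments)                      # number of word pairs
    n_z = Counter(z for _, z in assignments)    # word pairs assigned to each event
    n_wz = Counter()                            # times word w is assigned to event z
    for (w_i, w_j), z in assignments:
        n_wz[(w_i, z)] += 1
        n_wz[(w_j, z)] += 1

    p_z = [(n_z[z] + alpha) / (n_b + num_events * alpha)
           for z in range(num_events)]

    def p_w_given_z(w, z):
        # Every word pair contributes two word tokens to its event.
        total = 2 * n_z[z]
        return (n_wz[(w, z)] + beta) / (total + vocab_size * beta)

    return p_z, p_w_given_z


def event_given_pair(pair, p_z, p_w_given_z):
    """Word pair-event distribution p(z|b), normalised over events."""
    w_i, w_j = pair
    scores = [p_z[z] * p_w_given_z(w_i, z) * p_w_given_z(w_j, z)
              for z in range(len(p_z))]
    norm = sum(scores)
    return [s / norm for s in scores]


def event_given_document(doc_words, p_z, p_w_given_z):
    """Document-event distribution p(z|d) = sum_b p(z|b) p(b|d)."""
    pairs = Counter(tuple(sorted((a, b)))
                    for a, b in combinations(doc_words, 2) if a != b)
    total = sum(pairs.values())
    p_zd = [0.0] * len(p_z)   # stays all zero if the document has no word pair
    for pair, count in pairs.items():
        p_bd = count / total  # document-word pair distribution p(b|d)
        for z, p_zb in enumerate(event_given_pair(pair, p_z, p_w_given_z)):
            p_zd[z] += p_zb * p_bd
    return p_zd
```

A document with fewer than two distinct words yields no word pair, so the sketch returns an all-zero vector for it; how such documents are handled in the claimed method is not stated in this text.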
2. The microblog emergency detection method based on the BERT-BTM network according to claim 1, characterized in that the step S3.2 specifically includes:
S3.2.1, obtaining the event distribution of the microblog data set: $\theta \sim \mathrm{Dir}(\alpha)$;
S3.2.2, obtaining the event-word distribution of event $z$: $\phi_z \sim \mathrm{Dir}(\beta_i)$;
S3.2.3, obtaining word pairs.
3. The microblog emergency detection method based on the BERT-BTM network according to claim 1, characterized in that
the step S3.2.3 of obtaining word pairs includes:
(a) obtaining an event $z$: $z \sim \mathrm{Multi}(\theta)$;
(b) obtaining the words $w_i$, $w_j$: $w_i, w_j \sim \mathrm{Multi}(\phi_z)$;
(c) obtaining a word pair $b$ from the words $w_i$, $w_j$: $b = (w_i, w_j)$.
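The generative steps of claims 2 and 3 can be illustrated with a short sampling sketch that uses only the Python standard library. It assumes symmetric Dirichlet priors with scalar parameters `alpha` and `beta` (the BERT-fused $\beta_i$ is not modelled), and every name in it is hypothetical.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index from a multinomial with a single trial."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_word_pairs(num_pairs, num_events, vocab, alpha, beta, seed=0):
    rng = random.Random(seed)
    # S3.2.1: event distribution theta ~ Dir(alpha)
    theta = sample_dirichlet([alpha] * num_events, rng)
    # S3.2.2: one event-word distribution phi_z ~ Dir(beta) per event
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_events)]
    pairs = []
    for _ in range(num_pairs):
        z = sample_categorical(theta, rng)         # (a) z ~ Multi(theta)
        i = sample_categorical(phi[z], rng)        # (b) w_i ~ Multi(phi_z)
        j = sample_categorical(phi[z], rng)        #     w_j ~ Multi(phi_z)
        pairs.append((z, (vocab[i], vocab[j])))    # (c) b = (w_i, w_j)
    return theta, phi, pairs
```

The sketch does not force the two drawn words to differ; a caller who needs two different words per pair would have to redraw or filter such pairs.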
4. The microblog emergency detection method based on the BERT-BTM network according to claim 1, characterized in that
the emergency distribution includes the document-event distribution and the event-word distribution;
according to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
5. The microblog emergency detection method based on the BERT-BTM network according to claim 1, characterized in that
the method for constructing the BERT-BTM network includes:
the BERT-BTM network is represented as a file in the NET data format;
the words in the emergency word set serve as the nodes of the network;
the co-occurrence relations among the words in the emergency word set serve as the edges between the network nodes.
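A minimal sketch of this network construction follows. It rests on two assumptions that the text does not confirm: that two words co-occur when they appear in the same microblog, and that the NET file is the Pajek `.net` format with a `*Vertices` section followed by an `*Edges` section. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(documents, event_words):
    """Count, for each pair of emergency words, the microblogs containing both."""
    keep = set(event_words)
    edges = Counter()
    for words in documents:
        present = sorted(keep.intersection(words))
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def to_pajek_net(event_words, edges):
    """Serialise nodes and weighted edges as Pajek .net text (assumed format)."""
    nodes = sorted(set(event_words))
    index = {w: i for i, w in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{index[w]} "{w}"' for w in nodes]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {weight}"
              for (a, b), weight in sorted(edges.items())]
    return "\n".join(lines) + "\n"
```

Using the co-occurrence count as the edge weight is a design choice of this sketch; an unweighted network would simply omit the third column.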
6. The microblog emergency detection method based on the BERT-BTM network according to claim 1, characterized in that
the method for dividing the BERT-BTM network includes: using the GN algorithm to repeatedly remove the edge with the highest edge betweenness, thereby dividing the BERT-BTM network;
the GN algorithm proceeds as follows: calculating the edge betweenness of every edge in the BERT-BTM network to be mined; finding the edge with the largest edge betweenness in the BERT-BTM network and deleting it; recalculating the edge betweenness of all remaining edges; repeating these steps until all edges are deleted;
with the emergency word set of each emergency word community taken as a clustering center point, the corresponding $n$ microblog events are clustered to obtain the final emergency clusters.
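The GN (Girvan-Newman) division can be sketched as follows for an unweighted, undirected network without self-loops. Edge betweenness is computed with a breadth-first shortest-path accumulation in the style of Brandes. As a simplification that is not part of the claim, the sketch stops once a requested number of communities is reached instead of deleting every edge; all names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """adj maps each node to the set of its neighbours."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                       # BFS that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):          # push path counts back toward s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each path is counted from both endpoints; the ranking is unaffected.
    return score


def components(adj):
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        parts.append(comp)
    return parts


def girvan_newman(edges, target_communities):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    while len(components(adj)) < target_communities:
        score = edge_betweenness(adj)      # recomputed after every deletion
        if not score:
            break                          # no edges left to delete
        a, b = max(score, key=score.get)   # edge with the largest betweenness
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)
```

For two triangles joined by a single bridge, the bridge carries the most shortest paths and is removed first, which leaves the two triangles as separate word communities.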
7. The microblog emergency detection method based on the BERT-BTM network according to claim 6, characterized in that
the clustering method is single-side clustering: the similarity $S$ between a microblog event and the emergency word set is calculated, and when $S$ is greater than a threshold, the microblog event is an emergency belonging to the corresponding emergency cluster.
8. The microblog emergency detection method based on the BERT-BTM network according to claim 7, characterized in that
the similarity $S$ is calculated as follows: let the two word sets be $C$ and $H$; the similarity introduction function of the word set $C$ relative to $H$ is $R_{(C,H)}$;
the similarity introduction function of the word set $H$ relative to $C$ is $R_{(H,C)}$;
the similarity of $C$ and $H$ is $S_{(C,H)}$;
when the similarity $S_{(C,H)}$ of $H$ and $C$ exceeds a given threshold, $H$ is considered similar to $C$, and events whose similarity is greater than the threshold are assigned to the same emergency cluster, completing the detection of emergencies.
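The threshold-based assignment of claims 7 and 8 can be sketched as below. The expressions for $R_{(C,H)}$, $R_{(H,C)}$ and $S_{(C,H)}$ are not reproduced in this text, so the sketch substitutes Jaccard similarity as a stand-in; it is not the similarity defined by the patent. Assigning each microblog to its single best-matching center is likewise an assumption, and all names are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity between two word sets (not the patent's S)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_to_clusters(microblogs, centers, threshold, similarity=jaccard):
    """centers maps a cluster name to its emergency word set."""
    clusters = {name: [] for name in centers}
    unassigned = []
    for idx, words in enumerate(microblogs):
        best_name, best_score = None, 0.0
        for name, center_words in centers.items():
            s = similarity(words, center_words)
            if s > best_score:
                best_name, best_score = name, s
        # Keep the microblog only if its best similarity exceeds the threshold.
        if best_name is not None and best_score > threshold:
            clusters[best_name].append(idx)
        else:
            unassigned.append(idx)
    return clusters, unassigned
```

Any other set similarity with the same call signature can be passed through the `similarity` parameter, which is where the patent's own $S_{(C,H)}$ would be plugged in once its definition is available.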
CN202011109749.1A 2020-10-16 2020-10-16 Microblog emergency detection method based on BERT-BTM network Active CN112257429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011109749.1A CN112257429B (en) 2020-10-16 2020-10-16 Microblog emergency detection method based on BERT-BTM network

Publications (2)

Publication Number Publication Date
CN112257429A (en) 2021-01-22
CN112257429B (en) 2024-04-16

Family

ID=74244414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011109749.1A Active CN112257429B (en) 2020-10-16 2020-10-16 Microblog emergency detection method based on BERT-BTM network

Country Status (1)

Country Link
CN (1) CN112257429B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032557B (en) * 2021-02-09 2024-03-29 北京工业大学 Microblog hot topic discovery method based on frequent word sets and BERT semantics
CN117520484B (en) * 2024-01-04 2024-04-16 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model
CN104573031A (en) * 2015-01-14 2015-04-29 哈尔滨工业大学深圳研究生院 Micro blog emergency detection method
CN106611054A (en) * 2016-12-26 2017-05-03 电子科技大学 Method for extracting enterprise behavior or event from massive texts
CN107273496A (en) * 2017-06-15 2017-10-20 淮海工学院 A kind of detection method of micro blog network region accident

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
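Pansystems operations research: the changing times and the world's new revolutions in science and technology, military affairs, and education; Wu Xuemou; Computer & Digital Engineering (No. 12); full text *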

Also Published As

Publication number Publication date
CN112257429A (en) 2021-01-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant