CN112257429A - BERT-BTM network-based microblog emergency detection method - Google Patents


Info

Publication number
CN112257429A
CN112257429A (application CN202011109749.1A)
Authority
CN
China
Prior art keywords: word, event, bert, emergency, btm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011109749.1A
Other languages
Chinese (zh)
Other versions
CN112257429B (en)
Inventor
韩忠明
黄楚蓉
段大高
张翙
李俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202011109749.1A priority Critical patent/CN112257429B/en
Publication of CN112257429A publication Critical patent/CN112257429A/en
Application granted granted Critical
Publication of CN112257429B publication Critical patent/CN112257429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F16/374 Thesaurus
    • G06F40/216 Parsing using statistical methods
    • G06F40/242 Dictionaries
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention discloses a method for detecting microblog emergencies based on a BERT-BTM network, which comprises: reading a microblog data set and processing it to obtain an original data set; vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain a BERT word vector set; constructing a BERT-BTM model and processing the original data set through the BERT-BTM model; and constructing a BERT-BTM network and then dividing the BERT-BTM network to complete the detection of emergencies. The method solves the problems that short-text data are sparse and word ambiguity cannot be resolved in conventional microblog emergency detection methods, and improves the efficiency of emergency detection.

Description

BERT-BTM network-based microblog emergency detection method
Technical Field
The invention relates to the field of text detection, in particular to a microblog-oriented emergency identification method.
Background
With the rapid development of information technology in China, social network platforms such as Weibo (microblog), Twitter and Facebook have become the main sources and important media of big data and emergency information, and these platforms have many times been the first publishers of major emergencies such as natural disasters and terrorist incidents. Emergent public events relate to the social, political, economic and cultural fields of modern life and cover many issues including medical treatment, education, law and entertainment. Detecting emergencies can not only raise public attention but also benefit related applications such as public opinion mining, emerging topic detection and topic clue tracking. Based on the above, it is significant to design a more accurate and effective method for detecting emergencies on social network platforms such as microblogs.
The current microblog emergency detection task has several problems to be solved urgently. On the one hand, traditional methods suffer from sparse short-text features and cannot resolve word ambiguity. On the other hand, after a topic model is used to obtain the event topics of documents, researchers usually apply a clustering algorithm such as K-means, which requires many iterations and a pre-specified number of clusters, is inefficient, and cannot complete emergency detection quickly.
Disclosure of Invention
The invention aims to provide a BERT-BTM network-based microblog emergency detection method, so as to solve the problems that short-text data are sparse and word ambiguity cannot be resolved in conventional microblog emergency detection methods. The disclosed BERT-BTM network-based microblog emergency detection method comprises the following steps:
s1, reading a microblog data set, performing word segmentation processing on the microblog data set, and then removing stop words to obtain an original data set;
s2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-training BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
s3, constructing a BERT-BTM model according to the Dirichlet prior parameter alpha and the prior parameter beta_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, according to the co-occurrence relation between the words in the emergency word set and the words in the emergency word set, building a BERT-BTM network, and then dividing the BERT-BTM network to complete the detection of the emergency.
Preferably, the step S3 includes:
s3.1, constructing a BERT-BTM model: calculating the event distribution theta in the microblog data set according to the Dirichlet prior parameter alpha, and calculating the event z corresponding to the event distribution theta;
calculating the event-word distribution phi corresponding to the event z according to the prior parameter beta_i fused with the BERT word vector set;
calculating the 2 different words w_i, w_j of a word pair according to the event z and the event-word distribution phi;
s3.2, processing the original data set by using the BERT-BTM model to form word pairs;
s3.3, inputting the input data into the BERT-BTM model to obtain output data;
the input data comprises the number of events, the number of iterations, alpha, beta_i, the word pair set, and the dictionary size;
the output data comprises an emergency distribution;
the input number of events is the number of events z in the microblog data set;
the word pair set is the set of word pairs in the original data set;
the dictionary size is the number of distinct words in the original data set.
Preferably, the step S3.2 specifically includes:
s3.2.1, obtaining the event distribution theta of the microblog data set: theta ~ Dir(alpha);
s3.2.2, obtaining the word distribution phi_z of event z: phi_z ~ Dir(beta_i);
s3.2.3, obtaining the probability distribution of the word pairs and the word pair set.
Preferably, the S3.2.3 method for obtaining the probability distribution of the word pairs and the word pair set includes:
(a) obtaining an event z: z ~ Multi(theta);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(phi_z);
(c) obtaining a word pair b from the words w_i, w_j: b = (w_i, w_j).
Preferably, the step S3.3 specifically includes:
s3.3.1, randomly assigning an event to each word pair b;
s3.3.2, carrying out N iterations, processing each word pair b in the word pair set B;
s3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + alpha) / (n_B + T·alpha)

p(w|z) = (n_{w|z} + beta_i) / (n_{·|z} + M·beta_i)

In the above two formulas, n_z represents the number of times event z is assigned to a word pair b; n_B represents the number of word pairs in the original data set; T represents the number of events; n_{w|z} represents the number of times word w is assigned to event z; n_{·|z} represents the total number of word assignments to event z; M represents the dictionary size.
s3.3.4, according to p(z) and p(w|z), obtaining the word pair-event distribution p(z|b):

p(z|b) = p(z)·p(w_i|z)·p(w_j|z) / Σ_{z'} p(z')·p(w_i|z')·p(w_j|z')

where p(w_i|z) represents the probability of word w_i under event z, and p(w_j|z) represents the probability of word w_j under event z;
s3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
s3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b)·P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
Preferably, the emergency distribution includes the document-event distribution and the event-word distribution;
according to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
Preferably, the method for constructing the BERT-BTM network comprises the following steps:
the BERT-BTM network is represented by a data file in NET format;
the words in the emergency word set are used as nodes in the network;
and taking the co-occurrence relation between the words in the emergency word set as the edges between the network nodes.
Preferably, the method for dividing the BERT-BTM network comprises the following steps: continuously removing the edge with the highest edge betweenness by using a GN algorithm to divide the BERT-BTM network;
the GN algorithm flow is as follows:
sequentially calculating edge betweenness of each edge in the BERT-BTM network to be mined; finding out an edge with the maximum edge betweenness in the BERT-BTM network and then deleting the edge; recalculating edge betweenness of all the remaining edges; repeating the steps until all edges are deleted;
after the emergency word communities are obtained, each emergency word set is taken as a clustering center point, and the n microblog events corresponding to the emergency word set are clustered to obtain the final emergency clusters.
Preferably, the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when S is greater than a threshold, the microblog event belongs to the emergency corresponding to that emergency cluster.
Preferably, the step of calculating the similarity S is as follows:
let the two word sets be denoted C and H; the similarity of the word set C relative to H is given by the function R(C,H):

R(C,H) = |C ∩ H| / |C|

the similarity of the word set H relative to C is given by R(H,C):

R(H,C) = |C ∩ H| / |H|

the similarity S(C,H) of C and H is:

S(C,H) = (R(C,H) + R(H,C)) / 2

when the similarity S(C,H) of H and C is greater than a certain threshold, H and C are considered similar, and the events with similarity greater than the threshold are assigned to the same emergency cluster to complete the detection of emergencies.
The invention has the following beneficial effects: the method solves the problems that short-text data are sparse and word ambiguity cannot be resolved in conventional microblog emergency detection methods, and greatly improves the efficiency of emergency detection. Through the technical scheme, more accurate microblog emergencies can be obtained, which also helps relevant departments track subsequent event clues in time and control the fermentation of events.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for detecting microblog emergency based on a BERT-BTM network;
FIG. 2 is a diagram of the structure of the BERT-BTM model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a BERT-BTM network-based microblog emergency detection method, which comprises the following steps:
step 1: reading a microblog data set, wherein the acquired data set comprises the following steps:
in 9, 15 days, including Taiwan power, high-pass, Samsung, SK, Haishi, Meiguang, etc., the chips are not supplied to Huacheng.
[ surprise! The drunk male overbridge falls and is caught by the driver through the car roof for 9 months and 13 days, and the drunk male in Wuhan climbs outside the overbridge, so that the condition is very critical. A van driver finds an abnormality, drives the vehicle below a man, and catches the man at the moment of falling.
"Hua is the first date of chip outage" -9.15 Ri American ban became effective, Niguann states that Hua would not be coreless.
The overpass of the drunken men is caught by the roof of the driver, Wuhan men hang the overpass with five meters, and at the critical moment, the citizen with great concentration stops the minibus to catch the overbridge men.
[ gold pink sunset in Beijing ] 15 days in the evening, and the sky in Beijing under the sun's illumination, the color of gold pink! Such sky really loves! This is almost always the end of summer!
Filtering noise data such as HTML tags and special characters contained in the text data set through regular-expression matching to obtain a cleaned text sequence; performing word segmentation on the cleaned text sequence with a word segmentation tool (the open-source ICTCLAS word segmentation system is selected) to obtain a segmented sequence; then removing stop words from the microblog data set according to a stop-word list, and storing the processed data to obtain the original data set.
Raw data set:
Huawei/chip/cutoff/first day/China/chip/hundred million dollars/future/TSMC/Qualcomm/Samsung/SK Hynix/Micron/no longer/supply/chip/to/Huawei
shocking/drunk/man/overpass/fall/driver/roof/catch/Wuhan/drunk/man/climb/overpass/outside/situation/critical/van/driver/notice/abnormal/car/drive/man/below/man/fall/instant/catch
Huawei/chip/cutoff/first day/USA/ban/effective/Ni Guangnan/say/Huawei/will not/chipless/available
drunk/man/overpass/fall/driver/roof/catch/Wuhan/man/hang/five meters/high overpass/critical/moment/warm-hearted/citizen/stop/minibus/catch/falling/man
Beijing/appear/golden/pink/sunset/evening/Beijing/sky/sunset/shine/appear/golden pink/color/sky/almost/summer/end
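The cleaning and stop-word removal pipeline of step 1 can be sketched as follows. This is a minimal sketch: the stop-word list is hypothetical, and the ICTCLAS word segmenter is stubbed with a plain whitespace split (for English sample text), so it illustrates the pipeline rather than the patent's exact toolchain.

```python
import re

# Hypothetical stop-word list; the patent uses a Chinese stop-word list.
STOP_WORDS = {"the", "a", "of", "to", "is"}

def clean_text(raw: str) -> str:
    """Strip HTML tags and special characters via regular-expression matching."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)          # drop HTML tags
    no_special = re.sub(r"[^\w\s]", " ", no_tags)   # drop punctuation / special chars
    return re.sub(r"\s+", " ", no_special).strip()

def preprocess(raw: str) -> list[str]:
    """Clean, segment and remove stop words, yielding one document of the original data set."""
    tokens = clean_text(raw).lower().split()        # ICTCLAS would segment Chinese text here
    return [t for t in tokens if t not in STOP_WORDS]

doc = "<b>Drunk man falls</b> from the overpass, caught by a driver!"
print(preprocess(doc))
```

The segmented, stop-word-free token lists produced this way correspond to the slash-separated documents shown above.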
Step 2: vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain the BERT word vector set. The pre-trained BERT model is called through an API at the client to obtain the BERT word vector set.
The BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog event.
Step 3: constructing a BERT-BTM model, and processing the original data set through the BERT-BTM model to obtain an emergency word set:
s3.1, the BERT-BTM topic model shown in FIG. 2 is proposed. The event distribution theta in the microblog data set is obtained according to the Dirichlet prior parameter alpha, and the event z corresponding to the event distribution theta is obtained from theta. The event-word distribution phi_z corresponding to the event z is obtained according to the prior parameter beta_i fused with the BERT word vector set. The 2 different words w_i, w_j constituting a word pair are obtained according to the event z and the event-word distribution phi_z. There is one event-word distribution phi_z for each of the k input events; the event z and the words w_i, w_j form the word pair set.
s3.2, processing the original data set by using the BERT-BTM model, which specifically comprises the following steps:
s3.2.1, obtaining the event distribution theta of the microblog data set: theta ~ Dir(alpha);
s3.2.2, obtaining the word distribution phi_z of event z: phi_z ~ Dir(beta_i);
s3.2.3, obtaining word pairs:
(a) obtaining an event z: z ~ Multi(theta);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(phi_z);
(c) obtaining a word pair b from the words w_i, w_j: b = (w_i, w_j).
Calculating the probability of the word pair b:

p(b) = Σ_z p(z)·p(w_i|z)·p(w_j|z) = Σ_z theta_z·phi_{i|z}·phi_{j|z}

where p(b) is the probability of word pair b, p(z) = theta_z is the probability of event z, p(w_i|z) = phi_{i|z} is the probability of word w_i under event z, and p(w_j|z) = phi_{j|z} is the probability of word w_j under event z.
Calculating the probability of the word pair set B:

P(B) = Π_{(i,j)} Σ_z theta_z·phi_{i|z}·phi_{j|z}

where P(B) is the probability of the word pair set B, theta_z is the probability of event z, and phi_{i|z}, phi_{j|z} are the probabilities of words w_i and w_j under event z.
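The word pair construction of step s3.2 can be illustrated with a short sketch: every unordered pair of distinct co-occurring words within one short text forms a word pair b = (w_i, w_j). The sample documents are hypothetical.

```python
from itertools import combinations

def build_biterms(docs):
    """Build the word pair (biterm) set B: each unordered pair of distinct
    words co-occurring in a segmented short text becomes one word pair."""
    biterms = []
    for doc in docs:
        for wi, wj in combinations(doc, 2):
            if wi != wj:                 # the two words of a pair must differ
                biterms.append((wi, wj))
    return biterms

docs = [["Huawei", "chip", "ban"], ["drunk", "overpass", "fall"]]
print(build_biterms(docs))
# each 3-word document yields 3 unordered word pairs
```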
Step s3.3: theta and phi of the BERT-BTM model are inferred using the Gibbs sampling method. Gibbs sampling is an efficient Markov chain Monte Carlo (MCMC) sampling method that uses the conditional distribution of each variable to achieve sampling from a joint distribution. The steps by which the BERT-BTM model infers the document-event distribution are as follows:
input data: the number of events, the number of iterations, alpha and beta_i, the word pair set, and the dictionary size;
the input number of events is the number of events z in the microblog data set;
the word pair set is the set of word pairs in the original data set;
output data: the document-event distribution. The method specifically comprises the following steps:
s3.3.1, randomly assigning an event to each word pair b;
s3.3.2, carrying out N iterations, processing each word pair b in the word pair set B;
calculating the conditional probability distribution of the word pair b = (w_i, w_j):

P(z | z_{-b}, B) ∝ (n_z + alpha) · (n_{w_i|z} + beta_i)·(n_{w_j|z} + beta_i) / (n_{·|z} + M·beta_i)^2

where z represents the event assignment of word pair b, and z_{-b} represents the event assignments of all word pairs in B except b; n_z represents the number of times event z is assigned to a word pair; n_{w_i|z} represents the number of times word w_i is assigned to event z; n_{·|z} represents the total number of word assignments to event z; M represents the dictionary size, i.e. the number of distinct words in the original data set.
A new event is then sampled for b, and the counts n_z, n_{w_i|z} and n_{w_j|z} are updated.
s3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + alpha) / (n_B + T·alpha)

p(w|z) = (n_{w|z} + beta_i) / (n_{·|z} + M·beta_i)

In the above two formulas, n_z represents the number of times event z is assigned to a word pair b; n_B represents the number of word pairs in the original data set; T represents the number of events; n_{w|z} represents the number of times word w is assigned to event z; n_{·|z} represents the total number of word assignments to event z; M represents the dictionary size.
s3.3.4, according to p(z) and p(w|z), calculating the word pair-event distribution p(z|b):

p(z|b) = p(z)·p(w_i|z)·p(w_j|z) / Σ_{z'} p(z')·p(w_i|z')·p(w_j|z')

where p(w_i|z) is the probability of word w_i under event z and p(w_j|z) is the probability of word w_j under event z;
s3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
s3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b)·P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
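Steps s3.3.1 to s3.3.3 can be sketched as a toy Gibbs sampler. This is a simplified illustration under stated assumptions: a single scalar beta stands in for the BERT-fused prior beta_i, and the tiny corpus of word pairs is hypothetical, so it shows the update rule rather than the patented model itself.

```python
import random
from collections import defaultdict

def btm_gibbs(biterms, vocab, T=2, alpha=1.0, beta=0.01, iters=200, seed=0):
    """Toy BTM-style Gibbs sampler; scalar beta approximates the beta_i prior."""
    rng = random.Random(seed)
    M = len(vocab)
    n_z = defaultdict(int)                      # times event z assigned to a word pair
    n_wz = defaultdict(int)                     # times word w assigned to event z
    z_of = [rng.randrange(T) for _ in biterms]  # s3.3.1: random initial event per pair
    for (wi, wj), z in zip(biterms, z_of):
        n_z[z] += 1; n_wz[(wi, z)] += 1; n_wz[(wj, z)] += 1
    for _ in range(iters):                      # s3.3.2: N iterations
        for i, (wi, wj) in enumerate(biterms):
            z = z_of[i]                         # remove the current assignment
            n_z[z] -= 1; n_wz[(wi, z)] -= 1; n_wz[(wj, z)] -= 1
            weights = []
            for t in range(T):                  # conditional P(z | z_-b, B)
                nt = sum(n_wz[(w, t)] for w in vocab)
                weights.append((n_z[t] + alpha)
                               * (n_wz[(wi, t)] + beta) * (n_wz[(wj, t)] + beta)
                               / (nt + M * beta) ** 2)
            z = rng.choices(range(T), weights)[0]
            z_of[i] = z                         # update the counts
            n_z[z] += 1; n_wz[(wi, z)] += 1; n_wz[(wj, z)] += 1
    # s3.3.3: event distribution p(z) = (n_z + alpha) / (n_B + T*alpha)
    p_z = [(n_z[t] + alpha) / (len(biterms) + T * alpha) for t in range(T)]
    return p_z, z_of

biterms = [("chip", "ban"), ("chip", "Huawei"), ("ban", "Huawei"),
           ("drunk", "fall"), ("drunk", "overpass"), ("fall", "overpass")]
vocab = {"chip", "ban", "Huawei", "drunk", "fall", "overpass"}
p_z, z_of = btm_gibbs(biterms, vocab, T=2)
```

The event-word distribution p(w|z) and the document-event distribution P(z|d) follow from the same counts by the formulas above.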
The emergency distribution includes the document-event distribution and the event-word distribution.
Mapping the word vector set into an event vector set yields the emergency distribution; according to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution, as shown in Tables 1 and 2:
TABLE 1: document-event distribution (rendered as an image in the source)
TABLE 2: event-word distribution (rendered as an image in the source)
The optimal topic number K = 3 is obtained according to the perplexity, and the document-event distribution, event distribution and event-word distribution are obtained respectively (only the first 3 words with the largest proportion are retained).
Step 4: constructing the BERT-BTM network according to the emergency word set and the co-occurrence relations between words in the emergency word set, and then dividing the BERT-BTM network to complete the detection of emergencies.
The construction method of the BERT-BTM network is as follows.
The BERT-BTM network is constructed by using the words in the emergency word set obtained from the BERT-BTM model as nodes in the network, and the co-occurrence relations between the words as edges. The BERT-BTM network is represented using a NET-format data file, which is commonly used for complex networks and defines all the nodes and edges in the network. The NET file comprises a Vertices section and an Edges section: Vertices describes the nodes in the BERT-BTM network, and Edges describes the edges between the nodes. Let {A, B, C} be an emergency word set obtained from the microblog data set; represented in NET format, its structure is shown in Tables 3 and 4.
TABLE 3 (Vertices)
Node ID | Node label
1 | A
2 | B
3 | C
TABLE 4 (Edges)
Starting node ID | Endpoint node ID
1 | 2
1 | 3
2 | 3
The emergency word sets obtained from the microblog data set are integrated into a node set VertexSet and an edge set EdgeSet, and the two sets are output in sequence to a NET file to obtain the BERT-BTM network, as shown in Tables 5 and 6.
TABLE 5 (Vertices)
Node ID | Node label
1 | Huawei
2 | chip
3 | drunk
… | …
n | fall
TABLE 6 (Edges)
Starting node ID | Endpoint node ID
1 | 2
1 | 5
1 | 13
… | …
n | 9
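Writing the node set and edge set to a NET file as described above might look like the following sketch; the labels, edges and file name are illustrative.

```python
def write_net(path, labels, edges):
    """Write a Pajek-style .NET file: a *Vertices section (one line per node,
    1-based IDs plus quoted labels) followed by an *Edges section."""
    ids = {lab: i + 1 for i, lab in enumerate(labels)}   # 1-based node IDs
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"*Vertices {len(labels)}\n")
        for lab, i in ids.items():
            f.write(f'{i} "{lab}"\n')
        f.write("*Edges\n")
        for a, b in edges:                               # co-occurrence edges
            f.write(f"{ids[a]} {ids[b]}\n")

labels = ["Huawei", "chip", "drunk", "fall"]
edges = [("Huawei", "chip"), ("drunk", "fall")]
write_net("bert_btm.net", labels, edges)
print(open("bert_btm.net", encoding="utf-8").read())
```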
The network is partitioned by the GN algorithm to discover emergencies. The specific method is as follows:
when executing the event detection task, the GN algorithm partitions the network by continuously removing the edge with the highest edge betweenness. The GN algorithm flow is:
sequentially calculate the edge betweenness of each edge in the BERT-BTM network to be mined; find the edge with the maximum edge betweenness in the BERT-BTM network and delete it; recalculate the edge betweenness of all remaining edges; repeat until all edges are deleted. After the emergency communities are obtained through the GN algorithm, the words in the same community (the emergency word set) are taken as clustering center points, and the n corresponding microblog events in the emergency word set are clustered to find the microblog emergency clusters under the same microblog emergency. Single-pass clustering is used: the similarity S between a microblog event and the microblog emergency word set is calculated, and when S is greater than a threshold, the microblog is considered to describe the emergency.
Let C and H be two word sets, C = {c1, c2, c3, …, ct} and H = {h1, h2, h3, …, hm}. When calculating the similarity of the two word sets, a function R(C,H) is introduced to represent the similarity of the word set C relative to H:

R(C,H) = |C ∩ H| / |C|

Further, the similarity S(C,H) of C and H is defined as:

S(C,H) = (R(C,H) + R(H,C)) / 2

When the similarity S(C,H) of H and C is greater than a certain threshold, H and C are considered similar, and the microblog texts with similarity greater than the threshold are assigned to the same microblog emergency cluster, completing the detection of the microblog emergency. Example results are as follows:
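A minimal sketch of the similarity test and single-pass assignment follows. The overlap-ratio form of R is an assumed reading (the source renders the formula only as an image), and the cluster word sets and threshold are hypothetical.

```python
def R(C, H):
    # Similarity of word set C relative to H: assumed overlap-ratio form.
    return len(C & H) / len(C) if C else 0.0

def S(C, H):
    # Symmetric similarity of the two word sets.
    return (R(C, H) + R(H, C)) / 2

def assign(event_words, clusters, threshold=0.3):
    """Single-pass step: return the index of the emergency cluster whose word
    set is most similar to the event, or None if no similarity exceeds
    the threshold."""
    best, best_s = None, threshold
    for i, c in enumerate(clusters):
        s = S(event_words, c)
        if s > best_s:
            best, best_s = i, s
    return best

clusters = [{"Huawei", "chip", "ban"}, {"drunk", "overpass", "fall"}]
event = {"Huawei", "chip", "cutoff", "USA"}
print(assign(event, clusters))   # assigned to the first (chip-ban) cluster
```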
clusters corresponding to each microblog (the clusters are represented by the labels 1-3, the label 1 corresponds to the first cluster, and so on):
the 1 st and 3 rd microblog descriptions of the event No. 1 are obtained; the 2 nd and 4 th microblogs describe the event No. 2; the 5 th microblog describes event number 3, as shown in table 7.
TABLE 7
Microblog number | Cluster label
1 | 1
2 | 2
3 | 1
4 | 2
5 | 3
The emergency described by each cluster is represented by several characteristic words, as shown in Table 8:
TABLE 8: characteristic words of each emergency cluster (rendered as an image in the source)
The invention has the following beneficial effects: the method solves the problems that short-text data are sparse and word ambiguity cannot be resolved in conventional microblog emergency detection methods, and greatly improves the efficiency of emergency detection. Through the technical scheme, more accurate microblog emergencies can be obtained, which also helps relevant departments track subsequent event clues in time and control the fermentation of events.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented only to illustrate the principles of the invention; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A method for detecting microblog emergencies based on a BERT-BTM network, characterized by comprising the following steps:
S1, reading a microblog data set, performing word segmentation on it, and then removing stop words to obtain an original data set;
S2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-trained BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is the set of word vectors corresponding to the words in each microblog;
S3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
S4, constructing a BERT-BTM network according to the words in the emergency word set and the co-occurrence relations between them, and then partitioning the BERT-BTM network to complete the detection of emergencies.
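A minimal sketch of steps S1-S2, assuming a whitespace tokenizer and a toy embedding table in place of a real Chinese segmenter (e.g. jieba) and a pre-trained BERT model; every helper name here is illustrative, not the patent's implementation.

```python
# Illustrative stand-ins: STOP_WORDS, preprocess and embed are hypothetical
# helpers, not part of the patented method.
STOP_WORDS = {"the", "a", "of"}  # placeholder stop-word list

def preprocess(posts):
    """S1: segment each post and drop stop words (whitespace split as a stand-in)."""
    return [[w for w in p.split() if w not in STOP_WORDS] for p in posts]

def embed(words, dim=4):
    """S2 stand-in: a deterministic toy 'word vector' per word (not real BERT)."""
    return {w: [((hash(w) >> i) % 7) / 7.0 for i in range(dim)] for w in words}

posts = ["the fire broke out", "a flood of rain"]
docs = preprocess(posts)                                      # original data set
vectors = {w: v for d in docs for w, v in embed(d).items()}   # word vector set stand-in
```

In real use, `embed` would be replaced by a call to a pre-trained BERT model that returns contextual vectors per token.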
2. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein step S3 comprises:
S3.1, constructing the BERT-BTM model: calculating the event distribution θ of the microblog data set from the Dirichlet prior parameter α, and obtaining an event z from the event distribution θ;
calculating the event-word distribution φ corresponding to the event z from the prior parameter β_i fused with the BERT word vector set;
obtaining the two different words w_i, w_j of a word pair from the event z and the event-word distribution φ;
S3.2, processing the original data set with the BERT-BTM model to form word pairs;
S3.3, feeding the input data into the BERT-BTM model to obtain the output data;
the input data comprise the number of events, the number of iterations, α, β_i, the word pair set, and the dictionary size;
the output data are the emergency distributions;
the input number of events is the number of events z in the microblog data set;
the word pair set is the set of word pairs in the original data set;
the dictionary size is the number of distinct words in the original data set.
3. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein step S3.2 specifically comprises:
S3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
S3.2.2, obtaining the word distribution φ_z of the event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining word pairs.
4. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein step S3.2.3 obtains word pairs as follows:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining the words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) forming the word pair b from the words w_i, w_j: b = (w_i, w_j).
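The generative process of claims 3-4 can be sketched as follows; θ and the φ_z are fixed toy distributions here rather than actual Dirichlet draws, so the snippet only illustrates z ~ Multi(θ) and w_i, w_j ~ Multi(φ_z). The vocabulary and probabilities are invented.

```python
import random

def sample_biterm(theta, phi, rng):
    """Draw one word pair b = (w_i, w_j) from the BTM generative process."""
    z = rng.choices(range(len(theta)), weights=theta)[0]             # z ~ Multi(theta)
    words = list(phi[z])
    # Two words drawn i.i.d. from phi_z (in the patent's pairs the two words differ).
    wi, wj = rng.choices(words, weights=list(phi[z].values()), k=2)  # w_i, w_j ~ Multi(phi_z)
    return z, (wi, wj)

theta = [0.7, 0.3]                   # toy event distribution (stands in for a Dir(alpha) draw)
phi = [{"fire": 0.9, "rain": 0.1},   # toy word distribution of event 0
       {"fire": 0.1, "rain": 0.9}]   # toy word distribution of event 1
z, b = sample_biterm(theta, phi, random.Random(0))
```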
5. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein step S3.3 specifically comprises:
S3.3.1, randomly assigning an event to each word pair b;
S3.3.2, performing N iterations, processing each word pair b in the word pair set B;
S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z):
p(z) = (n_z + α) / (n_b + T·α)
p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M·β_i)
In the above two formulas, n_z denotes the number of times the event z is assigned to a word pair b; n_b denotes the number of word pairs in the original data set; T denotes the number of events; n_{w_i|z} denotes the number of times the event z is assigned to the word w_i; n_{w|z} denotes the number of times the word w is assigned to the event z; M denotes the dictionary size;
S3.3.4, obtaining the word pair-event distribution p(z|b) from p(z) and p(w|z):
p(z|b) = p(z)·p(w_i|z)·p(w_j|z) / Σ_z p(z)·p(w_i|z)·p(w_j|z)
wherein p(w_i|z) denotes the probability distribution of the word w_i corresponding to the event z, and p(w_j|z) denotes the probability distribution of the word w_j corresponding to the event z;
S3.3.5, calculating the document-word pair distribution p(b|d):
p(b|d) = n_d(b) / Σ_b n_d(b)
wherein n_d(b) is the frequency of occurrence of the word pair b in the document d; the document d and the original data set are the same data set;
S3.3.6, calculating the document-event distribution P(z|d) from the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):
P(z|d) = Σ_b P(z|b)·P(b|d)
wherein P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
6. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 5, wherein:
the emergency distributions comprise the document-event distribution and the event-word distribution;
according to the emergency distributions, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
7. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the BERT-BTM network is constructed as follows:
the BERT-BTM network is stored in the .net network file data format;
the words in the emergency word set serve as the nodes of the network;
the co-occurrence relations between the words in the emergency word set serve as the edges between the network nodes.
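A sketch of the construction in claim 7, with a plain adjacency structure standing in for the patent's .net network file; the posts and the emergency word set are invented examples.

```python
from itertools import combinations

def build_word_network(posts, event_words):
    """Nodes = words of the emergency word set; edges = co-occurrence in a post."""
    edges = set()
    for post in posts:
        present = sorted(set(post) & event_words)   # event words appearing in this post
        edges.update(combinations(present, 2))      # one edge per co-occurring pair
    return set(event_words), edges

posts = [["fire", "rescue", "smoke"], ["flood", "rescue"]]
nodes, edges = build_word_network(posts, {"fire", "rescue", "flood"})
```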
8. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the BERT-BTM network is partitioned as follows: the GN algorithm is used to repeatedly remove the edge with the highest edge betweenness, thereby dividing the BERT-BTM network;
the GN algorithm proceeds as follows: calculate the edge betweenness of each edge in the BERT-BTM network to be mined; find the edge with the maximum edge betweenness in the network and delete it; recalculate the edge betweenness of all remaining edges; repeat the above steps until all edges have been deleted;
each emergency word community takes its emergency word set as the clustering center, and the n microblog events corresponding to the emergency word set are clustered to obtain the final emergency clusters.
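The GN loop of claim 8 can be sketched on a toy graph; brute-force shortest-path enumeration keeps the sketch short (production code would use Brandes' algorithm, e.g. networkx's `girvan_newman`), and the six-node graph is invented.

```python
def simple_paths(adj, s, t, path=None):
    """All simple paths s -> t (exponential; fine for toy graphs only)."""
    path = path or [s]
    if s == t:
        yield list(path)
        return
    for n in adj[s]:
        if n not in path:
            yield from simple_paths(adj, n, t, path + [n])

def edge_betweenness(adj):
    """Share of shortest paths crossing each edge, summed over all node pairs."""
    bc, nodes = {}, sorted(adj)
    for i, s in enumerate(nodes):
        for t in nodes[i + 1:]:
            paths = list(simple_paths(adj, s, t))
            if not paths:
                continue
            shortest = [p for p in paths if len(p) == min(map(len, paths))]
            for p in shortest:
                for u, v in zip(p, p[1:]):
                    e = tuple(sorted((u, v)))
                    bc[e] = bc.get(e, 0.0) + 1.0 / len(shortest)
    return bc

def components(adj):
    """Connected components, read off as the communities after edge removal."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            stack, comp = [s], set()
            while stack:
                u = stack.pop()
                if u not in comp:
                    comp.add(u)
                    stack.extend(adj[u] - comp)
            seen |= comp
            comps.append(comp)
    return comps

# Two triangles joined by one bridge; one GN step removes the bridge.
graph_edges = [("a","b"), ("b","c"), ("a","c"), ("c","d"), ("d","e"), ("e","f"), ("d","f")]
adj = {}
for u, v in graph_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

bc = edge_betweenness(adj)
worst = max(bc, key=bc.get)                 # the bridge carries the most shortest paths
adj[worst[0]].discard(worst[1])
adj[worst[1]].discard(worst[0])
communities = components(adj)
```

After deleting the bridge, the two triangles emerge as separate word communities; the full GN algorithm continues removing edges and tracks the best division found.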
9. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 8, wherein the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when the similarity S is greater than a threshold, the microblog event belongs to the emergency corresponding to that emergency cluster.
10. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 9, wherein the similarity S is calculated as follows: let the two word sets be denoted C and H; the similarity of word set C relative to H is given by the function R(C,H), the similarity of word set H relative to C by the function R(H,C), and the similarity S(C,H) of C and H is obtained from R(C,H) and R(H,C) (the three formulas are provided only as images in the original publication). When the similarity S(C,H) between H and C exceeds a certain threshold, H and C are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, completing the detection of the emergency.
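Since the patent's R and S formulas appear only as images, the sketch below assumes a common set-overlap form: R(C,H) = |C∩H| / |C|, R(H,C) = |C∩H| / |H|, and S(C,H) as their average. The word sets and the 0.5 threshold are invented.

```python
def similarity(C, H):
    """Assumed overlap similarity; not necessarily the patent's exact formula."""
    inter = len(C & H)
    r_ch = inter / len(C) if C else 0.0   # R(C,H): overlap relative to C
    r_hc = inter / len(H) if H else 0.0   # R(H,C): overlap relative to H
    return (r_ch + r_hc) / 2              # S(C,H)

cluster_words = {"fire", "smoke", "rescue"}   # emergency word set (example)
post_words = {"fire", "smoke", "downtown"}    # words of one microblog event (example)
s = similarity(post_words, cluster_words)
assigned = s > 0.5                            # threshold is a tunable assumption
```

With two of three words shared, S = 2/3 here, so the post would join the cluster under this threshold.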
CN202011109749.1A 2020-10-16 2020-10-16 Microblog emergency detection method based on BERT-BTM network Active CN112257429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011109749.1A CN112257429B (en) 2020-10-16 2020-10-16 Microblog emergency detection method based on BERT-BTM network


Publications (2)

Publication Number Publication Date
CN112257429A true CN112257429A (en) 2021-01-22
CN112257429B CN112257429B (en) 2024-04-16

Family

ID=74244414


Country Status (1)

Country Link
CN (1) CN112257429B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032557A (en) * 2021-02-09 2021-06-25 北京工业大学 Microblog hot topic discovery method based on frequent word set and BERT semantics
CN117520484A (en) * 2024-01-04 2024-02-06 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model
CN104573031A (en) * 2015-01-14 2015-04-29 哈尔滨工业大学深圳研究生院 Micro blog emergency detection method
CN106611054A (en) * 2016-12-26 2017-05-03 电子科技大学 Method for extracting enterprise behavior or event from massive texts
CN107273496A (en) * 2017-06-15 2017-10-20 淮海工学院 A kind of detection method of micro blog network region accident


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Xuemou: "Pansystems operations research: epochal changes and the world's new revolutions in science and technology, military affairs, and education", Computer and Digital Engineering, no. 12 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032557A (en) * 2021-02-09 2021-06-25 北京工业大学 Microblog hot topic discovery method based on frequent word set and BERT semantics
CN113032557B (en) * 2021-02-09 2024-03-29 北京工业大学 Microblog hot topic discovery method based on frequent word sets and BERT semantics
CN117520484A (en) * 2024-01-04 2024-02-06 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics
CN117520484B (en) * 2024-01-04 2024-04-16 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics

Also Published As

Publication number Publication date
CN112257429B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
Prottasha et al. Transfer learning for sentiment analysis using BERT based supervised fine-tuning
Niu et al. LSTM-based VAE-GAN for time-series anomaly detection
Zhao et al. Image parsing with stochastic scene grammar
Khan et al. Toward smart lockdown: a novel approach for COVID-19 hotspots prediction using a deep hybrid neural network
Gorokhovatskyi et al. Analysis of application of cluster descriptions in space of characteristic image features
CN116682553B (en) Diagnosis recommendation system integrating knowledge and patient representation
Xie et al. Deep learning for natural language processing
CN110569920B (en) Prediction method for multi-task machine learning
CN111666350B (en) Medical text relation extraction method based on BERT model
CN112257429A (en) BERT-BTM network-based microblog emergency detection method
CN111368072A (en) Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN109448703A (en) In conjunction with the audio scene recognition method and system of deep neural network and topic model
CN108345633A (en) A kind of natural language processing method and device
CN114118416A (en) Variational graph automatic encoder method based on multi-task learning
CN115761275A (en) Unsupervised community discovery method and system based on graph neural network
Cilia et al. An experimental comparison between deep learning and classical machine learning approaches for writer identification in medieval documents
CN110175588B (en) Meta learning-based few-sample facial expression recognition method and system
CN117670571B (en) Incremental social media event detection method based on heterogeneous message graph relation embedding
Addo et al. Evae-net: An ensemble variational autoencoder deep learning network for covid-19 classification based on chest x-ray images
Cai et al. Da-gan: Dual attention generative adversarial network for cross-modal retrieval
CN114357022A (en) Media content association mining method based on event relation discovery
Kundana Data Driven Analysis of Borobudur Ticket Sentiment Using Naïve Bayes.
Ling et al. Enhancing Chinese Address Parsing in Low-Resource Scenarios through In-Context Learning
Wang et al. Motif-based graph representation learning with application to chemical molecules
CN115587595A (en) Multi-granularity entity recognition method for pathological text naming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant