CN112257429A - BERT-BTM network-based microblog emergency detection method - Google Patents
- Publication number
- CN112257429A (application CN202011109749.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- event
- bert
- emergency
- btm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344 — Query execution using natural language analysis
- G06F16/35 — Clustering; Classification
- G06F16/374 — Thesaurus
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The invention discloses a method for detecting microblog emergencies based on a BERT-BTM network. The method comprises: reading a microblog data set and processing it to obtain an original data set; vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain a BERT word vector set; constructing a BERT-BTM model and processing the original data set with it; and constructing a BERT-BTM network, then partitioning the network to complete the detection of emergencies. The method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and improves detection efficiency.
Description
Technical Field
The invention relates to the field of text detection, in particular to a microblog-oriented emergency identification method.
Background
With the rapid development of information technology in China, social network platforms such as Weibo (microblog), Twitter and Facebook have become major sources of big data and important media for emergent events, and have repeatedly been the first to publish major emergencies such as natural disasters and terrorist incidents. Emergent public events touch the social, political, economic and cultural spheres of modern life and cover issues including medical treatment, education, law and entertainment. Detecting such events not only raises public awareness but also benefits related applications such as public-opinion mining, emerging-topic detection and topic-thread tracking. Based on the above, it is significant to design a more accurate and effective method for detecting emergencies on social network platforms such as microblogs.
The current microblog emergency detection task faces several urgent problems. On the one hand, traditional methods suffer from sparse short-text features and cannot resolve word-sense ambiguity. On the other hand, after a topic model extracts the event topics of documents, researchers usually apply a clustering algorithm such as K-means, which requires multiple iterations and a pre-specified number of clusters; this is inefficient and cannot detect emergencies quickly.
Disclosure of Invention
The invention aims to provide a BERT-BTM network-based microblog emergency detection method that addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods. The disclosed method comprises the following steps:
s1, reading a microblog data set, performing word segmentation processing on the microblog data set, and then removing stop words to obtain an original data set;
s2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-training BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
S3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, according to the co-occurrence relation between the words in the emergency word set and the words in the emergency word set, building a BERT-BTM network, and then dividing the BERT-BTM network to complete the detection of the emergency.
Preferably, the step S3 includes:
S3.1, constructing a BERT-BTM model: calculating the event distribution θ in the microblog data set according to the Dirichlet prior parameter α, and calculating the event z corresponding to the event distribution θ;
according to the prior parameter β_i fused with the BERT word vector set, calculating the event-word distribution φ corresponding to the event z;
calculating the 2 different words w_i, w_j of a word pair according to the event z and the event-word distribution φ;
S3.2, processing the original data set by using a BERT-BTM model to form word pairs;
S3.3, inputting the input data into the BERT-BTM model to obtain output data;
the input data comprises the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;
the output data comprises an incident distribution;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
the dictionary size is the number of words that the original data set does not repeat.
Preferably, the step S3.2 specifically includes:
S3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining the probability distributions of the word pairs and the word pair set.
Preferably, the S3.2.3 method for obtaining the probability distributions of the word pairs and the word pair set comprises:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j);
Preferably, the step S3.3 specifically includes:
s3.3.1, randomly distributing a theme for the word pair b;
S3.3.2, performing N iterations and processing each word pair b in the word pair set B;
S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + α) / (n_b + T_α · α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M · β_i)

In the two formulas above, n_z represents the number of times event z is assigned to a word pair b; n_b represents the number of word pairs in the original data set; T_α represents the number of events; n_{w_i|z} represents the number of times event z is assigned to word w_i; n_{w|z} represents the number of times word w is assigned to event z; M represents the dictionary size. S3.3.4, obtaining the word pair-event distribution p(z|b) according to p(z) and p(w|z):

p(z|b) = p(z) p(w_i|z) p(w_j|z) / Σ_{z'} p(z') p(w_i|z') p(w_j|z')

where p(w_i|z) represents the probability distribution of word w_i corresponding to event z, and p(w_j|z) the probability distribution of word w_j corresponding to event z;
S3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
Preferably, the emergency distribution includes: the document-event distribution, event-word distribution;
according to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
Preferably, the method for constructing the BERT-BTM network comprises the following steps:
the BERT-BTM network is represented by a data file in NET format;
the words in the emergency word set are used as nodes in the network;
and taking the co-occurrence relation between the words in the emergency word set as the edges between the network nodes.
Preferably, the method for dividing the BERT-BTM network comprises the following steps: continuously removing the edge with the highest edge betweenness by using a GN algorithm to divide the BERT-BTM network;
the GN algorithm flow is as follows:
sequentially calculating edge betweenness of each edge in the BERT-BTM network to be mined; finding out an edge with the maximum edge betweenness in the BERT-BTM network and then deleting the edge; recalculating edge betweenness of all the remaining edges; repeating the steps until all edges are deleted;
and the emergency word community takes the emergency word set as a clustering central point, and clusters the n corresponding microblog events in the emergency word set to obtain a final emergency cluster.
Preferably, the clustering method is unilateral clustering: and calculating the similarity S between the microblog event and the emergency word set, wherein when the similarity S between the microblog event and the emergency word set is greater than a threshold value, the microblog event is the emergency corresponding to the emergency cluster.
Preferably, the step of calculating the similarity S is as follows:
Let the two word sets be denoted C and H. The similarity of word set C relative to H is given by the function R(C,H):

R(C,H) = |C ∩ H| / |C|

The similarity of word set H relative to C is given by R(H,C):

R(H,C) = |C ∩ H| / |H|

The similarity S(C,H) of C and H is then:

S(C,H) = (R(C,H) + R(H,C)) / 2

When the similarity S(C,H) of H and C is greater than a certain threshold, H and C are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, completing the detection of the emergency.
The invention has the following beneficial effects: the method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and greatly improves detection efficiency. With this technical scheme, more accurate microblog emergencies can be obtained, and relevant departments can track follow-up event clues in time and control the escalation of events.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for detecting microblog emergency based on a BERT-BTM network;
FIG. 2 is a diagram of the structure of the BERT-BTM model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for detecting microblog emergencies based on the BERT-BTM network, comprising the following steps:
step 1: reading a microblog data set, wherein the acquired data set comprises the following steps:
in 9, 15 days, including Taiwan power, high-pass, Samsung, SK, Haishi, Meiguang, etc., the chips are not supplied to Huacheng.
[ surprise! The drunk male overbridge falls and is caught by the driver through the car roof for 9 months and 13 days, and the drunk male in Wuhan climbs outside the overbridge, so that the condition is very critical. A van driver finds an abnormality, drives the vehicle below a man, and catches the man at the moment of falling.
"Hua is the first date of chip outage" -9.15 Ri American ban became effective, Niguann states that Hua would not be coreless.
The overpass of the drunken men is caught by the roof of the driver, Wuhan men hang the overpass with five meters, and at the critical moment, the citizen with great concentration stops the minibus to catch the overbridge men.
[ gold pink sunset in Beijing ] 15 days in the evening, and the sky in Beijing under the sun's illumination, the color of gold pink! Such sky really loves! This is almost always the end of summer!
Noise such as HTML-tag special characters in the text data set is filtered out by regular-expression matching to obtain a cleaned text sequence; the cleaned text sequence is then segmented with a word segmentation tool — the open-source ICTCLAS segmentation system is selected — to obtain a segmented sequence. Stop words are then removed from the microblog data set according to a stop-word list, and the processed data set is stored to obtain the original data set.
Raw data set:
hua ye/chip/outage/first day/china/chip/hundred million dollars/future/electricity/high pass/samsung/sea/lishi/beauty light/no more/supply/chip/give/hua ye
Fright/drunk/man/overpass/fall/driver/roof/catch/martial/drunk/man/climb to/overpass/out/situation/emergency/crisis/van/driver/sniff/abnormal/car/drive to/man/down/man/fall/instant/take/catch/car/man/down/man/fall/instant/go/catch/go
Hua is/chip/outage/first date/usa/ban/effect/nihonam/title/hua is/don't care/coreless/available
Drunk/man/overpass/fall/driver/roof/catch/wuhan/man/hang/five meters/high overpass/key/time/enthusiasm/citizen/stop/minibus/catch/fall bridge/man
Beijing/appeared/gold/pink/sunset/evening/Beijing/sky/sunset/shine/appeared/gold powder/color/sky/nearly/summer/ending
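As an illustration, the preprocessing of Step 1 can be sketched in Python. The stop-word list and the "/"-separated pre-segmented input are assumptions for illustration; a real pipeline would segment the cleaned text with ICTCLAS or a similar Chinese segmenter.

```python
import re

# Stand-in stop-word list (an assumption for illustration; a real system
# loads a standard Chinese stop-word file).
STOPWORDS = {"the", "a", "of", "and", "is"}

def preprocess(raw_text, stopwords=STOPWORDS):
    """Strip HTML tags and URLs by regular matching, split the pre-segmented
    text on '/', and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_text)      # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    tokens = [t.strip() for t in text.split("/") if t.strip()]
    return [t for t in tokens if t.lower() not in stopwords]

tokens = preprocess("<b>drunk</b>/man/overpass/fall/the/driver/roof/catch")
```

The same cleaning regexes would be applied before segmentation in the real pipeline; only the tokenization step differs.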
Step 2: vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain the BERT word vector set. The pre-trained BERT model is called through an API at the client to obtain the BERT word vector set.
The BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog event.
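A sketch of obtaining word vectors: the patent only states that a pre-trained BERT model is called through an API, so the `transformers` library and the `bert-base-chinese` checkpoint named in the comment are assumptions, not the patent's specification. When BERT splits a word into several subword tokens, one common way to obtain a single word vector is mean-pooling, shown here in pure Python:

```python
# A typical API call (an assumption of this sketch) would look like:
#   from transformers import BertTokenizer, BertModel
#   tok = BertTokenizer.from_pretrained("bert-base-chinese")
#   model = BertModel.from_pretrained("bert-base-chinese")
#   hidden = model(**tok(word, return_tensors="pt")).last_hidden_state
# The subword vectors in `hidden` can then be pooled into one word vector:

def mean_pool(subword_vectors):
    """Average equal-length subword vectors into one word vector."""
    n = len(subword_vectors)
    dim = len(subword_vectors[0])
    return [sum(v[i] for v in subword_vectors) / n for i in range(dim)]

vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Mean-pooling is one of several pooling choices; taking the first subword's vector or max-pooling are common alternatives.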
Step 3, constructing a BERT-BTM model, and processing the original data set through the BERT-BTM model to obtain an emergency word set:
S3.1, the BERT-BTM topic model shown in Fig. 2 is proposed. The event distribution θ in the microblog data set is obtained according to the Dirichlet prior parameter α, and the event z corresponding to θ is obtained from it; the event-word distribution φ_z corresponding to event z is obtained according to the prior parameter β_i fused with the BERT word vector set; the 2 different words w_i, w_j constituting a word pair are obtained according to the event z and the event-word distribution φ_z. The event-word distributions correspond to the input event number k; the event z and the words w_i, w_j form the word pair set.
S3.2, processing the original data set by using a BERT-BTM model, which specifically comprises the following steps:
S3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining word pairs:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j);
Calculating the probability of the word pair b:

p(b) = Σ_z p(z) p(w_i|z) p(w_j|z) = Σ_z θ_z φ_{i|z} φ_{j|z}

where p(b) is the probability of word pair b, p(z) = θ_z is the probability of event z, P(w_i|z) = φ_{i|z} is the probability of word w_i under event z, and P(w_j|z) = φ_{j|z} is the probability of word w_j under event z.
Calculating the probability of the word pair set B:

P(B) = Π_{(i,j)} Σ_z θ_z φ_{i|z} φ_{j|z}

where P(B) is the probability of the word pair set B, θ_z is the probability of event z, φ_{i|z} is the probability of word w_i under event z, and φ_{j|z} is the probability of word w_j under event z.
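The word-pair (biterm) construction and the probability p(b) above can be sketched as follows; the θ and φ values are toy numbers assumed for illustration, not values from the patent:

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """All unordered word pairs (biterms) of distinct words in one document."""
    return [tuple(sorted(p)) for p in combinations(doc_tokens, 2) if p[0] != p[1]]

def p_biterm(b, theta, phi):
    """p(b) = sum over events z of theta[z] * phi[z][w_i] * phi[z][w_j]."""
    wi, wj = b
    return sum(theta[z] * phi[z].get(wi, 0.0) * phi[z].get(wj, 0.0)
               for z in range(len(theta)))

biterms = extract_biterms(["huawei", "chip", "cutoff"])
# toy distributions for two events (assumed values, for illustration only)
theta = [0.5, 0.5]
phi = [{"huawei": 0.5, "chip": 0.5}, {"chip": 0.4, "cutoff": 0.6}]
prob = p_biterm(("chip", "huawei"), theta, phi)
```

Unlike LDA, which models word occurrences per document, BTM models these document-level word co-occurrences directly, which is what makes it suitable for short texts.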
Step S3.3: θ and φ of the BERT-BTM model are inferred using Gibbs sampling. Gibbs sampling is an efficient Markov chain Monte Carlo (MCMC) method that samples from a joint distribution by drawing from the conditional distribution of each variable in turn. The BERT-BTM model infers the document-event distribution as follows:
Input data: the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
outputting data: document-event distribution. The method specifically comprises the following steps:
s3.3.1, randomly distributing a theme for the word pair b;
S3.3.2, performing N iterations and processing each word pair b in the word pair set B;
Calculating the conditional probability distribution of the word pair b = (w_i, w_j):

P(z | z_{-b}, B) ∝ (n_z + α) · (n_{w_i|z} + β_i)(n_{w_j|z} + β_i) / (Σ_w n_{w|z} + M · β_i)²

where z denotes the event assignment of word pair b, and z_{-b} denotes the event assignments of all word pairs in B except b; n_z is the number of times event z is assigned to a word pair; n_{w_i|z} is the number of times event z is assigned to word w_i; n_{w|z} is the number of times word w is assigned to event z; M is the dictionary size, i.e. the number of distinct words in the original data set.
S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + α) / (n_b + T_α · α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M · β_i)

In the two formulas above, n_z is the number of times event z is assigned to a word pair; n_b is the number of word pairs in the original data set; T_α is the number of events; n_{w_i|z} is the number of times event z is assigned to word w_i; n_{w|z} is the number of times word w is assigned to event z; M is the dictionary size.
S3.3.4, calculating the word pair-event distribution p(z|b) from p(z) and p(w|z):

p(z|b) = p(z) p(w_i|z) p(w_j|z) / Σ_{z'} p(z') p(w_i|z') p(w_j|z')

where p(w_i|z) is the probability distribution of word w_i corresponding to event z, and p(w_j|z) is the probability distribution of word w_j corresponding to event z;
S3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
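The Gibbs-sampling inference of steps S3.3.1 through S3.3.6 can be sketched as a minimal collapsed sampler. One simplification is assumed here: the BERT-fused prior β_i is replaced by a single scalar β, and the example documents and hyperparameter values are toys for illustration.

```python
import random
from itertools import combinations

def btm_gibbs(docs, K, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for the biterm topic model (BTM).
    docs: list of token lists; K: number of events.
    Returns the event distribution p(z) and event-word distributions p(w|z)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    M = len(vocab)                                  # dictionary size
    biterms = [tuple(sorted(p)) for d in docs for p in combinations(d, 2)]
    z_of = [rng.randrange(K) for _ in biterms]      # random initial events
    n_z = [0] * K                                   # word pairs assigned to z
    n_wz = [dict() for _ in range(K)]               # word counts per event
    for b, z in zip(biterms, z_of):
        n_z[z] += 1
        for w in b:
            n_wz[z][w] = n_wz[z].get(w, 0) + 1
    for _ in range(iters):
        for i, (wi, wj) in enumerate(biterms):
            z = z_of[i]                             # remove current assignment
            n_z[z] -= 1
            n_wz[z][wi] -= 1
            n_wz[z][wj] -= 1
            weights = []                            # conditional P(z | z_-b, B)
            for k in range(K):
                tot = sum(n_wz[k].values())
                weights.append((n_z[k] + alpha)
                               * (n_wz[k].get(wi, 0) + beta)
                               * (n_wz[k].get(wj, 0) + beta)
                               / ((tot + M * beta) ** 2))
            z = rng.choices(range(K), weights=weights)[0]
            z_of[i] = z                             # re-assign, restore counts
            n_z[z] += 1
            for w in (wi, wj):
                n_wz[z][w] = n_wz[z].get(w, 0) + 1
    n_b = len(biterms)
    p_z = [(n_z[k] + alpha) / (n_b + K * alpha) for k in range(K)]
    p_wz = [{w: (n_wz[k].get(w, 0) + beta) / (sum(n_wz[k].values()) + M * beta)
             for w in vocab} for k in range(K)]
    return p_z, p_wz

p_z, p_wz = btm_gibbs([["huawei", "chip", "cutoff"],
                       ["drunk", "overpass", "fall"]],
                      K=2, alpha=1.0, beta=0.1, iters=50)
```

The closing-formula lines implement the smoothed estimates of p(z) and p(w|z); p(z|b) and P(z|d) then follow from them as in S3.3.4 and S3.3.6.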
The incident distribution includes: the document-event distribution, event-word distribution.
The word vector set is mapped into an event vector set to obtain the emergency distribution. According to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution; an example is shown in tables 1 and 2:
TABLE 1
TABLE 2
The optimal topic number K = 3 is obtained according to perplexity, and the document-event distribution, the event distribution and the event-word distribution are obtained respectively (only the 3 words with the largest weight are retained per event).
And 4, constructing a BERT-BTM network according to the emergent event word set and the co-occurrence relation between words in the emergent event word set, and then dividing the BERT-BTM network to finish the emergent event detection.
The construction method of the BERT-BTM network is concretely as follows.
The BERT-BTM network is constructed by using the words in the emergency word set obtained from the BERT-BTM model as nodes and the co-occurrence relations between those words as edges. The network is represented using a NET-format data file, which is commonly used for complex networks and defines all nodes and edges of the network. The NET file comprises a Vertices section, describing the nodes of the BERT-BTM network, and an Edges section, describing the edges between them. Let {A, B, C} be an emergency word set obtained from the microblog data set; represented in NET format, its structure is shown in tables 3 and 4.
TABLE 3
Vertices
Node ID | Node label |
1 | A |
2 | B |
3 | C |
TABLE 4
Edges
Starting node ID | Endpoint node ID |
1 | 2 |
1 | 3 |
2 | 3 |
The emergency word sets obtained from the microblog data set are integrated into a node set VerticeSet and an edge set EdgeSet, and the two sets are output in turn to a NET file to obtain the BERT-BTM network, as shown in tables 5 and 6.
TABLE 5
Vertices
Node ID | Node label |
1 | Huawei |
2 | Chip |
3 | Drunk |
… | … |
n | Fall |
TABLE 6
Edges
Starting node ID | Endpoint node ID |
1 | 2 |
1 | 5 |
1 | 13 |
… | … |
n | 9 |
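Writing the node set and edge set to a Pajek-style NET file, as described above, can be sketched as follows (the node labels are from the running example; the helper name is illustrative):

```python
def to_net(nodes, edges):
    """Serialize an event-word network to Pajek-style .NET text.
    nodes: list of node labels; edges: list of (label, label) pairs."""
    idx = {label: i + 1 for i, label in enumerate(nodes)}   # ids start at 1
    lines = ["*Vertices %d" % len(nodes)]
    lines += ['%d "%s"' % (idx[n], n) for n in nodes]
    lines.append("*Edges")
    lines += ["%d %d" % (idx[a], idx[b]) for a, b in edges]
    return "\n".join(lines)

net = to_net(["Huawei", "chip", "drunk"], [("Huawei", "chip")])
```

The resulting text mirrors tables 3 through 6: a `*Vertices` header with one quoted label per node, then `*Edges` with one id pair per co-occurrence edge.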
And partitioning the network by adopting a GN algorithm so as to discover the emergency. The specific method comprises the following steps:
the GN algorithm classifies the network by continuously removing the edge with the highest edge betweenness when executing the event detection task, and the GN algorithm flow is as follows:
The edge betweenness of each edge in the BERT-BTM network to be mined is computed; the edge with the maximum edge betweenness is found and deleted; the edge betweenness of all remaining edges is recomputed; these steps are repeated until all edges are deleted. After the emergency communities are obtained by the GN algorithm, the words in the same community (the emergency word set) are used as cluster centers, and the n corresponding microblog events are clustered to find the microblog emergency clusters under the same emergency. Single-pass clustering is used: the similarity S between a microblog event and a microblog emergency word set is calculated, and when S exceeds a threshold, the microblog is considered to describe that emergency.
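A minimal sketch of the GN partitioning: edge betweenness via Brandes' algorithm for unweighted graphs, then one removal step. The two-triangle graph below is a toy example, not data from the patent.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes' algorithm for edge betweenness on an undirected graph.
    adj: {node: set of neighbours}; returns {frozenset({u, v}): score}."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                   # dependency accumulation
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bet.items()}     # undirected: halve

def girvan_newman_step(adj):
    """One GN iteration: delete the edge with the highest edge betweenness."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v); adj[v].discard(u)
    return (u, v)

# toy network: two triangles joined by the bridge c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
removed = girvan_newman_step(adj)
```

Repeating `girvan_newman_step` and tracking connected components between removals yields the community structure; here the bridge carries all cross-triangle shortest paths, so it is deleted first, splitting the graph into two communities.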
Let C = {c1, c2, c3, …, ct} and H = {h1, h2, h3, …, hm} be two word sets. When calculating the similarity of two word sets, a function R(C,H) is introduced to represent the similarity of word set C relative to H:

R(C,H) = |C ∩ H| / |C|

and likewise R(H,C) = |C ∩ H| / |H| for H relative to C. Further, the similarity S(C,H) of C and H is defined as:

S(C,H) = (R(C,H) + R(H,C)) / 2
similarity of H and C S(C,H)And when the similarity is larger than a certain threshold value, the H and the C are considered to be similar, and the microblog texts with the similarity larger than the threshold value are distributed to the same microblog emergency cluster to finish the detection of the microblog emergency. ResultsExamples are as follows:
clusters corresponding to each microblog (the clusters are represented by the labels 1-3, the label 1 corresponds to the first cluster, and so on):
the 1 st and 3 rd microblog descriptions of the event No. 1 are obtained; the 2 nd and 4 th microblogs describe the event No. 2; the 5 th microblog describes event number 3, as shown in table 7.
TABLE 7
Microblog numbering | Reference numerals |
1 | 1 |
2 | 2 |
3 | 1 |
4 | 2 |
5 | 3 |
The emergency described by each cluster is represented by several characteristic words, as shown in table 8, for example as follows:
TABLE 8
The invention has the following beneficial effects: the method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and greatly improves detection efficiency. With this technical scheme, more accurate microblog emergencies can be obtained, and relevant departments can track follow-up event clues in time and control the escalation of events.
The foregoing illustrates and describes the principles, general features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (10)
1. The method for detecting the microblog emergency based on the BERT-BTM network is characterized by comprising the following steps of:
s1, reading a microblog data set, performing word segmentation processing on the microblog data set, and then removing stop words to obtain an original data set;
s2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-training BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
s3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, building a BERT-BTM network according to the words in the emergency word set and the co-occurrence relations among them, and then dividing the BERT-BTM network to complete the detection of the emergency.
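The preprocessing of step S1 can be sketched as follows. This is a minimal illustration only: it assumes a toy whitespace tokenizer and an invented stop-word list in place of a real Chinese word segmenter (such as jieba) and a real stop-word dictionary.

```python
# Sketch of step S1: word segmentation followed by stop-word removal.
# STOP_WORDS is an illustrative, invented list, not the patent's dictionary.
STOP_WORDS = {"the", "a", "of", "in"}

def preprocess(posts):
    """Tokenize each microblog post and drop stop words (toy tokenizer)."""
    corpus = []
    for post in posts:
        tokens = [t.lower() for t in post.split()]  # stand-in for real segmentation
        corpus.append([t for t in tokens if t not in STOP_WORDS])
    return corpus

posts = ["A fire broke out in the city center"]
print(preprocess(posts))  # → [['fire', 'broke', 'out', 'city', 'center']]
```

The resulting token lists form the "original data set" that steps S2-S4 operate on.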
2. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the step S3 includes:
s3.1, constructing a BERT-BTM model: calculating the event distribution θ in the microblog data set according to the Dirichlet prior parameter α, and obtaining the event z corresponding to the event distribution θ;
calculating the event-word distribution φ corresponding to the event z according to the prior parameter β_i fused with the BERT word vector set;
sampling the two different words w_i, w_j of a word pair according to the event z and the event-word distribution φ;
S3.2, processing the original data set by using a BERT-BTM model to form word pairs;
s3.3, inputting the input data into a BERT-BTM model to obtain output data;
the input data comprises the number of events, the number of iterations, the α, the β_i, the word pair set, and the dictionary size;
the output data is an emergency distribution;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
the dictionary size is the number of distinct (non-repeated) words in the original data set.
3. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps: the step S3.2 specifically includes:
s3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
s3.2.2, obtaining the word distribution φ_z of the event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining word pairs.
4. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps:
the word pairs in S3.2.3 are obtained as follows:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining the words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining a word pair b from the words w_i and w_j: b = (w_i, w_j).
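The generative process of claims 3-4 (θ ~ Dir(α), φ_z ~ Dir(β_i), z ~ Multi(θ), w_i, w_j ~ Multi(φ_z)) can be sketched with standard-library sampling. Note one simplification: in the actual model β_i is a word-specific prior fused with the BERT word vectors, whereas this illustration assumes a symmetric scalar prior.

```python
import random

def dirichlet(alphas):
    """Sample a probability vector from Dir(alphas) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_word_pair(alpha, beta, n_events, vocab):
    """One draw of the generative process: an event z, then a biterm b = (w_i, w_j)."""
    theta = dirichlet([alpha] * n_events)                  # theta ~ Dir(alpha)   (S3.2.1)
    z = random.choices(range(n_events), weights=theta)[0]  # z ~ Multi(theta)     (a)
    phi_z = dirichlet([beta] * len(vocab))                 # phi_z ~ Dir(beta)    (S3.2.2)
    w_i, w_j = random.choices(vocab, weights=phi_z, k=2)   # w_i, w_j ~ Multi(phi_z)  (b)
    return z, (w_i, w_j)                                   # word pair b          (c)

vocab = ["fire", "rescue", "flood", "earthquake"]
z, b = generate_word_pair(alpha=0.5, beta=0.1, n_events=3, vocab=vocab)
```

A BTM draws such biterms directly from the whole corpus rather than per document, which is what makes it suitable for sparse short texts like microblogs.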
5. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps:
the step S3.3 specifically includes:
s3.3.1, randomly assigning an event to each word pair b;
s3.3.2, performing N iterations, and processing each word pair b in the word pair set B;
s3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z):

p(z) = (n_z + α) / (n_b + T·α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M·β_i)

In the above two formulas, n_z represents the number of word pairs to which the event z is assigned; n_b represents the number of word pairs in the original data set; T represents the number of events; n_{w_i|z} represents the number of times the event z is assigned to the word w_i; n_{w|z} represents the number of times the word w is assigned to the event z; M represents the dictionary size;
s3.3.4, obtaining the word pair-event distribution p(z|b) from the p(z) and p(w|z):

p(z|b) ∝ p(z) · p(w_i|z) · p(w_j|z)

wherein p(w_i|z) represents the probability distribution of the word w_i corresponding to the event z, and p(w_j|z) represents the probability distribution of the word w_j corresponding to the event z;
s3.3.5, calculating the document-word pair distribution p(b|d):

p(b|d) = n_d(b) / Σ_b n_d(b)

wherein n_d(b) is the frequency of occurrence of the word pair b in the document d; the document d and the original data set are the same data set;
s3.3.6, calculating the document-event distribution P(z|d) from the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) · P(b|d)

wherein P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
6. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 5, wherein:
the emergency distribution includes the document-event distribution and the event-word distribution;
according to the emergency distribution, the words of the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
7. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the method for constructing the BERT-BTM network comprises the following steps:
the BERT-BTM network is represented in the NET data format;
the words in the emergency word set are used as nodes in the network;
and taking the co-occurrence relation between the words in the emergency word set as the edges between the network nodes.
8. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the method for dividing the BERT-BTM network comprises the following steps: continuously removing the edge with the highest edge betweenness by using a GN algorithm to divide the BERT-BTM network;
the GN algorithm flow is as follows: calculate the edge betweenness of each edge in the BERT-BTM network to be mined; find the edge with the maximum edge betweenness in the BERT-BTM network and delete it; recalculate the edge betweenness of all remaining edges; repeat the above steps until all edges are deleted;
each emergency word community takes its emergency word set as the cluster center, and the n microblog events corresponding to the emergency word set are clustered to obtain the final emergency clusters.
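The GN division in claim 8 can be sketched with a brute-force edge-betweenness computation, adequate for small networks; the toy graph below (two tight word communities joined by a bridge, with invented node names) illustrates one removal step.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate all shortest paths from s to t by BFS layering."""
    dist, preds = {s: 0}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    if t not in dist:
        return []
    def back(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in preds[v] for p in back(u)]
    return back(t)

def edge_betweenness(adj):
    """Fractional count of pairwise shortest paths passing through each edge."""
    eb = {}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for p in paths:
            for e in zip(p, p[1:]):
                e = tuple(sorted(e))
                eb[e] = eb.get(e, 0.0) + 1.0 / len(paths)
    return eb

def gn_step(adj):
    """One GN iteration: delete the edge with the maximum edge betweenness."""
    eb = edge_betweenness(adj)
    u, v = max(eb, key=eb.get)
    adj[u].remove(v)
    adj[v].remove(u)
    return (u, v)

# Two word communities {a, b, c} and {d, e, f} joined by the bridge c-d:
# all 9 cross-community shortest paths use c-d, so GN removes it first.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
       "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
removed = gn_step(adj)
```

Repeating `gn_step` and tracking connected components yields the communities (emergency word sets) that the claim uses as cluster centers.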
9. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 8, wherein:
the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when the similarity S is greater than a threshold value, the microblog event belongs to the emergency corresponding to that emergency cluster.
10. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 9, wherein:
the step of calculating the similarity S is as follows: let the two word sets be denoted C and H; the similarity of the word set C relative to H is given by the function R(C,H);
the similarity of the word set H relative to C is given by the function R(H,C);
the similarity S(C,H) of C and H is obtained from R(C,H) and R(H,C);
when the similarity S(C,H) of H and C is greater than a certain threshold value, H and C are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, completing the detection of the emergency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011109749.1A CN112257429B (en) | 2020-10-16 | 2020-10-16 | Microblog emergency detection method based on BERT-BTM network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112257429A true CN112257429A (en) | 2021-01-22 |
CN112257429B CN112257429B (en) | 2024-04-16 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113032557A (en) * | 2021-02-09 | 2021-06-25 | 北京工业大学 | Microblog hot topic discovery method based on frequent word set and BERT semantics |
CN117520484A (en) * | 2024-01-04 | 2024-02-06 | 中国电子科技集团公司第十五研究所 | Similar event retrieval method, system, equipment and medium based on big data semantics |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102289487A (en) * | 2011-08-09 | 2011-12-21 | 浙江大学 | Network burst hotspot event detection method based on topic model |
CN104573031A (en) * | 2015-01-14 | 2015-04-29 | 哈尔滨工业大学深圳研究生院 | Micro blog emergency detection method |
CN106611054A (en) * | 2016-12-26 | 2017-05-03 | 电子科技大学 | Method for extracting enterprise behavior or event from massive texts |
CN107273496A (en) * | 2017-06-15 | 2017-10-20 | 淮海工学院 | A kind of detection method of micro blog network region accident |
Non-Patent Citations (1)
Title |
---|
WU Xuemou: "Pansystems operations research: epochal change and the world's new revolution in science and technology, military affairs, and education", Computer & Digital Engineering, no. 12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||