CN112257429A - BERT-BTM network-based microblog emergency detection method - Google Patents
- Publication number
- CN112257429A (application CN202011109749.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- event
- bert
- emergency
- btm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344 — Query execution using natural language analysis
- G06F16/35 — Clustering; Classification
- G06F16/374 — Thesaurus
- G06F40/216 — Parsing using statistical methods
- G06F40/242 — Dictionaries
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The invention discloses a method for detecting microblog emergencies based on a BERT-BTM network. The method comprises: reading a microblog data set and processing it to obtain an original data set; vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain a BERT word vector set; constructing a BERT-BTM model and processing the original data set with it; and constructing a BERT-BTM network, then partitioning the network to complete the detection of emergencies. The method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and improves detection efficiency.
Description
Technical Field
The invention relates to the field of text detection, in particular to a microblog-oriented emergency identification method.
Background
With the rapid development of information technology in China, social network platforms such as Weibo (microblog), Twitter and Facebook have become major sources of big data and important media for emergent events, and have repeatedly been the first to publish major emergencies such as natural disasters and terrorist incidents. Emergent public events touch the social, political, economic and cultural spheres of modern life and cover issues including medical treatment, education, law and entertainment. Detecting such events not only raises public awareness but also benefits related applications such as public-opinion mining, emerging-topic detection and topic-thread tracking. Based on the above, it is significant to design a more accurate and effective method for detecting emergencies on social network platforms such as microblogs.
The current microblog emergency detection task faces several urgent problems. On the one hand, traditional methods suffer from sparse short-text features and cannot resolve word-sense ambiguity. On the other hand, after a topic model extracts the event topics of documents, researchers usually apply a clustering algorithm such as K-means, which requires multiple iterations and a pre-specified number of clusters; this is inefficient and cannot detect emergencies quickly.
Disclosure of Invention
The invention aims to provide a BERT-BTM network-based microblog emergency detection method that addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods. The disclosed method comprises the following steps:
s1, reading a microblog data set, performing word segmentation processing on the microblog data set, and then removing stop words to obtain an original data set;
s2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-training BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
S3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, according to the co-occurrence relation between the words in the emergency word set and the words in the emergency word set, building a BERT-BTM network, and then dividing the BERT-BTM network to complete the detection of the emergency.
Preferably, the step S3 includes:
S3.1, constructing a BERT-BTM model: calculating the event distribution θ in the microblog data set according to the Dirichlet prior parameter α, and calculating the event z corresponding to the event distribution θ;
according to the prior parameter β_i fused with the BERT word vector set, calculating the event-word distribution φ corresponding to the event z;
calculating the 2 different words w_i, w_j of a word pair according to the event z and the event-word distribution φ;
S3.2, processing the original data set by using a BERT-BTM model to form word pairs;
S3.3, inputting the input data into the BERT-BTM model to obtain output data;
the input data comprises the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;
the output data comprises an incident distribution;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
the dictionary size is the number of words that the original data set does not repeat.
Preferably, the step S3.2 specifically includes:
S3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining the probability distributions of the word pairs and the word pair set.
Preferably, the S3.2.3 method for obtaining the probability distributions of the word pairs and the word pair set comprises:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j);
Preferably, the step S3.3 specifically includes:
s3.3.1, randomly distributing a theme for the word pair b;
S3.3.2, performing N iterations and processing each word pair b in the word pair set B;
S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + α) / (n_b + T_α · α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M · β_i)

In the two formulas above, n_z represents the number of times event z is assigned to a word pair b; n_b represents the number of word pairs in the original data set; T_α represents the number of events; n_{w_i|z} represents the number of times event z is assigned to word w_i; n_{w|z} represents the number of times word w is assigned to event z; M represents the dictionary size. S3.3.4, obtaining the word pair-event distribution p(z|b) according to p(z) and p(w|z):

p(z|b) = p(z) p(w_i|z) p(w_j|z) / Σ_{z'} p(z') p(w_i|z') p(w_j|z')

where p(w_i|z) represents the probability distribution of word w_i corresponding to event z, and p(w_j|z) the probability distribution of word w_j corresponding to event z;
S3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
Preferably, the emergency distribution includes: the document-event distribution, event-word distribution;
according to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
Preferably, the method for constructing the BERT-BTM network comprises the following steps:
the BERT-BTM network is represented by a data file in NET format;
the words in the emergency word set are used as nodes in the network;
and taking the co-occurrence relation between the words in the emergency word set as the edges between the network nodes.
Preferably, the method for dividing the BERT-BTM network comprises the following steps: continuously removing the edge with the highest edge betweenness by using a GN algorithm to divide the BERT-BTM network;
the GN algorithm flow is as follows:
sequentially calculating edge betweenness of each edge in the BERT-BTM network to be mined; finding out an edge with the maximum edge betweenness in the BERT-BTM network and then deleting the edge; recalculating edge betweenness of all the remaining edges; repeating the steps until all edges are deleted;
and the emergency word community takes the emergency word set as a clustering central point, and clusters the n corresponding microblog events in the emergency word set to obtain a final emergency cluster.
Preferably, the clustering method is unilateral clustering: and calculating the similarity S between the microblog event and the emergency word set, wherein when the similarity S between the microblog event and the emergency word set is greater than a threshold value, the microblog event is the emergency corresponding to the emergency cluster.
Preferably, the step of calculating the similarity S is as follows:
Let the two word sets be denoted C and H. The similarity of word set C relative to H is given by the function R(C,H):

R(C,H) = |C ∩ H| / |C|

The similarity of word set H relative to C is given by R(H,C):

R(H,C) = |C ∩ H| / |H|

The similarity S(C,H) of C and H is then:

S(C,H) = (R(C,H) + R(H,C)) / 2

When the similarity S(C,H) of H and C is greater than a certain threshold, H and C are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, completing the detection of the emergency.
The invention has the following beneficial effects: the method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and greatly improves detection efficiency. With this technical scheme, more accurate microblog emergencies can be obtained, and relevant departments can track follow-up event clues in time and control the escalation of events.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for detecting microblog emergency based on a BERT-BTM network;
FIG. 2 is a diagram of the structure of the BERT-BTM model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method for detecting microblog emergencies based on the BERT-BTM network, comprising the following steps:
step 1: reading a microblog data set, wherein the acquired data set comprises the following steps:
in 9, 15 days, including Taiwan power, high-pass, Samsung, SK, Haishi, Meiguang, etc., the chips are not supplied to Huacheng.
[ surprise! The drunk male overbridge falls and is caught by the driver through the car roof for 9 months and 13 days, and the drunk male in Wuhan climbs outside the overbridge, so that the condition is very critical. A van driver finds an abnormality, drives the vehicle below a man, and catches the man at the moment of falling.
"Hua is the first date of chip outage" -9.15 Ri American ban became effective, Niguann states that Hua would not be coreless.
The overpass of the drunken men is caught by the roof of the driver, Wuhan men hang the overpass with five meters, and at the critical moment, the citizen with great concentration stops the minibus to catch the overbridge men.
[ gold pink sunset in Beijing ] 15 days in the evening, and the sky in Beijing under the sun's illumination, the color of gold pink! Such sky really loves! This is almost always the end of summer!
Noise such as HTML-tag special characters in the text data set is filtered out by regular-expression matching to obtain a cleaned text sequence; the cleaned text sequence is then segmented with a word segmentation tool — the open-source ICTCLAS segmentation system is selected — to obtain a segmented sequence. Stop words are then removed from the microblog data set according to a stop-word list, and the processed data set is stored to obtain the original data set.
Raw data set:
hua ye/chip/outage/first day/china/chip/hundred million dollars/future/electricity/high pass/samsung/sea/lishi/beauty light/no more/supply/chip/give/hua ye
Fright/drunk/man/overpass/fall/driver/roof/catch/martial/drunk/man/climb to/overpass/out/situation/emergency/crisis/van/driver/sniff/abnormal/car/drive to/man/down/man/fall/instant/take/catch/car/man/down/man/fall/instant/go/catch/go
Hua is/chip/outage/first date/usa/ban/effect/nihonam/title/hua is/don't care/coreless/available
Drunk/man/overpass/fall/driver/roof/catch/wuhan/man/hang/five meters/high overpass/key/time/enthusiasm/citizen/stop/minibus/catch/fall bridge/man
Beijing/appeared/gold/pink/sunset/evening/Beijing/sky/sunset/shine/appeared/gold powder/color/sky/nearly/summer/ending
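As an illustration, the preprocessing of Step 1 can be sketched in Python. The stop-word list and the "/"-separated pre-segmented input are assumptions for illustration; a real pipeline would segment the cleaned text with ICTCLAS or a similar Chinese segmenter.

```python
import re

# Stand-in stop-word list (an assumption for illustration; a real system
# loads a standard Chinese stop-word file).
STOPWORDS = {"the", "a", "of", "and", "is"}

def preprocess(raw_text, stopwords=STOPWORDS):
    """Strip HTML tags and URLs by regular matching, split the pre-segmented
    text on '/', and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_text)      # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    tokens = [t.strip() for t in text.split("/") if t.strip()]
    return [t for t in tokens if t.lower() not in stopwords]

tokens = preprocess("<b>drunk</b>/man/overpass/fall/the/driver/roof/catch")
```

The same cleaning regexes would be applied before segmentation in the real pipeline; only the tokenization step differs.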
Step 2: vectorizing the original data set to obtain a vectorized word vector set, then calling a pre-trained BERT model to process the vectorized word vector set to obtain the BERT word vector set. The pre-trained BERT model is called through an API at the client to obtain the BERT word vector set.
The BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog event.
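A sketch of obtaining word vectors: the patent only states that a pre-trained BERT model is called through an API, so the `transformers` library and the `bert-base-chinese` checkpoint named in the comment are assumptions, not the patent's specification. When BERT splits a word into several subword tokens, one common way to obtain a single word vector is mean-pooling, shown here in pure Python:

```python
# A typical API call (an assumption of this sketch) would look like:
#   from transformers import BertTokenizer, BertModel
#   tok = BertTokenizer.from_pretrained("bert-base-chinese")
#   model = BertModel.from_pretrained("bert-base-chinese")
#   hidden = model(**tok(word, return_tensors="pt")).last_hidden_state
# The subword vectors in `hidden` can then be pooled into one word vector:

def mean_pool(subword_vectors):
    """Average equal-length subword vectors into one word vector."""
    n = len(subword_vectors)
    dim = len(subword_vectors[0])
    return [sum(v[i] for v in subword_vectors) / n for i in range(dim)]

vec = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Mean-pooling is one of several pooling choices; taking the first subword's vector or max-pooling are common alternatives.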
Step 3, constructing a BERT-BTM model, and processing the original data set through the BERT-BTM model to obtain an emergency word set:
S3.1, the BERT-BTM topic model shown in Fig. 2 is proposed. The event distribution θ in the microblog data set is obtained according to the Dirichlet prior parameter α, and the event z corresponding to θ is obtained from it; the event-word distribution φ_z corresponding to event z is obtained according to the prior parameter β_i fused with the BERT word vector set; the 2 different words w_i, w_j constituting a word pair are obtained according to the event z and the event-word distribution φ_z. The event-word distributions correspond to the input event number k; the event z and the words w_i, w_j form the word pair set.
S3.2, processing the original data set by using a BERT-BTM model, which specifically comprises the following steps:
S3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining word pairs:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j);
Calculating the probability of the word pair b:

p(b) = Σ_z p(z) p(w_i|z) p(w_j|z) = Σ_z θ_z φ_{i|z} φ_{j|z}

where p(b) is the probability of word pair b, p(z) = θ_z is the probability of event z, P(w_i|z) = φ_{i|z} is the probability of word w_i under event z, and P(w_j|z) = φ_{j|z} is the probability of word w_j under event z.
Calculating the probability of the word pair set B:

P(B) = Π_{(i,j)} Σ_z θ_z φ_{i|z} φ_{j|z}

where P(B) is the probability of the word pair set B, θ_z is the probability of event z, φ_{i|z} is the probability of word w_i under event z, and φ_{j|z} is the probability of word w_j under event z.
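The word-pair (biterm) construction and the probability p(b) above can be sketched as follows; the θ and φ values are toy numbers assumed for illustration, not values from the patent:

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """All unordered word pairs (biterms) of distinct words in one document."""
    return [tuple(sorted(p)) for p in combinations(doc_tokens, 2) if p[0] != p[1]]

def p_biterm(b, theta, phi):
    """p(b) = sum over events z of theta[z] * phi[z][w_i] * phi[z][w_j]."""
    wi, wj = b
    return sum(theta[z] * phi[z].get(wi, 0.0) * phi[z].get(wj, 0.0)
               for z in range(len(theta)))

biterms = extract_biterms(["huawei", "chip", "cutoff"])
# toy distributions for two events (assumed values, for illustration only)
theta = [0.5, 0.5]
phi = [{"huawei": 0.5, "chip": 0.5}, {"chip": 0.4, "cutoff": 0.6}]
prob = p_biterm(("chip", "huawei"), theta, phi)
```

Unlike LDA, which models word occurrences per document, BTM models these document-level word co-occurrences directly, which is what makes it suitable for short texts.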
Step S3.3: θ and φ of the BERT-BTM model are inferred using Gibbs sampling. Gibbs sampling is an efficient Markov chain Monte Carlo (MCMC) method that samples from a joint distribution by drawing from the conditional distribution of each variable in turn. The BERT-BTM model infers the document-event distribution as follows:
Input data: the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
outputting data: document-event distribution. The method specifically comprises the following steps:
s3.3.1, randomly distributing a theme for the word pair b;
S3.3.2, performing N iterations and processing each word pair b in the word pair set B;
Calculating the conditional probability distribution of the word pair b = (w_i, w_j):

P(z | z_{-b}, B) ∝ (n_z + α) · (n_{w_i|z} + β_i)(n_{w_j|z} + β_i) / (Σ_w n_{w|z} + M · β_i)²

where z denotes the event assignment of word pair b, and z_{-b} denotes the event assignments of all word pairs in B except b; n_z is the number of times event z is assigned to a word pair; n_{w_i|z} is the number of times event z is assigned to word w_i; n_{w|z} is the number of times word w is assigned to event z; M is the dictionary size, i.e. the number of distinct words in the original data set.
S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

p(z) = (n_z + α) / (n_b + T_α · α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M · β_i)

In the two formulas above, n_z is the number of times event z is assigned to a word pair; n_b is the number of word pairs in the original data set; T_α is the number of events; n_{w_i|z} is the number of times event z is assigned to word w_i; n_{w|z} is the number of times word w is assigned to event z; M is the dictionary size.
S3.3.4, calculating the word pair-event distribution p(z|b) from p(z) and p(w|z):

p(z|b) = p(z) p(w_i|z) p(w_j|z) / Σ_{z'} p(z') p(w_i|z') p(w_j|z')

where p(w_i|z) is the probability distribution of word w_i corresponding to event z, and p(w_j|z) is the probability distribution of word w_j corresponding to event z;
S3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

p(b|d) = n_d(b) / Σ_{b'} n_d(b')

where n_d(b) is the frequency of occurrence of word pair b in document d;
the document d and the original data set are the same data set;
S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) P(b|d)

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
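The Gibbs-sampling inference of steps S3.3.1 through S3.3.6 can be sketched as a minimal collapsed sampler. One simplification is assumed here: the BERT-fused prior β_i is replaced by a single scalar β, and the example documents and hyperparameter values are toys for illustration.

```python
import random
from itertools import combinations

def btm_gibbs(docs, K, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for the biterm topic model (BTM).
    docs: list of token lists; K: number of events.
    Returns the event distribution p(z) and event-word distributions p(w|z)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    M = len(vocab)                                  # dictionary size
    biterms = [tuple(sorted(p)) for d in docs for p in combinations(d, 2)]
    z_of = [rng.randrange(K) for _ in biterms]      # random initial events
    n_z = [0] * K                                   # word pairs assigned to z
    n_wz = [dict() for _ in range(K)]               # word counts per event
    for b, z in zip(biterms, z_of):
        n_z[z] += 1
        for w in b:
            n_wz[z][w] = n_wz[z].get(w, 0) + 1
    for _ in range(iters):
        for i, (wi, wj) in enumerate(biterms):
            z = z_of[i]                             # remove current assignment
            n_z[z] -= 1
            n_wz[z][wi] -= 1
            n_wz[z][wj] -= 1
            weights = []                            # conditional P(z | z_-b, B)
            for k in range(K):
                tot = sum(n_wz[k].values())
                weights.append((n_z[k] + alpha)
                               * (n_wz[k].get(wi, 0) + beta)
                               * (n_wz[k].get(wj, 0) + beta)
                               / ((tot + M * beta) ** 2))
            z = rng.choices(range(K), weights=weights)[0]
            z_of[i] = z                             # re-assign, restore counts
            n_z[z] += 1
            for w in (wi, wj):
                n_wz[z][w] = n_wz[z].get(w, 0) + 1
    n_b = len(biterms)
    p_z = [(n_z[k] + alpha) / (n_b + K * alpha) for k in range(K)]
    p_wz = [{w: (n_wz[k].get(w, 0) + beta) / (sum(n_wz[k].values()) + M * beta)
             for w in vocab} for k in range(K)]
    return p_z, p_wz

p_z, p_wz = btm_gibbs([["huawei", "chip", "cutoff"],
                       ["drunk", "overpass", "fall"]],
                      K=2, alpha=1.0, beta=0.1, iters=50)
```

The closing-formula lines implement the smoothed estimates of p(z) and p(w|z); p(z|b) and P(z|d) then follow from them as in S3.3.4 and S3.3.6.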
The incident distribution includes: the document-event distribution, event-word distribution.
The word vector set is mapped into an event vector set to obtain the emergency distribution. According to the emergency distribution, the words in the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution; an example is shown in tables 1 and 2:
TABLE 1
TABLE 2
The optimal topic number K = 3 is obtained according to perplexity, and the document-event distribution, the event distribution and the event-word distribution are obtained respectively (only the 3 words with the largest weight are retained per event).
And 4, constructing a BERT-BTM network according to the emergent event word set and the co-occurrence relation between words in the emergent event word set, and then dividing the BERT-BTM network to finish the emergent event detection.
The construction method of the BERT-BTM network is concretely as follows.
The BERT-BTM network is constructed by using the words in the emergency word set obtained from the BERT-BTM model as nodes and the co-occurrence relations between those words as edges. The network is represented using a NET-format data file, which is commonly used for complex networks and defines all nodes and edges of the network. The NET file comprises a Vertices section, describing the nodes of the BERT-BTM network, and an Edges section, describing the edges between them. Let {A, B, C} be an emergency word set obtained from the microblog data set; represented in NET format, its structure is shown in tables 3 and 4.
TABLE 3
Vertices
Node ID | Node label |
1 | A |
2 | B |
3 | C |
TABLE 4
Edges
Starting node ID | Endpoint node ID |
1 | 2 |
1 | 3 |
2 | 3 |
The emergency word sets obtained from the microblog data set are integrated into a node set VerticeSet and an edge set EdgeSet, and the two sets are output in turn to a NET file to obtain the BERT-BTM network, as shown in tables 5 and 6.
TABLE 5
Vertices
Node ID | Node label |
1 | Huawei |
2 | Chip |
3 | Drunk |
… | … |
n | Fall |
TABLE 6
Edges
Starting node ID | Endpoint node ID |
1 | 2 |
1 | 5 |
1 | 13 |
… | … |
n | 9 |
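Writing the node set and edge set to a Pajek-style NET file, as described above, can be sketched as follows (the node labels are from the running example; the helper name is illustrative):

```python
def to_net(nodes, edges):
    """Serialize an event-word network to Pajek-style .NET text.
    nodes: list of node labels; edges: list of (label, label) pairs."""
    idx = {label: i + 1 for i, label in enumerate(nodes)}   # ids start at 1
    lines = ["*Vertices %d" % len(nodes)]
    lines += ['%d "%s"' % (idx[n], n) for n in nodes]
    lines.append("*Edges")
    lines += ["%d %d" % (idx[a], idx[b]) for a, b in edges]
    return "\n".join(lines)

net = to_net(["Huawei", "chip", "drunk"], [("Huawei", "chip")])
```

The resulting text mirrors tables 3 through 6: a `*Vertices` header with one quoted label per node, then `*Edges` with one id pair per co-occurrence edge.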
And partitioning the network by adopting a GN algorithm so as to discover the emergency. The specific method comprises the following steps:
the GN algorithm classifies the network by continuously removing the edge with the highest edge betweenness when executing the event detection task, and the GN algorithm flow is as follows:
The edge betweenness of each edge in the BERT-BTM network to be mined is computed; the edge with the maximum edge betweenness is found and deleted; the edge betweenness of all remaining edges is recomputed; these steps are repeated until all edges are deleted. After the emergency communities are obtained by the GN algorithm, the words in the same community (the emergency word set) are used as cluster centers, and the n corresponding microblog events are clustered to find the microblog emergency clusters under the same emergency. Single-pass clustering is used: the similarity S between a microblog event and a microblog emergency word set is calculated, and when S exceeds a threshold, the microblog is considered to describe that emergency.
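A minimal sketch of the GN partitioning: edge betweenness via Brandes' algorithm for unweighted graphs, then one removal step. The two-triangle graph below is a toy example, not data from the patent.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes' algorithm for edge betweenness on an undirected graph.
    adj: {node: set of neighbours}; returns {frozenset({u, v}): score}."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):                   # dependency accumulation
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2.0 for e, b in bet.items()}     # undirected: halve

def girvan_newman_step(adj):
    """One GN iteration: delete the edge with the highest edge betweenness."""
    bet = edge_betweenness(adj)
    u, v = max(bet, key=bet.get)
    adj[u].discard(v); adj[v].discard(u)
    return (u, v)

# toy network: two triangles joined by the bridge c-d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
removed = girvan_newman_step(adj)
```

Repeating `girvan_newman_step` and tracking connected components between removals yields the community structure; here the bridge carries all cross-triangle shortest paths, so it is deleted first, splitting the graph into two communities.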
Let C = {c1, c2, c3, …, ct} and H = {h1, h2, h3, …, hm} be two word sets. When calculating the similarity of two word sets, a function R(C,H) is introduced to represent the similarity of word set C relative to H:

R(C,H) = |C ∩ H| / |C|

and likewise R(H,C) = |C ∩ H| / |H| for H relative to C. Further, the similarity S(C,H) of C and H is defined as:

S(C,H) = (R(C,H) + R(H,C)) / 2
similarity of H and C S(C,H)And when the similarity is larger than a certain threshold value, the H and the C are considered to be similar, and the microblog texts with the similarity larger than the threshold value are distributed to the same microblog emergency cluster to finish the detection of the microblog emergency. ResultsExamples are as follows:
clusters corresponding to each microblog (the clusters are represented by the labels 1-3, the label 1 corresponds to the first cluster, and so on):
the 1 st and 3 rd microblog descriptions of the event No. 1 are obtained; the 2 nd and 4 th microblogs describe the event No. 2; the 5 th microblog describes event number 3, as shown in table 7.
TABLE 7
Microblog numbering | Reference numerals |
1 | 1 |
2 | 2 |
3 | 1 |
4 | 2 |
5 | 3 |
The emergency described by each cluster is represented by several characteristic words, as shown in table 8, for example as follows:
TABLE 8
The invention has the following beneficial effects: the method addresses the sparsity of short-text data and the problem of word-sense ambiguity in existing microblog emergency detection methods, and greatly improves detection efficiency. With this technical scheme, more accurate microblog emergencies can be obtained, and relevant departments can track follow-up event clues in time and control the escalation of events.
The foregoing illustrates and describes the principles, general features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (10)
1. The method for detecting the microblog emergency based on the BERT-BTM network is characterized by comprising the following steps of:
s1, reading a microblog data set, performing word segmentation processing on the microblog data set, and then removing stop words to obtain an original data set;
s2, vectorizing the original data set to obtain a vectorized word vector set, and then calling a pre-training BERT model to process the vectorized word vector set to obtain a BERT word vector set;
the BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog;
s3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;
s4, building a BERT-BTM network according to the words in the emergency word set and the co-occurrence relations among them, and then dividing the BERT-BTM network to complete the detection of the emergency.
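The preprocessing of step S1 can be sketched as follows. This is a minimal illustration only: it assumes a toy whitespace tokenizer and an invented stop-word list in place of a real Chinese word segmenter (such as jieba) and a real stop-word dictionary.

```python
# Sketch of step S1: word segmentation followed by stop-word removal.
# STOP_WORDS is an illustrative, invented list, not the patent's dictionary.
STOP_WORDS = {"the", "a", "of", "in"}

def preprocess(posts):
    """Tokenize each microblog post and drop stop words (toy tokenizer)."""
    corpus = []
    for post in posts:
        tokens = [t.lower() for t in post.split()]  # stand-in for real segmentation
        corpus.append([t for t in tokens if t not in STOP_WORDS])
    return corpus

posts = ["A fire broke out in the city center"]
print(preprocess(posts))  # → [['fire', 'broke', 'out', 'city', 'center']]
```

The resulting token lists form the "original data set" that steps S2-S4 operate on.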
2. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the step S3 includes:
s3.1, constructing a BERT-BTM model: calculating the event distribution θ in the microblog data set according to the Dirichlet prior parameter α, and obtaining the event z corresponding to the event distribution θ;
calculating the event-word distribution φ corresponding to the event z according to the prior parameter β_i fused with the BERT word vector set;
sampling the two different words w_i, w_j of a word pair according to the event z and the event-word distribution φ;
S3.2, processing the original data set by using a BERT-BTM model to form word pairs;
s3.3, inputting the input data into a BERT-BTM model to obtain output data;
the input data comprises the number of events, the number of iterations, the α, the β_i, the word pair set, and the dictionary size;
the output data is an emergency distribution;
the input event number is the number of events z in the microblog data set;
the word pair set is a set of word pairs in the original data set;
the dictionary size is the number of distinct (non-repeated) words in the original data set.
3. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps: the step S3.2 specifically includes:
s3.2.1, obtaining the event distribution θ of the microblog data set: θ ~ Dir(α);
s3.2.2, obtaining the word distribution φ_z of the event z: φ_z ~ Dir(β_i);
S3.2.3, obtaining word pairs.
4. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps:
the word pairs in S3.2.3 are obtained as follows:
(a) obtaining an event z: z ~ Multi(θ);
(b) obtaining the words w_i, w_j: w_i, w_j ~ Multi(φ_z);
(c) obtaining a word pair b from the words w_i and w_j: b = (w_i, w_j).
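The generative process of claims 3-4 (θ ~ Dir(α), φ_z ~ Dir(β_i), z ~ Multi(θ), w_i, w_j ~ Multi(φ_z)) can be sketched with standard-library sampling. Note one simplification: in the actual model β_i is a word-specific prior fused with the BERT word vectors, whereas this illustration assumes a symmetric scalar prior.

```python
import random

def dirichlet(alphas):
    """Sample a probability vector from Dir(alphas) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_word_pair(alpha, beta, n_events, vocab):
    """One draw of the generative process: an event z, then a biterm b = (w_i, w_j)."""
    theta = dirichlet([alpha] * n_events)                  # theta ~ Dir(alpha)   (S3.2.1)
    z = random.choices(range(n_events), weights=theta)[0]  # z ~ Multi(theta)     (a)
    phi_z = dirichlet([beta] * len(vocab))                 # phi_z ~ Dir(beta)    (S3.2.2)
    w_i, w_j = random.choices(vocab, weights=phi_z, k=2)   # w_i, w_j ~ Multi(phi_z)  (b)
    return z, (w_i, w_j)                                   # word pair b          (c)

vocab = ["fire", "rescue", "flood", "earthquake"]
z, b = generate_word_pair(alpha=0.5, beta=0.1, n_events=3, vocab=vocab)
```

A BTM draws such biterms directly from the whole corpus rather than per document, which is what makes it suitable for sparse short texts like microblogs.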
5. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 2, wherein the method comprises the following steps:
the step S3.3 specifically includes:
s3.3.1, randomly assigning an event to each word pair b;
s3.3.2, performing N iterations, and processing each word pair b in the word pair set B;
s3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z):

p(z) = (n_z + α) / (n_b + T·α)

p(w|z) = (n_{w|z} + β_i) / (Σ_w n_{w|z} + M·β_i)

In the above two formulas, n_z represents the number of word pairs to which the event z is assigned; n_b represents the number of word pairs in the original data set; T represents the number of events; n_{w_i|z} represents the number of times the event z is assigned to the word w_i; n_{w|z} represents the number of times the word w is assigned to the event z; M represents the dictionary size;
s3.3.4, obtaining the word pair-event distribution p(z|b) from the p(z) and p(w|z):

p(z|b) ∝ p(z) · p(w_i|z) · p(w_j|z)

wherein p(w_i|z) represents the probability distribution of the word w_i corresponding to the event z, and p(w_j|z) represents the probability distribution of the word w_j corresponding to the event z;
s3.3.5, calculating the document-word pair distribution p(b|d):

p(b|d) = n_d(b) / Σ_b n_d(b)

wherein n_d(b) is the frequency of occurrence of the word pair b in the document d; the document d and the original data set are the same data set;
s3.3.6, calculating the document-event distribution P(z|d) from the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

P(z|d) = Σ_b P(z|b) · P(b|d)

wherein P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.
6. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 5, wherein:
the emergency distribution includes the document-event distribution and the event-word distribution;
according to the emergency distribution, the words of the emergency word set corresponding to the current document are obtained through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.
7. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the method for constructing the BERT-BTM network comprises the following steps:
the BERT-BTM network is represented in the NET data format;
the words in the emergency word set are used as nodes in the network;
and taking the co-occurrence relation between the words in the emergency word set as the edges between the network nodes.
8. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 1, wherein the method comprises the following steps:
the method for dividing the BERT-BTM network comprises the following steps: continuously removing the edge with the highest edge betweenness by using a GN algorithm to divide the BERT-BTM network;
the GN algorithm flow is as follows: calculate the edge betweenness of each edge in the BERT-BTM network to be mined; find the edge with the maximum edge betweenness in the BERT-BTM network and delete it; recalculate the edge betweenness of all remaining edges; repeat the above steps until all edges are deleted;
each emergency word community takes its emergency word set as the cluster center, and the n microblog events corresponding to the emergency word set are clustered to obtain the final emergency clusters.
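The GN division in claim 8 can be sketched with a brute-force edge-betweenness computation, adequate for small networks; the toy graph below (two tight word communities joined by a bridge, with invented node names) illustrates one removal step.

```python
from collections import deque
from itertools import combinations

def all_shortest_paths(adj, s, t):
    """Enumerate all shortest paths from s to t by BFS layering."""
    dist, preds = {s: 0}, {s: []}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    if t not in dist:
        return []
    def back(v):
        if v == s:
            return [[s]]
        return [p + [v] for u in preds[v] for p in back(u)]
    return back(t)

def edge_betweenness(adj):
    """Fractional count of pairwise shortest paths passing through each edge."""
    eb = {}
    for s, t in combinations(adj, 2):
        paths = all_shortest_paths(adj, s, t)
        for p in paths:
            for e in zip(p, p[1:]):
                e = tuple(sorted(e))
                eb[e] = eb.get(e, 0.0) + 1.0 / len(paths)
    return eb

def gn_step(adj):
    """One GN iteration: delete the edge with the maximum edge betweenness."""
    eb = edge_betweenness(adj)
    u, v = max(eb, key=eb.get)
    adj[u].remove(v)
    adj[v].remove(u)
    return (u, v)

# Two word communities {a, b, c} and {d, e, f} joined by the bridge c-d:
# all 9 cross-community shortest paths use c-d, so GN removes it first.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
       "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
removed = gn_step(adj)
```

Repeating `gn_step` and tracking connected components yields the communities (emergency word sets) that the claim uses as cluster centers.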
9. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 8, wherein:
the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when the similarity S is greater than a threshold value, the microblog event belongs to the emergency corresponding to that emergency cluster.
10. The method for detecting the microblog emergency based on the BERT-BTM network as claimed in claim 9, wherein:
the step of calculating the similarity S is as follows: let the two word sets be denoted C and H; the similarity of the word set C relative to H is given by the function R(C,H);
the similarity of the word set H relative to C is given by the function R(H,C);
the similarity S(C,H) of C and H is obtained from R(C,H) and R(H,C);
when the similarity S(C,H) of H and C is greater than a certain threshold value, H and C are considered similar, and events whose similarity exceeds the threshold are assigned to the same emergency cluster, completing the detection of the emergency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011109749.1A CN112257429B (en) | 2020-10-16 | 2020-10-16 | Microblog emergency detection method based on BERT-BTM network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112257429A true CN112257429A (en) | 2021-01-22 |
CN112257429B CN112257429B (en) | 2024-04-16 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113032557A (en) * | 2021-02-09 | 2021-06-25 | 北京工业大学 | Microblog hot topic discovery method based on frequent word set and BERT semantics |
CN117520484A (en) * | 2024-01-04 | 2024-02-06 | 中国电子科技集团公司第十五研究所 | Similar event retrieval method, system, equipment and medium based on big data semantics |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102289487A (en) * | 2011-08-09 | 2011-12-21 | 浙江大学 | Network burst hotspot event detection method based on topic model |
CN104573031A (en) * | 2015-01-14 | 2015-04-29 | 哈尔滨工业大学深圳研究生院 | Micro blog emergency detection method |
CN106611054A (en) * | 2016-12-26 | 2017-05-03 | 电子科技大学 | Method for extracting enterprise behavior or event from massive texts |
CN107273496A (en) * | 2017-06-15 | 2017-10-20 | 淮海工学院 | A kind of detection method of micro blog network region accident |
Non-Patent Citations (1)
Title |
---|
WU Xuemou: "Pansystems operations research: epochal change and the world's new revolution in science and technology, military affairs, and education", Computer & Digital Engineering, no. 12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||