CN112257429B

CN112257429B - Microblog emergency detection method based on BERT-BTM network

Info

Publication number: CN112257429B
Application number: CN202011109749.1A
Authority: CN
Inventors: 韩忠明; 黄楚蓉; 段大高; 张翙; 李俊
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2024-04-16
Anticipated expiration: 2040-10-16
Also published as: CN112257429A

Abstract

The invention discloses a microblog emergency detection method based on a BERT-BTM network. The method comprises: reading a microblog data set and preprocessing it to obtain an original data set; vectorizing the original data set to obtain a vectorized word vector set, and then processing the vectorized word vector set by calling a pre-trained BERT model to obtain a BERT word vector set; constructing a BERT-BTM model and processing the original data set through the BERT-BTM model; and constructing a BERT-BTM network and dividing it to complete emergency detection. The method addresses the sparsity of short text data and the unresolved word ambiguity in existing microblog emergency detection methods, and improves the efficiency of emergency detection.

Description

Microblog emergency detection method based on BERT-BTM network

Technical Field

The invention relates to the field of text detection, in particular to a microblog-oriented emergency recognition method.

Background

Public emergencies touch many areas of modern life, including society, politics, economy and culture, and cover issues such as medical care, education, law and entertainment. Detecting emergencies not only raises public attention but also benefits related applications such as public opinion mining, emerging topic detection and topic thread tracking. It is therefore significant to design a more accurate and effective method for detecting emergencies on social network platforms such as microblogs.

Current microblog emergency detection faces several urgent problems. On the one hand, traditional methods suffer from sparse short-text features and cannot resolve word ambiguity. On the other hand, after obtaining the event topics of documents with a topic model, researchers usually apply clustering algorithms such as K-means, which require many iterations and a pre-specified number of clusters, are inefficient, and cannot complete emergency detection quickly.

Disclosure of Invention

The invention aims to provide a microblog emergency detection method based on a BERT-BTM network, so as to solve the problems that short text data are sparse and word ambiguity cannot be resolved in existing microblog emergency detection methods. The method comprises the following steps:

S1, reading a microblog data set, performing word segmentation on the microblog data set, and then removing stop words to obtain an original data set;

S2, vectorizing the original data set to obtain a vectorized word vector set, and then processing the vectorized word vector set by calling a pre-trained BERT model to obtain a BERT word vector set;

the BERT word vector set is a word vector set formed by the word vectors corresponding to the words in each microblog;

S3, constructing a BERT-BTM model according to the Dirichlet prior parameter α and the prior parameter β_i fused with the BERT word vector set, and processing the original data set through the BERT-BTM model to obtain an emergency word set;

S4, constructing a BERT-BTM network according to the emergency word set and the co-occurrence relation among words in the emergency word set, and then dividing the BERT-BTM network to complete emergency detection.

Preferably, the step S3 includes:

S3.1, constructing the BERT-BTM model: calculating the event distribution θ of the microblog data set according to the Dirichlet prior α, and obtaining an event z from the event distribution θ;

calculating the event-word distribution φ corresponding to the event z according to the prior parameter β_i fused with the BERT word vector set;

obtaining the two different words w_i, w_j of a word pair from the event z and the event-word distribution φ;

S3.2, processing the original data set with the BERT-BTM model to form word pairs;

S3.3, inputting the input data into the BERT-BTM model to obtain output data;

the input data includes the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;

the output data includes the emergency distribution;

the number of events is the number of events z in the microblog data set;

the word pair set is the set of word pairs in the original data set;

the dictionary size is the number of distinct words in the original data set.

Preferably, the step S3.2 specifically includes:

S3.2.1, obtaining the event distribution of the microblog data set: θ ~ Dir(α);

S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);

S3.2.3, obtaining the probability distribution of the word pairs and of the word pair set.

Preferably, in S3.2.3 the probability distribution of the word pairs and of the word pair set is obtained as follows:

(a) Obtaining event z: z ~ Multi(θ);

(b) Obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);

(c) Obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j).
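The generative steps S3.2.1 to S3.2.3 can be illustrated with a minimal sketch. It assumes a scalar α, a per-word prior list `beta` standing in for β_i (how β_i is derived from the BERT vectors is not modelled), and hypothetical function names; it is an illustration of the sampling order, not the patented implementation.

```python
import random

def sample_dirichlet(concentrations, rng):
    """Draw one sample from Dir(concentrations) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]

def generate_word_pairs(num_events, vocab, alpha, beta, num_pairs, seed=0):
    """Sample word pairs in the order of steps S3.2.1 to S3.2.3.

    beta holds one prior value per vocabulary word (assumed stand-in for beta_i).
    """
    rng = random.Random(seed)
    # S3.2.1: theta ~ Dir(alpha), one event distribution for the whole data set
    theta = sample_dirichlet([alpha] * num_events, rng)
    # S3.2.2: phi_z ~ Dir(beta_i), one word distribution per event
    phi = [sample_dirichlet(beta, rng) for _ in range(num_events)]
    indices = range(len(vocab))
    pairs = []
    for _ in range(num_pairs):
        # (a) z ~ Multi(theta)
        z = rng.choices(range(num_events), weights=theta)[0]
        # (b) w_i, w_j ~ Multi(phi_z); the second draw excludes w_i so that
        #     the two words of the pair differ
        i = rng.choices(indices, weights=phi[z])[0]
        rest = [0.0 if k == i else p for k, p in enumerate(phi[z])]
        j = rng.choices(indices, weights=rest)[0]
        # (c) b = (w_i, w_j)
        pairs.append((z, (vocab[i], vocab[j])))
    return theta, phi, pairs
```

Each returned entry keeps the sampled event z next to its word pair, which makes it easy to check that words drawn under the same event tend to recur together.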

Preferably, the step S3.3 specifically includes:

S3.3.1, randomly assigning an event (topic) to each word pair b;

S3.3.2, performing N iterations, and processing each word pair b of the word pair set B;

S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z) of the original data set:

In the two formulas above, n_z represents the number of times event z is assigned to a word pair b; n_b represents the number of word pairs in the original data set; T represents the number of events; n_(w_i|z) represents the number of times word w_i is assigned to event z; n_(w|z) represents the number of times a word w is assigned to event z; M represents the dictionary size;
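For reference, in the standard biterm topic model the two estimates of S3.3.3 take the form below. This is a hedged sketch, assuming the method follows the standard estimates and treating the word prior as a single scalar β; how the BERT-fused β_i enters the formula is not specified here. The symbols are: n_z, the number of word pairs assigned to event z; n_b, the number of word pairs; T, the number of events; n_{w|z}, the number of times word w is assigned to event z; and M, the dictionary size.

```latex
p(z) = \frac{n_z + \alpha}{n_b + T\alpha},
\qquad
p(w_i \mid z) = \frac{n_{w_i \mid z} + \beta}{\sum_{w} n_{w \mid z} + M\beta}
```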

S3.3.4, obtaining the word pair-event distribution p(z|b) from p(z) and p(w|z):

where p(w_i|z) represents the probability of word w_i under event z, and p(w_j|z) represents the probability of word w_j under event z;

S3.3.5, calculating the document-word pair distribution p(b|d) in the original data set:

where n_d(b) is the frequency with which word pair b appears in document d;

the document d and the original data set are the same data set;

S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

where P (z|b) is a word pair-event distribution, P (b|d) is a document-word pair distribution, and P (z|d) is a document-event distribution.
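Steps S3.3.4 to S3.3.6 can be sketched as follows. The sketch assumes the standard biterm topic model forms, namely P(z|b) proportional to P(z)·P(w_i|z)·P(w_j|z), P(b|d) as the relative frequency of b in d, and P(z|d) as the sum over b of P(z|b)·P(b|d). The function names and the dictionary layout of `phi` are hypothetical.

```python
from collections import Counter

def word_pair_event(theta, phi, pair):
    """P(z|b): P(z) * P(w_i|z) * P(w_j|z), normalized over all events z."""
    w_i, w_j = pair
    scores = [theta[z] * phi[z][w_i] * phi[z][w_j] for z in range(len(theta))]
    total = sum(scores)
    return [s / total for s in scores]

def document_word_pair(doc_pairs):
    """P(b|d): count of word pair b in document d over all pair counts in d."""
    counts = Counter(doc_pairs)
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items()}

def document_event(theta, phi, doc_pairs):
    """P(z|d): sum over word pairs b in d of P(z|b) * P(b|d)."""
    result = [0.0] * len(theta)
    for b, p_b in document_word_pair(doc_pairs).items():
        for z, p_z_b in enumerate(word_pair_event(theta, phi, b)):
            result[z] += p_z_b * p_b
    return result
```

Here `theta` is a list of event probabilities and `phi[z]` maps each word to its probability under event z; the document is then assigned to the event with the largest P(z|d).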

Preferably, the emergency distribution includes: the document-event distribution, event-word distribution;

according to the emergency distribution, the words of the emergency word set corresponding to the current document are located through the document-event distribution, and the corresponding emergency word set is obtained through the event-word distribution.

Preferably, the method for constructing the BERT-BTM network comprises the following steps:

the BERT-BTM network is represented as a NET-format data file;

the words in the emergency word set are used as nodes in the network;

and the co-occurrence relation among the words in the emergency word set is used as an edge among the network nodes.

Preferably, the method for dividing the BERT-BTM network comprises the following steps: using the GN algorithm to repeatedly remove the edge with the highest edge betweenness, thereby dividing the BERT-BTM network;

the GN algorithm flow is as follows:

sequentially calculating the edge betweenness of each edge in the BERT-BTM network to be mined; finding the edge with the largest edge betweenness in the BERT-BTM network and then deleting the edge; recalculating the edge betweenness of all the remaining edges; repeating the steps until all edges are deleted;

after the emergency word communities are obtained, each emergency word set is taken as a cluster center, and the n microblog events corresponding to the emergency word set are clustered around it to obtain the final emergency clusters.
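The GN flow described above can be sketched in plain Python. Edge betweenness is computed with a Brandes-style breadth-first accumulation, and the highest-scoring edge is removed until the network splits. The stopping rule (`target_communities`) is an assumption added for illustration, since the text only states that removal continues until all edges are deleted; all function names are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (adj: node -> set)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u] if u != v}
    for source in adj:
        # Breadth-first search counting shortest paths (sigma) from source
        dist = {source: 0}
        sigma = {node: 0.0 for node in adj}
        sigma[source] = 1.0
        preds = {node: [] for node in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to source
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every undirected shortest path was counted once from each endpoint
    return {edge: value / 2.0 for edge, value in score.items()}

def components(adj):
    """Connected components of the graph as a list of node sets."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups

def girvan_newman_split(edges, target_communities):
    """Remove the highest-betweenness edge until enough communities appear."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    while True:
        groups = components(adj)
        if len(groups) >= target_communities or not any(adj.values()):
            return groups
        scores = edge_betweenness(adj)      # recomputed after every removal
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
```

Recomputing the betweenness after each removal is what distinguishes the GN procedure from a one-shot ranking of edges; on a word network, each resulting component plays the role of one emergency word community.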

Preferably, the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when S is greater than a threshold, the microblog event is assigned to the emergency corresponding to that emergency cluster.

Preferably, the step of calculating the similarity S is as follows:

letting the two word sets be C and H, the similarity function of word set C relative to H is R_(C,H):

the similarity function of word set H relative to C is R_(H,C):

the similarity S_(C,H) of C and H is:

when the similarity S_(C,H) between H and C is greater than the threshold, H and C are considered similar, and events whose similarity is greater than the threshold are assigned to the same emergency cluster, completing the detection of the emergency.

The invention has the following beneficial effects: the method addresses the sparsity of short text data and the unresolved word ambiguity in existing microblog emergency detection methods, and improves the efficiency of emergency detection. The technical scheme identifies microblog emergencies more accurately, which helps the relevant departments to track subsequent event threads in a timely manner and to contain the escalation of events.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for detecting microblog emergency events based on a BERT-BTM network;

FIG. 2 is a diagram of the BERT-BTM model.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention provides a method for detecting microblog emergency events based on a BERT-BTM network, which specifically comprises the following steps:

Step 1: reading the microblog data set. The collected data set is as follows:

[ First day of the Huawei chip supply cutoff: the 300-billion-dollar future of Chinese chips ] On September 15, suppliers including TSMC, Qualcomm, Samsung, SK Hynix and Micron will no longer supply chips to Huawei.

[ Shocking! Drunk man falls from an overpass and is caught on the roof of a driver's vehicle ] On September 13, a drunk man in a certain city climbed over the outside of an overpass, and the situation was critical. A van driver noticed something was wrong, drove the van beneath the man, and caught him at the moment he fell.

[ First day of the Huawei chip supply cutoff ] On the 15th, the United States ban took effect; Ni Guangnan said that Huawei will not be left without chips to use.

[ Drunk man falls from an overpass and is caught by a driver with his vehicle ] A man in a certain city was hanging outside a five-meter-high overpass; at the critical moment, a kind-hearted citizen stopped his minibus and caught the falling man.

[ Beijing turns golden pink at dusk ] On the evening of the 15th, the Beijing sky turned golden pink in the light of the sunset! What a lovely sky! Summer is almost over, too!

Noise such as HTML tags and special characters is filtered out of the text data set by regular-expression matching to obtain cleaned text sequences. The cleaned text is then segmented with a word segmentation tool, here the open-source ICTCLAS word segmentation system, to obtain segmented sequences. Stop words are then removed from the microblog data set according to a stop-word list, and the processed data set is stored as the original data set.
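The preprocessing described above can be sketched as follows. The regular expressions, function names and the whitespace-based default segmenter are assumptions for illustration; in practice the `segment` argument would wrap a real Chinese word segmenter such as ICTCLAS, whose API is not shown here.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # HTML tags
NOISE_RE = re.compile(r"[^\w\s]")    # punctuation and other special characters

def clean_text(raw):
    """Strip HTML entities, tags and special characters from one microblog."""
    text = html.unescape(raw)
    text = TAG_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(posts, stop_words, segment=str.split):
    """Clean, segment and filter stop words; returns one token list per post.

    segment is a placeholder for the word segmentation tool; the default
    simply splits on whitespace.
    """
    dataset = []
    for raw in posts:
        tokens = segment(clean_text(raw))
        dataset.append([t for t in tokens if t not in stop_words])
    return dataset
```

Because `\w` matches Unicode word characters in Python 3, the noise pattern keeps Chinese characters while dropping punctuation.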

Raw data set:

Huawei/chip/supply cutoff/first day/China/chip/billion dollars/future/TSMC/Qualcomm/Samsung/Hynix/Micron/no longer/supply/chip/to/Huawei

Shocking/drunk/man/overpass/fall/driver/roof/catch/city/drunk/man/climb/overpass/outside/situation/very/critical/van/driver/notice/abnormal/vehicle/drive/man/below/man/fall/instant/catch

Huawei/chip/supply cutoff/first day/United States/ban/take effect/Ni Guangnan/claim/Huawei/will not/without chips/usable

Drunk/man/overpass/fall/driver/roof/catch/city/man/hang/five meters/overpass/critical/moment/kind-hearted/citizen/stop/minibus/catch/falling/man

Beijing/present/golden/pink/evening/Beijing/sky/sunset/illumination/present/golden/pink/sky/near/summer/end

Step 2: the original data set is vectorized to obtain a vectorized word vector set, which is then processed by calling a pre-trained BERT model to obtain the BERT word vector set. The pre-trained BERT model is called from the client through an API to obtain the BERT word vector set.

The BERT word vector set is a word vector set formed by word vectors corresponding to words in each microblog event.

Step 3, constructing a BERT-BTM model, and processing the original data set through the BERT-BTM model to obtain an emergency word set:

S3.1, a BERT-BTM topic model as shown in FIG. 2 is proposed. The event distribution θ of the microblog data set is obtained according to the Dirichlet prior α, and an event z is obtained from the event distribution θ. The event-word distribution φ corresponding to the event z is obtained according to the prior parameter β_i fused with the BERT word vector set. The two different words w_i, w_j forming a word pair are obtained from the event z and the event-word distribution φ. There are k event-word distributions φ, one for each of the k input events; the event z and the words w_i, w_j range over the word pair set.

S3.2, processing the original data set by using a BERT-BTM model, wherein the method specifically comprises the following steps:

S3.2.1, obtaining the event distribution of the microblog data set: θ ~ Dir(α);

S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);

S3.2.3, obtaining word pairs:

(a) Obtaining event z: z ~ Multi(θ);

(b) Obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);

(c) Obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j).

Calculating the conditional probability of the word pair b:

where P(b) is the conditional probability of word pair b, P(z) = θ_z represents the probability of event z, P(w_i|z) = φ_(i|z) represents the probability of word w_i under event z, and P(w_j|z) = φ_(j|z) represents the probability of word w_j under event z.

Calculating the probability of the word pair set B:

where P(B) is the probability of the word pair set B, θ_z represents the probability of event z, φ_(i|z) represents the probability of word w_i under event z, and φ_(j|z) represents the probability of word w_j under event z.
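For reference, in the standard biterm topic model these two probabilities are written as below, using the same symbols θ_z, φ_(i|z) and φ_(j|z). This is a hedged sketch that assumes the BERT-BTM model keeps the standard form: a word pair's probability marginalizes over events, and the word pair set's probability is the product over its word pairs.

```latex
P(b) = \sum_{z} P(z)\, P(w_i \mid z)\, P(w_j \mid z)
     = \sum_{z} \theta_z\, \phi_{i \mid z}\, \phi_{j \mid z}

P(B) = \prod_{(i,j)} \sum_{z} \theta_z\, \phi_{i \mid z}\, \phi_{j \mid z}
```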

Step S3.3: the parameters θ and φ of the BERT-BTM model are inferred using Gibbs sampling. Gibbs sampling is an efficient Markov chain Monte Carlo (MCMC) sampling method that draws samples from a joint distribution by using the conditional distribution of each variable. The steps by which the BERT-BTM model infers the document-event distribution are as follows:

input data: the number of events, the number of iterations, α and β_i, the word pair set, and the dictionary size;

the word pair set is the set of word pairs in the original data set;

output data: the document-event distribution. The steps are as follows:

S3.3.1, randomly assigning an event (topic) to each word pair b;

calculating the conditional probability distribution of the word pair b = (w_i, w_j):

wherein: z represents the event assignment of the word pair b, and z_(-b) represents the event assignments of the word pair set B excluding the word pair b; n_z represents the number of times event z is assigned to a word pair b; n_(w_i|z) represents the number of times word w_i is assigned to event z; n_(w|z) represents the number of times a word w is assigned to event z; M represents the dictionary size, i.e. the number of distinct words in the original data set.

Updating

In the two formulas above, n_z represents the number of times event z is assigned to a word pair b; n_b represents the number of word pairs in the original data set; T represents the number of events; n_(w_i|z) represents the number of times word w_i is assigned to event z; n_(w|z) represents the number of times a word w is assigned to event z; M represents the dictionary size.

S3.3.4, calculating the word pair-event distribution p(z|b) from p(z) and p(w|z):

the document d and the original data set are the same data set;

S3.3.6, calculating the document-event distribution P(z|d) from the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):
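The inference loop of S3.3 can be sketched as a collapsed Gibbs sampler. The sketch assumes the standard biterm topic model conditional, in which the weight of event z for word pair b is (n_z + α)(n_(w_i|z) + β)(n_(w_j|z) + β) / (Σ_w n_(w|z) + Mβ)², and it uses a single scalar `beta`; the BERT-fused β_i is not modelled. Function and variable names are hypothetical.

```python
import random

def gibbs_btm(pairs, num_events, vocab_size, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling over word pairs given as (w_i, w_j) word ids.

    Returns (p_z, p_w_given_z), the event and event-word distributions.
    """
    rng = random.Random(seed)
    n_z = [0] * num_events                                 # word pairs per event
    n_wz = [[0] * vocab_size for _ in range(num_events)]   # word counts per event
    assign = []
    # S3.3.1: random initial event for every word pair
    for w_i, w_j in pairs:
        z = rng.randrange(num_events)
        assign.append(z)
        n_z[z] += 1
        n_wz[z][w_i] += 1
        n_wz[z][w_j] += 1
    # S3.3.2: N sweeps over the word pair set B
    for _ in range(iterations):
        for idx, (w_i, w_j) in enumerate(pairs):
            z_old = assign[idx]
            # Remove the current pair from the counts (the "-b" counts)
            n_z[z_old] -= 1
            n_wz[z_old][w_i] -= 1
            n_wz[z_old][w_j] -= 1
            weights = []
            for z in range(num_events):
                # Each pair contributes two words, so sum_w n_(w|z) = 2 * n_z
                denom = 2 * n_z[z] + vocab_size * beta
                weights.append((n_z[z] + alpha)
                               * (n_wz[z][w_i] + beta)
                               * (n_wz[z][w_j] + beta) / denom ** 2)
            z_new = rng.choices(range(num_events), weights=weights)[0]
            assign[idx] = z_new
            n_z[z_new] += 1
            n_wz[z_new][w_i] += 1
            n_wz[z_new][w_j] += 1
    # S3.3.3: point estimates of p(z) and p(w|z) from the final counts
    total = len(pairs)
    p_z = [(n_z[z] + alpha) / (total + num_events * alpha)
           for z in range(num_events)]
    p_w_z = [[(n_wz[z][w] + beta) / (2 * n_z[z] + vocab_size * beta)
              for w in range(vocab_size)] for z in range(num_events)]
    return p_z, p_w_z
```

The returned distributions are the inputs needed for the word pair-event and document-event distributions of S3.3.4 to S3.3.6.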

The emergency distribution includes: the document-event distribution, event-word distribution.

Mapping the word vector set into an event vector set to obtain emergency distribution, and obtaining the words in the corresponding emergency word set of the current document through document-event distribution according to the emergency distribution; the corresponding emergency vocabulary is obtained through event-word distribution, and examples are as follows in tables 1 and 2:

TABLE 1

TABLE 2

The optimal number of topics, k = 3, is determined from the perplexity, and the document-event distribution, the event distribution and the event-word distribution are obtained accordingly (only the 3 words with the largest weights are kept).

Step 4: constructing a BERT-BTM network according to the emergency word set and the co-occurrence relation among words in the emergency word set, and then dividing the BERT-BTM network to complete emergency detection.

The BERT-BTM network construction method is specifically as follows.

The BERT-BTM network is constructed by taking the words in the emergency word set obtained from the BERT-BTM model as nodes and the co-occurrence relations between those words as edges. The network is represented in the NET file format commonly used for complex networks, which defines all nodes and edges of the network. A NET file contains two parts, Vertices and Edges: Vertices describes the nodes of the BERT-BTM network, and Edges describes the edges between those nodes. Assuming {A, B, C} is the emergency word set obtained from the microblog data set, its representation in NET format has the structure shown in Tables 3 and 4.

TABLE 3

Vertices

Node ID	Node label
1	A
2	B
3	C

TABLE 4

Edges

Start node ID	End node ID
1	2
1	3
2	3

And integrating the emergency word set obtained from the microblog data set into a node set VerticeSet and an edge set EdgesSet, and sequentially outputting the two sets into a NET file to obtain a BERT-BTM network, as shown in tables 5 and 6.

TABLE 5

Vertices

Node ID	Node label
1	Huawei
2	Chip
3	Drunk
…	…
n	Fall

TABLE 6

Edges

Start node ID	End node ID
1	2
1	5
1	13
…	…
n	9
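The network construction above can be sketched as follows. Two points are assumptions made for illustration: co-occurrence is taken to mean that two words appear in the same word set, and the output uses Pajek-style `*Vertices` and `*Edges` section markers for the two parts of the NET file. The function names are hypothetical.

```python
from itertools import combinations

def build_network(word_sets):
    """Return (labels, edges): words as nodes, co-occurring words as edges."""
    labels = []          # labels[k] is the label of node ID k + 1
    ids = {}
    edges = set()
    for words in word_sets:
        for w in words:
            if w not in ids:
                ids[w] = len(labels) + 1     # node IDs start at 1
                labels.append(w)
        # Link every pair of distinct words in the same set, lower ID first
        for a, b in combinations(sorted(set(words), key=ids.get), 2):
            edges.add((ids[a], ids[b]))
    return labels, sorted(edges)

def to_net(labels, edges):
    """Serialize the node set and edge set as NET-style text."""
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for i, label in enumerate(labels, start=1)]
    lines.append("*Edges")
    lines += [f"{u} {v}" for u, v in edges]
    return "\n".join(lines) + "\n"
```

For the word set {A, B, C}, `build_network([["A", "B", "C"]])` yields the three nodes of Table 3 and the three edges of Table 4.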

And dividing the network by adopting a GN algorithm, thereby discovering the emergency. The specific method comprises the following steps:

The GN algorithm partitions the network by repeatedly removing the edge with the highest edge betweenness. The GN algorithm flow is as follows: calculate the edge betweenness of each edge in the BERT-BTM network; find the edge with the largest edge betweenness and delete it; recalculate the edge betweenness of all remaining edges; repeat these steps until all edges are deleted.

After the emergency communities are obtained through the GN algorithm, the words in the same community (the emergency word set) are taken as the cluster center, and the n microblog events corresponding to the emergency word set are clustered to find the microblog cluster belonging to the same emergency. Clustering uses the single-pass method: the similarity S between a microblog event and the emergency word set is calculated, and when S is greater than a threshold, the microblog is considered to describe that emergency.

Let C and H be two word sets, C = {c1, c2, c3, …, ct} and H = {h1, h2, h3, …, hm}. To calculate the similarity of the two word sets, a function R_(C,H) is introduced to express the similarity of word set C relative to H:

further, the similarity S_(C,H) of C and H is defined as:

When the similarity S_(C,H) between H and C is greater than a certain threshold, H and C are considered similar, and microblog texts whose similarity is greater than the threshold are assigned to the same microblog emergency cluster, completing the detection of the microblog emergency. The results are exemplified as follows:

The cluster corresponding to each microblog is listed below (clusters are labeled 1 to 3; label 1 corresponds to the first cluster, and so on).

Microblogs 1 and 3 describe event 1; microblogs 2 and 4 describe event 2; microblog 5 describes event 3, as shown in Table 7.

TABLE 7

Microblog number	Cluster label
1	1
2	2
3	1
4	2
5	3
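The single-pass assignment of microblogs to emergency word sets can be sketched as below. The similarity functions are assumed stand-ins: R_(C,H) is taken as the share of the words of C that also occur in H, and S_(C,H) as the mean of the two directional scores. These choices and the function names are illustrative assumptions, not the formulas of the method.

```python
def relative_similarity(c, h):
    """Assumed stand-in for R_(C,H): share of the words of C found in H."""
    c, h = set(c), set(h)
    return len(c & h) / len(c) if c else 0.0

def similarity(c, h):
    """Assumed stand-in for S_(C,H): mean of R_(C,H) and R_(H,C)."""
    return (relative_similarity(c, h) + relative_similarity(h, c)) / 2.0

def assign_clusters(microblogs, event_word_sets, threshold):
    """Visit each microblog once and label it with the best-matching cluster.

    Returns a 1-based cluster label per microblog, or None when no
    emergency word set exceeds the threshold.
    """
    labels = []
    for words in microblogs:
        scores = [similarity(words, center) for center in event_word_sets]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best + 1 if scores[best] > threshold else None)
    return labels
```

Each microblog is visited once and compared only against the fixed cluster centers, so no iteration count or cluster number has to be specified beyond the word sets themselves.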

The emergency described by each cluster is represented by several feature words, as shown in Table 8:

TABLE 8

The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The method for detecting the microblog emergency event based on the BERT-BTM network is characterized by comprising the following steps of:

S4, constructing a BERT-BTM network according to the emergency word set and the co-occurrence relation among words in the emergency word set, and then dividing the BERT-BTM network to complete emergency detection;

the step S3 includes:

S3.3, inputting the input data into the BERT-BTM model to obtain output data;

the input data includes the number of events, the number of iterations, α, β_i, the word pair set and the dictionary size;

the output data is the emergency distribution;

the number of events is the number of events z in the microblog data set;

the word pair set is the set of word pairs in the original data set;

the dictionary size is the number of distinct words in the original data set;

the step S3.3 specifically includes:

S3.3.1, randomly assigning an event (topic) to each word pair b;

S3.3.3, calculating the event distribution p(z) and the event-word distribution p(w|z):

S3.3.4, obtaining the word pair-event distribution p(z|b) from p(z) and p(w|z):

S3.3.5, calculating the document-word pair distribution p(b|d):

the document d and the original data set are the same data set;

S3.3.6, calculating the document-event distribution P(z|d) according to the word pair-event distribution P(z|b) and the document-word pair distribution P(b|d):

where P(z|b) is the word pair-event distribution, P(b|d) is the document-word pair distribution, and P(z|d) is the document-event distribution.

2. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 1, wherein the method comprises the following steps: the step S3.2 specifically includes:

S3.2.1, obtaining the event distribution of the microblog data set: θ ~ Dir(α);

S3.2.2, obtaining the word distribution φ_z of event z: φ_z ~ Dir(β_i);

S3.2.3, obtaining word pairs.

3. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 1, wherein the method comprises the following steps:

the S3.2.3 obtains word pairs:

(a) Obtaining event z: z ~ Multi(θ);

(b) Obtaining words w_i, w_j: w_i, w_j ~ Multi(φ_z);

(c) Obtaining the word pair b from the words w_i, w_j: b = (w_i, w_j).

4. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 1, wherein the method comprises the following steps:

the emergency distribution includes: the document-event distribution, event-word distribution;

5. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 1, wherein the method comprises the following steps:

the method for constructing the BERT-BTM network comprises the following steps:

the BERT-BTM network is represented as a NET-format data file;

the words in the emergency word set are used as nodes in the network;

6. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 1, wherein the method comprises the following steps:

the method for dividing the BERT-BTM network comprises the following steps: using the GN algorithm to repeatedly remove the edge with the highest edge betweenness, thereby dividing the BERT-BTM network;

the GN algorithm flow is as follows: sequentially calculating the edge betweenness of each edge in the BERT-BTM network to be mined; finding the edge with the largest edge betweenness in the BERT-BTM network and then deleting the edge; recalculating the edge betweenness of all the remaining edges; repeating the steps until all edges are deleted;

7. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 6, wherein the method comprises the following steps:

the clustering method is single-pass clustering: the similarity S between a microblog event and the emergency word set is calculated, and when S is greater than a threshold, the microblog event is assigned to the emergency corresponding to that emergency cluster.

8. The method for detecting the microblog emergency event based on the BERT-BTM network according to claim 7, wherein the method comprises the following steps:

the step of calculating the similarity S is as follows: letting the two word sets be C and H, the similarity function of word set C relative to H is R_(C,H):

the similarity function of word set H relative to C is R_(H,C):

the similarity S_(C,H) of C and H is:

when the similarity S_(C,H) between H and C is greater than a certain threshold, H is considered similar to C, and events whose similarity is greater than the threshold are assigned to the same emergency cluster, completing the detection of the emergency.