CN111966917A - Event detection and summarization method based on pre-training language model - Google Patents

Event detection and summarization method based on pre-training language model

Info

Publication number
CN111966917A
CN111966917A
Authority
CN
China
Prior art keywords
text
event
events
event detection
vector
Prior art date
Legal status
Granted
Application number
CN202010661898.2A
Other languages
Chinese (zh)
Other versions
CN111966917B (en)
Inventor
卢国明
段贵多
秦科
罗光春
顾坚彬
李康康
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010661898.2A
Publication of CN111966917A
Application granted
Publication of CN111966917B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 16/9536: Search customisation based on social or collaborative filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention discloses an event detection and summarization method based on a pre-trained language model. Oriented to social media platforms, the method detects the key events within hot topics, improves the effect of event detection, and uses event summaries to improve the presentation of event content. The method comprises S1: preprocessing the text; S2: vectorizing the text; S3: training the event detection model; S4: displaying the mined events. The invention uses a pre-trained language model to mine the semantic and structural information of the input text, improving the text representation, and completes event detection and summarization with subsequent neural networks, improving the accuracy and recall of event detection and the semantic representation of event content.

Description

Event detection and summarization method based on pre-training language model
Technical Field
The invention relates to the field of data mining and natural language processing, in particular to an event detection and summarization method based on a pre-training language model.
Background
With the development of the internet, social media has become part of daily life. People discuss the hot topics of the day on social media platforms and follow social developments there, and these platforms have become an important source of information for the public. As application demands in internet public-opinion monitoring and information security continue to grow, it is increasingly important to understand the content behind a topic at a finer granularity, at a deeper level, and from more angles.
A topic is composed of a set of related events, and a series of related events drives the development and change of the topic. Faced with massive amounts of information, extracting the events contained in a hot topic helps display the topic's development process and lets people understand its context. Effectively mining the topic events contained in text has therefore become a serious challenge.
Event detection is essentially a clustering process that groups texts into clusters, each cluster representing one event. Event detection algorithms can be broadly divided into two categories. Document-based methods detect events by clustering documents according to semantic distance: for example, computing text similarity with a TF-IDF vector space model and then clustering the text stream with the SinglePass algorithm to detect occurring events. Feature-based methods study word distributions and discover event keywords: for example, mining event keywords with topic models and related refinements while soft-clustering documents according to the probability of the event each document belongs to. A minimal illustration of the document-based approach follows.
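For concreteness, here is a small sketch of the document-based baseline: TF-IDF vectors compared by cosine similarity, with a SinglePass-style assignment loop. The toy documents and the similarity threshold are illustrative assumptions, not taken from the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents (illustrative only); real input would be segmented social media text.
docs = ["black box found sent to france for analysis",
        "boeing share price fell sharply after the accident",
        "the black box will be analysed in france"]
tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF vector space model
sim = cosine_similarity(tfidf)                  # word-level similarity only

# SinglePass-style clustering: join the first cluster whose seed document is
# similar enough, otherwise open a new event cluster.
threshold, clusters = 0.2, []
for i in range(len(docs)):
    for cluster in clusters:
        if sim[i, cluster[0]] >= threshold:
            cluster.append(i)
            break
    else:
        clusters.append([i])
print(clusters)  # e.g. [[0, 2], [1]]: docs 0 and 2 share words, doc 1 stands alone
```

Because the comparison is purely lexical, documents 0 and 2 cluster only because they repeat the same words; a paraphrase using synonyms would be missed, which is exactly the weakness discussed next.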
Both kinds of method stay at the word level when processing text and cannot mine the deeper information of a document, so the event detection effect is poor. Document-based methods rely on word-level similarity comparison, cannot handle near-synonyms and synonyms, underuse the topical semantic information implicit in a document, and fail to capture its lexical semantics. Feature-based methods depend on feature selection; most social media texts are short, so word co-occurrence is sparse, which hurts the topic model. In addition, both kinds of method represent event content with keywords, whose semantics are vague and easily ambiguous.
Disclosure of Invention
At present, pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) effectively improve the results of related tasks in natural language processing, and neural networks can also effectively model text and process its semantic and structural information. Therefore, to address the problems of current methods, the invention uses a pre-trained language model to mine the semantic information of documents and completes event detection and event summarization with subsequent neural networks, improving both the detection and the presentation of events.
The invention provides an event detection and summarization method based on a pre-trained language model, which improves event detection and summarization by mining the semantic and structural information of text. The method processes the input text with a pre-trained language model to capture its semantic and structural information, clusters the text with subsequent neural networks to detect the events within a topic, and summarizes the events at the same time. The invention achieves better accuracy and recall on the event detection task, while the summaries improve the presentation of event content.
The invention comprises the following steps:
S1: Preprocess the input social media text, deleting unneeded information and segmenting the text into words.
The specific sub-process is as follows:
S11: Denote the input social media text set as $D = \{d_1, d_2, \dots, d_{|D|}\}$; for each text in $D$, obtain its comments, and denote the resulting comment text set as $C = \{c_1, c_2, \dots, c_{|C|}\}$, giving $|D|$ social media texts and $|C|$ comment texts in total; use regular expressions to delete short links and irrelevant "@user" mentions from the texts;
S12: Use a word segmentation tool to segment each text into words and delete low-frequency words, obtaining the corresponding word-sequence sets $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,m_1}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,m_{|D|}})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,k_1}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,k_{|C|}})\}$$

where $w$ denotes a word in a sentence; $m$ and $k$ denote the lengths of the texts in $D'$ and $C'$ respectively, their subscripts giving the text number (e.g., $m_1$ and $k_1$ are the lengths of the first texts in $D'$ and $C'$); and $w_{i,j}$ denotes the $j$-th word of the $i$-th text, so $w_{1,m_1}$ and $w_{1,k_1}$ denote the $m_1$-th and $k_1$-th words, i.e. the last words, of the first texts in $D'$ and $C'$ respectively.
S2: and (3) carrying out vectorization representation on each word w in the input D 'and C' by using a BERT model as an encoder, and mining semantic and structural information of the text. The specific sub-process is as follows:
S21: Determine the length $n$ of the texts in the $D'$ and $C'$ sets;
S22: Add marker symbols at the beginning and end of every text in $D'$ and $C'$; texts longer than $n$ keep only their first $n$ words, and texts shorter than $n$ are padded with a supplementary mark so that all texts have length $n$, giving the updated $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,n})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,n})\}$$
S23: Use the BERT model to mine the semantic and structural information of the texts and obtain vectorized representations of the input $D'$ and $C'$: each text in $D'$ yields a corresponding vector $\bar{d}_i$, giving the vector set $\bar{D} = \{\bar{d}_1, \bar{d}_2, \dots, \bar{d}_{|D|}\}$, where the subscript is the number of the text; likewise, each text in $C'$ yields a corresponding vector $\bar{c}_i$, giving the vector set $\bar{C} = \{\bar{c}_1, \bar{c}_2, \dots, \bar{c}_{|C|}\}$.
S3: Based on the vectorized text sets $\bar{D}$ and $\bar{C}$, obtain the event vectors using a convolutional neural network combined with a memory network, and complete the training of the model. The specific sub-process is as follows:
S31: Input $\bar{C}$ into the convolutional neural network to obtain the global key-event feature $\bar{v}$. The convolution formula is

$$v_i = f(w \cdot \bar{c}_{i:i+h-1} + b)$$

where $w$ is the weight matrix, $h$ the size of the convolution kernel, $b$ the bias and $f$ the activation function; $v_i$ is the event feature obtained by the convolution, and max-pooling over $v = [v_1, v_2, \dots]$ yields $\bar{v}$;
S32: Take the text vectors in $\bar{D}$ as external information and, combined with the key-event information $\bar{v}$ from the comments, input them into the memory network; learning yields the event representation matrix $E = [e_1, e_2, \dots, e_k]$, where $e$ is an event vector and $k$ is a hyper-parameter fixed in advance, meaning that $k$ events are to be mined;
S33: Finally, input the text vectors of $\bar{D}$ into a decoder with the GRU as its basic unit to restore the input sequence $D'$, and complete the training of the event detection model with the associated preset loss function.
S4: With the trained event detection model, compute similarities between the event representation matrix $E$ and the text representation vectors in $\bar{D}$, complete event detection and event summarization, and display the events. The specific sub-process is as follows:
S41: From the event representation matrix $E$, compute the similarity between each text vector in $\bar{D}$ and each event vector and normalize it:

$$\alpha_{i,j} = \frac{\exp(\bar{d}_i \cdot e_j)}{\sum_{j'=1}^{k} \exp(\bar{d}_i \cdot e_{j'})}$$

where $\alpha_{i,j}$ denotes the similarity between $\bar{d}_i$ and $e_j$;
S42: Realize event detection from the similarities $\alpha$; specifically, cluster the texts according to $\alpha$, each category representing one event, e.g. the $i$-th event $S_i$ can be expressed as $S_i = \{d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D|\}$, where $r$ is the number of the text and $i$ the event number, $1 \le i \le k$.
S43: according to each event set SiThe text with the maximum similarity is selected as the abstract of the event content.
The event detection and summarization method based on a pre-trained language model can mine the key events within a topic and clarify the topic's development context. The invention vectorizes the text with the BERT model to mine the semantic and structural information of the input, completes event detection and summarization with the subsequent convolutional network and memory network, and improves both the detection and the representation of events. The model is unsupervised and requires no additional data annotation.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a flowchart of an algorithm for vectorizing text using the BERT model;
FIG. 3 is a schematic diagram of a BERT model;
FIG. 4 is a flow diagram of an event detection model;
FIG. 5 is a schematic diagram of an event detection model;
FIG. 6 is a flowchart of event detection and summary.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such examples, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
As described above, the event detection and summarization method based on the pre-trained language model provided by the invention can improve the accuracy and recall rate of event detection and the representation effect of event content.
As shown in fig. 1, the embodiment of the present invention takes a collected microblog data set as an example. The data set contains 21300 microblog texts on an aviation accident topic, denoted D, and 10000 screened popular comment texts corresponding to those texts, denoted C. As shown in table 1, two microblog texts are selected for display; the comment texts are similar in form to the microblog texts.
TABLE 1 Microblog text examples [table image not reproduced]
S1: preprocessing an input microblog text set D and a comment set C, wherein the specific sub-processes are as follows:
s11: text filtering
The microblog text contains a wide range of contents and various information, unnecessary information in D and C is deleted by using a regular expression, such as the text of @ other users contained in the text sample 1, data with the type of "@ xx" is identified and deleted by using the regular expression "(;
s12: text word segmentation
And performing word segmentation on the text by using a jieba word segmentation tool, selecting an accurate mode in the word segmentation mode, then counting the word frequency of all words, deleting the words with the word frequency less than 5, and finally reserving 4310 words to obtain the word sequence sets D 'and C' of the preprocessed input text. The text samples of table 1 were processed as shown in table 2.
TABLE 2 word sequences after preprocessing
Figure BDA0002578877740000051
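As a concrete illustration of S11 and S12, the sketch below cleans and segments texts with jieba; the regular expressions and helper names are assumptions, since the patent does not disclose its exact patterns.

```python
import re
from collections import Counter

import jieba  # the word segmentation tool named in the embodiment

def clean_text(text: str) -> str:
    """S11: delete short links and "@xx" mentions with regular expressions.
    These patterns are illustrative assumptions."""
    text = re.sub(r"https?://\S+", "", text)  # short links
    text = re.sub(r"@\S+", "", text)          # "@xx" user mentions
    return text.strip()

def tokenize_corpus(texts, min_freq=5):
    """S12: segment with jieba in accurate mode (cut_all=False) and delete
    words whose corpus frequency is below min_freq (5 in the embodiment)."""
    segmented = [list(jieba.cut(clean_text(t), cut_all=False)) for t in texts]
    freq = Counter(w for seq in segmented for w in seq)
    return [[w for w in seq if freq[w] >= min_freq] for seq in segmented]

# D_prime = tokenize_corpus(microblog_texts); C_prime = tokenize_corpus(comment_texts)
```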
Referring to figs. 2 and 3, S2 of the invention vectorizes the input text with the pre-trained BERT model. This stage takes the processed word-sequence sets D′ and C′ as input and, after processing by the BERT model, obtains the text vector sets $\bar{D}$ and $\bar{C}$. The specific sub-process is as follows:
S21: Determining the text length
Neural network training requires a fixed input length. All microblog lengths are sorted and the third quartile is taken as the maximum length, which for the data set used in this example gives n = 51;
S22: Fixing the text length
Model training requires a uniform input size: texts longer than n are truncated to their first n words, texts shorter than n are padded with the special "PAD" symbol, and the markers "START" and "END" are added at the beginning and end of each text; the updated texts are recorded as D′ and C′;
S23: BERT model vectorization
Input D′ and C′ into the pre-trained BERT model and take BERT's [CLS] output vector as the text representation (the model structure is shown in fig. 3). After vectorization, the new text vector sets $\bar{D}$ and $\bar{C}$ are obtained, with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 the number of microblog texts, and 10000 the number of comment texts;
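A minimal sketch of this vectorization step using the Hugging Face transformers library; the bert-base-chinese checkpoint and function names are assumptions, the patent specifying only a pre-trained BERT whose [CLS] output is taken as the 768-dimensional text vector.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only says "pre-trained BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def encode(texts, n=51):
    """Encode texts to their [CLS] vectors (S21-S23).

    n = 51 is the third-quartile length from the embodiment; here the
    tokenizer's [CLS]/[SEP]/[PAD] tokens stand in for the START/END/PAD
    marks of S22."""
    batch = tokenizer(list(texts), padding="max_length", truncation=True,
                      max_length=n, return_tensors="pt")
    out = bert(**batch)
    return out.last_hidden_state[:, 0, :]  # (batch, 768) [CLS] vectors

# d_bar = encode(microblog_texts); c_bar = encode(comment_texts)
```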
referring to fig. 4 and 5, S3 of the present invention learns the event representation vector using neural network training, because the comments often relate to key events in the topic, and therefore the information of the key events in the comments is extracted using a convolutional network
Figure BDA0002578877740000058
And then combining with the learning of a memory network to obtain an event vector matrix E. The specific sub-process is as follows:
S31: Extracting event features with the convolutional network
Key information in the comments is extracted with a convolutional network. Three groups of convolutions are set up, each with 128 convolution kernels, the corresponding kernel widths being 3, 4 and 5; the kernel length of 768 matches the output dimension of the BERT model, and the sliding stride is 1. The outputs of the three groups are denoted $V_1, V_2, V_3$, with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively. Then, from the convolution outputs, $V_1, V_2, V_3$ are max-pooled by row to obtain three 128-dimensional vectors, which are spliced into one 384-dimensional vector; this finally extracts the feature vector $\bar{v}$ containing the global key-event information.
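In PyTorch this convolutional stage can be sketched as follows; the class and argument names are assumptions, while the kernel widths (3, 4, 5), the 128 channels per group, the stride of 1 and the 384-dimensional pooled output follow the embodiment.

```python
import torch
import torch.nn as nn

class KeyEventCNN(nn.Module):
    """S31 sketch: three convolution groups slide over the sequence of 768-d
    comment vectors; row-wise max pooling and concatenation give the 384-d
    global key-event feature."""
    def __init__(self, dim=768, channels=128, widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, channels, kernel_size=h, stride=1) for h in widths)

    def forward(self, c_bar):              # c_bar: (batch, |C|, 768)
        x = c_bar.transpose(1, 2)          # (batch, 768, |C|) for Conv1d
        # each conv output: (batch, 128, |C|-h+1), e.g. 128 x 9998 for h=3
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)    # v_bar: (batch, 384)

# v_bar = KeyEventCNN()(c_bar.unsqueeze(0)).squeeze(0)  # from the 10000 x 768 set
```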
S32: memory network fetch event representation matrix E
Obtaining global key event characteristics
Figure BDA0002578877740000061
Then, the number k of events to be mined needs to be determined in advance. Firstly, to
Figure BDA0002578877740000062
Carrying out k-time linear change to obtain event features of different semantic spaces
Figure BDA0002578877740000063
Wherein
Figure BDA0002578877740000064
WiParameters to be learned for the event detection model, WiDimension 384 x 384. Then, as shown in FIG. 5, will
Figure BDA0002578877740000065
As a query vector for the memory network,
Figure BDA0002578877740000066
as external information, e is calculated according to the following formulaiThe vectorization of (c) represents:
Figure BDA0002578877740000067
Figure BDA0002578877740000068
Figure BDA0002578877740000069
Figure BDA00025788777400000610
Figure BDA00025788777400000611
wherein alpha ismThe m-th text is displayed with attention to the characteristics of the event, and the input is obtained according to the attention
Figure BDA00025788777400000612
The key event information t contained iniFinally, a gate control value beta of the information is calculated by using a full-connection network MLP, the full-connection network comprises three layers, the first two layers comprise 128 neurons, the last neuron comprises one neuron, and an event vector e is obtained according to betai,WK,WV,WEAll the events are obtained by training an event detection model, and the dimensions are 384 × 768, 768 × 384 and 384 × 768 respectively;
according to the above calculation method, k event vectors can be obtained, and the k event vectors are spliced to obtain a final event representation matrix E ═ E1,e2,…,ek]In this example, we take k to 20, so the dimension of matrix E is 20 × 768;
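Since the formulas above are themselves a reconstruction (the original equations are printed as images), the following PyTorch sketch should be read the same way: the attention over the microblog vectors, the three-layer gate MLP and the mixing of $t_i$ with the query are assumptions consistent with the stated roles and dimensions of $W_i$, $W_K$, $W_V$ and $W_E$.

```python
import torch
import torch.nn as nn

class EventMemory(nn.Module):
    """S32 sketch: k queries v_i = W_i v attend over the microblog vectors
    (the external memory); a gate beta mixes the attended information t_i
    with the query before projecting to the 768-d event vector e_i."""
    def __init__(self, k=20, q_dim=384, d_dim=768):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, q_dim, q_dim))  # W_i: 384 x 384
        self.W_K = nn.Linear(d_dim, q_dim, bias=False)       # key map (384 x 768 role)
        self.W_V = nn.Linear(d_dim, q_dim, bias=False)       # value map
        self.W_E = nn.Linear(q_dim, d_dim, bias=False)       # event projection
        self.gate = nn.Sequential(                           # 3-layer MLP for beta
            nn.Linear(2 * q_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, v_bar, d_bar):       # v_bar: (384,), d_bar: (|D|, 768)
        events = []
        for W_i in self.W:                 # one query per semantic space
            q = W_i @ v_bar                                   # v_bar_i
            alpha = torch.softmax(self.W_K(d_bar) @ q, dim=0) # attention alpha_m
            t = alpha @ self.W_V(d_bar)                       # t_i: (384,)
            beta = self.gate(torch.cat([q, t]))               # gate value
            events.append(self.W_E(beta * t + (1 - beta) * q))
        return torch.stack(events)         # E: (k, 768)
```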
S33: The decoder restores the text set D′
Finally, a decoder restores the input D′. The decoder is based on the GRU structure, its output dimension of 4310 matches the total number of words, and the word output at the current time step is used as the input word of the next time step. The neural network needs a loss function to learn, shown below:

$$L_a = -\sum_{w} q(w) \log p(w)$$

$$L_o = \| E E^{\top} - I \|$$

$$L = 0.5 \, L_a + 0.5 \, L_o$$

where $L$ is the final loss function, composed of the two parts $L_a$ and $L_o$: $L_a$ is the loss function of the decoder, ensuring the decoder's accuracy, with $w$ a word generated by the decoder, $p(w)$ the probability of the word generated by the decoder, and $q(w)$ the actual word probability in the text; $L_o$ is a loss function on the event vectors that ensures that different events are uncorrelated.
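The combined loss can be sketched directly; row-normalizing E before forming E E^T - I is an assumption, since the patent does not state how the norm is taken.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, target_ids, E):
    """S33 sketch: L = 0.5*L_a + 0.5*L_o.
    logits: (seq_len, 4310) decoder outputs; target_ids: (seq_len,) true words;
    E: (k, 768) event matrix."""
    L_a = F.cross_entropy(logits, target_ids)  # decoder reconstruction loss
    E_n = F.normalize(E, dim=1)                # assumed row normalization
    L_o = torch.norm(E_n @ E_n.T - torch.eye(E.size(0)))  # event independence
    return 0.5 * L_a + 0.5 * L_o
```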
Referring to fig. 6, S4 of the invention displays the events. First, similarities are computed between the event matrix E learned by the event detection model and the microblog vector set $\bar{D}$; each microblog is assigned to the event cluster of maximum similarity, completing event detection; finally, for each event cluster, the microblog with the maximum similarity is selected as a summary displaying the content of the event.
The specific sub-process is as follows:
S41: Similarity calculation
From the learned event matrix E, the similarity between each microblog and each event vector is computed; it is the basis of the subsequent event detection and event summarization. The similarity is computed as

$$s_{i,j} = \bar{d}_i \cdot e_j, \qquad \alpha_{i,j} = \frac{\exp(s_{i,j})}{\sum_{j'=1}^{k} \exp(s_{i,j'})}$$

where $\alpha_{i,j}$ denotes the similarity between microblog $d_i$ and event $e_j$;
S42: Event detection
The greater the similarity, the higher the probability that the microblog belongs to the event, and each microblog is assigned to the event cluster of maximum probability. The event set $S_i$ is calculated as shown below; in this example 20 events are to be mined, so i ranges from 1 to 20:

$$S_i = \{\, d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D| \,\}$$

where r is the number of the text.
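A sketch of S41 and S42 given the microblog vectors and the event matrix from the previous steps; the softmax normalization matches the similarity formula as reconstructed above.

```python
import torch

def detect_events(d_bar, E):
    """S41-S42 sketch: alpha[r, j] is the normalized similarity between
    microblog r and event j; each microblog joins its argmax event."""
    alpha = torch.softmax(d_bar @ E.T, dim=1)  # (|D|, k)
    assignment = alpha.argmax(dim=1)           # event index per microblog
    return alpha, assignment
```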
Finally, microblog sample 1 in Table 1 is assigned to event 15, whose texts concern the black box being found and delivered to France for analysis, and sample 2 is assigned to event 8, whose texts concern the drop in Boeing's share price.
S43: event summarization
For each set SiAnd a representative microblog is selected as an abstract of the event, so that the expression effect of the content of the event is improved. Remember hiAs a set of events SiThe abstract sentence of (1) is labeled, then hiThe calculation formula of (a) is as follows:
hi=argmaxkk,i),s.t.dk∈Si
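Continuing the sketch, the summary of each event cluster is its member microblog with the highest similarity to the event vector; the function and variable names are assumptions.

```python
def summarize_events(alpha, assignment, texts, k=20):
    """S43 sketch: h_i = argmax over d_k in S_i of alpha[k, i]."""
    summaries = {}
    for i in range(k):
        members = (assignment == i).nonzero(as_tuple=True)[0]
        if len(members) == 0:
            continue                      # event i attracted no text
        best = members[alpha[members, i].argmax()]
        summaries[i] = texts[int(best)]
    return summaries
```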
since there are many events, it is not convenient to show them all, so the above mentioned summaries of event 8 and event 15 are chosen to be shown, as shown in table 3.
Table 3 event summary example
Figure BDA0002578877740000081
In summary, the invention provides an event detection and summarization method based on a pre-trained language model. The above description is intended only to help understand the method of the invention and its core idea; meanwhile, for a person skilled in the art, the specific embodiments and scope of application may change according to the idea of the invention, so the content of this specification should not be construed as a limitation of the invention. Any modification, equivalent replacement or improvement made without departing from the spirit and scope of the invention shall fall within its protection scope, and the appended claims are intended to cover all such changes and modifications that fall within their scope and bounds or the equivalents thereof.

Claims (8)

1. An event detection and summarization method based on a pre-trained language model, characterized by comprising the following steps:
S1: preprocessing the input social media text, deleting unneeded information in the text and segmenting the text into words, wherein the specific sub-process comprises the following steps:
S11: denoting the input social media text set as $D = \{d_1, d_2, \dots, d_{|D|}\}$, acquiring the corresponding comments for each text in D and recording the obtained comment text set as $C = \{c_1, c_2, \dots, c_{|C|}\}$, there being $|D|$ social media texts and $|C|$ comment texts in total, and deleting short links and irrelevant "@user" information from the texts using regular expressions;
S12: segmenting the texts with a word segmentation tool and deleting low-frequency words to obtain the corresponding word-sequence sets $D'$ and $C'$:

$$D' = \{(w_{1,1}, \dots, w_{1,m_1}), \dots, (w_{|D|,1}, \dots, w_{|D|,m_{|D|}})\}$$

$$C' = \{(w_{1,1}, \dots, w_{1,k_1}), \dots, (w_{|C|,1}, \dots, w_{|C|,k_{|C|}})\}$$

where $w$ denotes a word in a sentence; $m$ and $k$ denote the lengths of the texts in $D'$ and $C'$ respectively, their subscripts denoting the text number, $m_1$ and $k_1$ being the lengths of the first texts in $D'$ and $C'$; and $w_{i,j}$ denotes the $j$-th word of the $i$-th text, so that $w_{1,m_1}$ and $w_{1,k_1}$ are the $m_1$-th and $k_1$-th words, i.e. the last words of the first texts in $D'$ and $C'$;
S2: using the BERT model as an encoder to vectorize each word $w$ in the input $D'$ and $C'$ and mine the semantic and structural information of the text, wherein the specific sub-process is as follows:
S21: determining the length $n$ of the texts in the $D'$ and $C'$ sets;
S22: adding marker symbols at the beginning and end of all texts in $D'$ and $C'$, keeping only the first $n$ words of texts longer than $n$ and padding texts shorter than $n$ with a supplementary mark so that they meet the length $n$, obtaining the updated $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,n})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,n})\};$$
S23: mining the semantic and structural information of the texts with the BERT model to obtain vectorized representations of the input $D'$ and $C'$: each text in $D'$ yields a corresponding vector $\bar{d}_i$, giving the vector set $\bar{D} = \{\bar{d}_1, \bar{d}_2, \dots, \bar{d}_{|D|}\}$, the subscript being the number of the text; likewise each text in $C'$ yields a corresponding vector $\bar{c}_i$, giving the vector set $\bar{C} = \{\bar{c}_1, \bar{c}_2, \dots, \bar{c}_{|C|}\}$;
S3: based on the vectorized text sets $\bar{D}$ and $\bar{C}$, the event detection model obtains the event vectors using a convolutional neural network combined with a memory network, trained with a loss function, wherein the specific sub-process is as follows:
S31: inputting $\bar{C}$ into the convolutional neural network to obtain the global key-event feature $\bar{v}$, the convolution formula being

$$v_i = f(w \cdot \bar{c}_{i:i+h-1} + b)$$

where $w$ is the weight matrix, $h$ the size of the convolution kernel, $b$ the bias and $f$ the activation function, $v_i$ being the event feature obtained by the convolution, max-pooling over the $v_i$ yielding $\bar{v}$;
S32: taking the text representation vectors in $\bar{D}$ as external information and, combined with the key-event information $\bar{v}$ in the comments, inputting them into the memory network, learning the event representation matrix $E = [e_1, e_2, \dots, e_k]$, where $e$ is an event vector and $k$ a predetermined hyper-parameter denoting that $k$ events are to be mined;
S33: finally, inputting the text vectors of $\bar{D}$ into a decoder with the GRU as its basic unit, restoring the input sequence $D'$, and completing the training of the event detection model with the associated preset loss function;
s4: representing matrices E and E based on events according to the trained event detection model
Figure FDA0002578877730000025
The text in (1) represents vectors to calculate similarity, completes event detection and event summarization, and displays events, and the specific sub-processes are as follows:
S41: from the event representation matrix $E$, computing the similarity between each text vector in $\bar{D}$ and each event vector and normalizing it:

$$\alpha_{i,j} = \frac{\exp(\bar{d}_i \cdot e_j)}{\sum_{j'=1}^{k} \exp(\bar{d}_i \cdot e_{j'})}$$

where $\alpha_{i,j}$ denotes the similarity between the social media text $\bar{d}_i$ and $e_j$;
S42: realizing event detection from the similarities $\alpha$, specifically clustering the texts according to $\alpha$, each category representing one event, e.g. the $i$-th event $S_i$ can be expressed as $S_i = \{d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D|\}$, where $r$ is the number of the social media text and $i$ the event number, $1 \le i \le k$;
S43: according to each event set SiThe text with the maximum similarity is selected as the abstract of the event content.
2. The event detection and summarization method based on a pre-trained language model as claimed in claim 1, wherein in step S11, |D| = 21300 and |C| = 10000.
3. The event detection and summarization method based on a pre-trained language model as claimed in claim 2, wherein the length n of the texts in the D′ and C′ sets determined in step S21 is 51; step S22 specifically truncates texts longer than n to their first n words, pads texts shorter than n with the special "PAD" symbol, and adds the markers "START" and "END" at the beginning and end of each text, the updated texts being recorded as D′ and C′; and step S23 takes the [CLS] output of BERT as the text representation vector, vectorization of the texts yielding the new text vector sets $\bar{D}$ and $\bar{C}$ with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 the number of social media texts and 10000 the number of comment texts.
4. The event detection and summarization method based on a pre-trained language model as claimed in claim 3, wherein step S31 specifically extracts the key information in the comments with a convolutional network: three groups of convolutions are set, each group having 128 convolution kernels with corresponding widths of 3, 4 and 5, the kernel length of 768 matching the output dimension of the BERT model and the sliding stride being 1; the outputs of the three groups are recorded as $V_1, V_2, V_3$ with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively; then, from the convolution outputs, $V_1, V_2, V_3$ are max-pooled by row to obtain three 128-dimensional vectors, which are spliced into one 384-dimensional vector, finally extracting the global key-event feature $\bar{v}$.
5. The event detection and summarization method based on a pre-trained language model as claimed in claim 4, wherein in step S32, after the global key-event feature $\bar{v}$ is obtained, the number k of events to mine is determined in advance; first, k linear transformations are applied to $\bar{v}$ to obtain the event features of different semantic spaces, $\bar{v}_i = W_i \bar{v}$, where the $W_i$ are parameters to be learned by the event detection model, of dimension 384 × 384; then $\bar{v}_i$ is taken as the query vector of the memory network and $\bar{D}$ as the external information, and the vectorized representation of $e_i$ is computed as

$$g_m = \bar{v}_i^{\top} (W_K \bar{d}_m)$$

$$\alpha_m = \frac{\exp(g_m)}{\sum_{m'=1}^{|D|} \exp(g_{m'})}$$

$$t_i = \sum_{m=1}^{|D|} \alpha_m \, (W_V^{\top} \bar{d}_m)$$

$$\beta = \mathrm{MLP}([\bar{v}_i ; t_i])$$

$$e_i = W_E^{\top} \big( \beta \, t_i + (1 - \beta) \, \bar{v}_i \big)$$

where $\alpha_m$ is the attention of the m-th text to the event feature, from which the key-event information $t_i$ contained in the input $\bar{D}$ is obtained; finally a fully connected network MLP computes the gate value $\beta$ of this information, the network having three layers, the first two with 128 neurons each and the last with a single neuron, and the event vector $e_i$ is obtained according to $\beta$; $W_K$, $W_V$ and $W_E$ are all obtained by training the event detection model, with dimensions 384 × 768, 768 × 384 and 384 × 768 respectively; following this calculation, k event vectors are obtained and spliced into the final event representation matrix $E = [e_1, e_2, \dots, e_k]$, with k = 20, so the dimension of the matrix E is 20 × 768.
6. The event detection and summarization method based on a pre-trained language model as claimed in claim 5, wherein step S33 specifically restores the input D′ with a decoder based on the GRU structure, whose output dimension of 4310 matches the total number of words, the word output at the current time step being used as the input word of the next time step; the event detection model needs a loss function to learn, as follows:

$$L_a = -\sum_{w} q(w) \log p(w)$$

$$L_o = \| E E^{\top} - I \|$$

$$L = 0.5 \, L_a + 0.5 \, L_o$$

where $L$ is the final loss function, composed of the two parts $L_a$ and $L_o$: $L_a$ is the loss function of the decoder, ensuring its accuracy, with $w$ a decoder-generated word, $p(w)$ the probability of the word generated by the decoder and $q(w)$ the actual word probability in the text; $L_o$ is a loss function on the event vectors that ensures that different events are uncorrelated.
7. The event detection and summarization method based on a pre-trained language model according to claim 6, wherein in step S42 the greater the similarity, the higher the probability of the event to which the social media text belongs; each social media text is assigned to the event cluster of maximum event probability, and 20 events are set to be mined, so i ranges from 1 to 20.
8. The event detection and summarization method based on a pre-trained language model as claimed in claim 7, wherein step S43 specifically selects, for each set $S_i$, a representative social media text as the summary of the event, improving the presentation of the event content; with $h_i$ recorded as the label of the summary sentence of event set $S_i$, $h_i$ is calculated as

$$h_i = \arg\max_k(\alpha_{k,i}) \quad \text{s.t. } d_k \in S_i$$
CN202010661898.2A 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model Active CN111966917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661898.2A CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661898.2A CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Publications (2)

Publication Number Publication Date
CN111966917A true CN111966917A (en) 2020-11-20
CN111966917B CN111966917B (en) 2022-05-03

Family

ID=73361680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661898.2A Active CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Country Status (1)

Country Link
CN (1) CN111966917B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116054A1 (en) * 2013-12-02 2017-04-27 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
CN110147452A (en) * 2019-05-17 2019-08-20 北京理工大学 A kind of coarseness sentiment analysis method based on level BERT neural network
CN111026861A (en) * 2019-12-10 2020-04-17 腾讯科技(深圳)有限公司 Text abstract generation method, text abstract training method, text abstract generation device, text abstract training device, text abstract equipment and text abstract training medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANDAN CHEN et al.: "An Encoder-Memory-Decoder Framework for Sub-Event Detection in Social Media", 2018 Association for Computing Machinery *
施喆尔 et al.: "Event detection based on language models and recurrent convolutional neural networks" [基于语言模型及循环卷积神经网络的事件检测], Journal of Xiamen University (Natural Science Edition) [厦门大学学报(自然科学版)] *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597366B (en) * 2020-11-25 2022-03-18 中国电子科技网络信息安全有限公司 Encoder-Decoder-based event extraction method
CN112597366A (en) * 2020-11-25 2021-04-02 中国电子科技网络信息安全有限公司 Encoder-Decoder-based event extraction method
CN112836486A (en) * 2020-12-09 2021-05-25 天津大学 Group hidden-in-field analysis method based on word vectors and Bert
CN112836486B (en) * 2020-12-09 2022-06-03 天津大学 Group hidden-in-place analysis method based on word vectors and Bert
CN112528650A (en) * 2020-12-18 2021-03-19 恩亿科(北京)数据科技有限公司 Method, system and computer equipment for pretraining Bert model
CN112528650B (en) * 2020-12-18 2024-04-02 恩亿科(北京)数据科技有限公司 Bert model pre-training method, system and computer equipment
CN112597269A (en) * 2020-12-25 2021-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Stream data event text topic and detection system
CN112699675A (en) * 2020-12-30 2021-04-23 平安科技(深圳)有限公司 Text processing method, device and equipment and computer readable storage medium
CN112699675B (en) * 2020-12-30 2023-09-12 平安科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium
CN112949318A (en) * 2021-03-03 2021-06-11 电子科技大学 Text position detection method based on text and user representation learning
CN112949318B (en) * 2021-03-03 2022-03-25 电子科技大学 Text position detection method based on text and user representation learning
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN112966115A (en) * 2021-05-18 2021-06-15 东南大学 Active learning event extraction method based on memory loss prediction and delay training
CN113434662B (en) * 2021-06-24 2022-06-24 平安国际智慧城市科技股份有限公司 Text abstract generating method, device, equipment and storage medium
CN113434662A (en) * 2021-06-24 2021-09-24 平安国际智慧城市科技股份有限公司 Text abstract generation method, device, equipment and storage medium
CN113434632A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Text completion method, device, equipment and storage medium based on language model
CN113806528A (en) * 2021-07-07 2021-12-17 哈尔滨工业大学(威海) Topic detection method and device based on BERT model and storage medium
CN113688230A (en) * 2021-07-21 2021-11-23 武汉众智数字技术有限公司 Text abstract generation method and system
CN114357022A (en) * 2021-12-23 2022-04-15 北京中视广信科技有限公司 Media content association mining method based on event relation discovery
CN114357022B (en) * 2021-12-23 2024-05-07 北京中视广信科技有限公司 Media content association mining method based on event relation discovery
CN116050383A (en) * 2023-03-29 2023-05-02 珠海金智维信息科技有限公司 Financial product sales link flyer call detection method and system
CN117670571A (en) * 2024-01-30 2024-03-08 昆明理工大学 Incremental social media event detection method based on heterogeneous message graph relation embedding
CN117670571B (en) * 2024-01-30 2024-04-19 昆明理工大学 Incremental social media event detection method based on heterogeneous message graph relation embedding

Also Published As

Publication number Publication date
CN111966917B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN108733742B (en) Global normalized reader system and method
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN107808011B (en) Information classification extraction method and device, computer equipment and storage medium
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN111291188B (en) Intelligent information extraction method and system
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
CN110263325A (en) Chinese automatic word-cut
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN112287197A (en) Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN113987175A (en) Text multi-label classification method based on enhanced representation of medical topic word list
CN110674293B (en) Text classification method based on semantic migration
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Xu et al. Research on Depression Tendency Detection Based on Image and Text Fusion
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN116049349A (en) Small sample intention recognition method based on multi-level attention and hierarchical category characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant