CN111966917A - Event detection and summarization method based on pre-training language model - Google Patents

Event detection and summarization method based on pre-training language model

Info

Publication number
CN111966917A
CN111966917A
Authority
CN
China
Prior art keywords
text
event
events
event detection
vector
Prior art date
Legal status
Granted
Application number
CN202010661898.2A
Other languages
Chinese (zh)
Other versions
CN111966917B (en)
Inventor
卢国明
段贵多
秦科
罗光春
顾坚彬
李康康
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010661898.2A
Publication of CN111966917A
Application granted
Publication of CN111966917B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 16/9536: Search customisation based on social or collaborative filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The invention discloses an event detection and summarization method based on a pre-trained language model. Oriented to social media platforms, the method detects the key events within hot topics, improves the effect of event detection, and uses event summaries to improve the presentation of event content. The method comprises S1: preprocessing the text; S2: vectorizing the text; S3: training the event detection model; S4: displaying the mined events. The invention uses a pre-trained language model to mine the semantic and structural information of the input text, improving the text representation, and completes event detection and summarization with subsequent neural networks, improving the accuracy and recall of event detection and the semantic representation of event content.

Description

Event detection and summarization method based on pre-training language model
Technical Field
The invention relates to the field of data mining and natural language processing, in particular to an event detection and summarization method based on a pre-training language model.
Background
With the development of the internet, social media has become part of daily life. People discuss the hot topics of the day on social media platforms and follow social developments there, and these platforms have become an important source of information for the public. As application demands in internet public-opinion monitoring and information security continue to grow, it is increasingly important to understand the content behind a topic at a finer granularity, at a deeper level, and from more angles.
A topic is composed of a set of related events, and a series of related events drives the development and change of the topic. Faced with massive amounts of information, extracting the events contained in a hot topic helps display the topic's development process and lets people understand its context. Effectively mining the topic events contained in text has therefore become a serious challenge.
Event detection is essentially a clustering process that groups texts into clusters, each cluster representing one event. Event detection algorithms can be broadly divided into two categories. Document-based methods detect events by clustering documents according to semantic distance: for example, computing text similarity with a TF-IDF vector space model and then clustering the text stream with the SinglePass algorithm to detect occurring events. Feature-based methods study word distributions and discover event keywords: for example, mining event keywords with topic models and related refinements while soft-clustering documents according to the probability of the event each document belongs to. A minimal illustration of the document-based approach follows.
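For concreteness, here is a small sketch of the document-based baseline: TF-IDF vectors compared by cosine similarity, with a SinglePass-style assignment loop. The toy documents and the similarity threshold are illustrative assumptions, not taken from the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents (illustrative only); real input would be segmented social media text.
docs = ["black box found sent to france for analysis",
        "boeing share price fell sharply after the accident",
        "the black box will be analysed in france"]
tfidf = TfidfVectorizer().fit_transform(docs)   # TF-IDF vector space model
sim = cosine_similarity(tfidf)                  # word-level similarity only

# SinglePass-style clustering: join the first cluster whose seed document is
# similar enough, otherwise open a new event cluster.
threshold, clusters = 0.2, []
for i in range(len(docs)):
    for cluster in clusters:
        if sim[i, cluster[0]] >= threshold:
            cluster.append(i)
            break
    else:
        clusters.append([i])
print(clusters)  # e.g. [[0, 2], [1]]: docs 0 and 2 share words, doc 1 stands alone
```

Because the comparison is purely lexical, documents 0 and 2 cluster only because they repeat the same words; a paraphrase using synonyms would be missed, which is exactly the weakness discussed next.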
Both kinds of method stay at the word level when processing text and cannot mine the deeper information of a document, so the event detection effect is poor. Document-based methods rely on word-level similarity comparison, cannot handle near-synonyms and synonyms, underuse the topical semantic information implicit in a document, and fail to capture its lexical semantics. Feature-based methods depend on feature selection; most social media texts are short, so word co-occurrence is sparse, which hurts the topic model. In addition, both kinds of method represent event content with keywords, whose semantics are vague and easily ambiguous.
Disclosure of Invention
At present, pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) effectively improve the results of related tasks in natural language processing, and neural networks can also effectively model text and process its semantic and structural information. Therefore, to address the problems of current methods, the invention uses a pre-trained language model to mine the semantic information of documents and completes event detection and event summarization with subsequent neural networks, improving both the detection and the presentation of events.
The invention provides an event detection and summarization method based on a pre-trained language model, which improves event detection and summarization by mining the semantic and structural information of text. The method processes the input text with a pre-trained language model to capture its semantic and structural information, clusters the text with subsequent neural networks to detect the events within a topic, and summarizes the events at the same time. The invention achieves better accuracy and recall on the event detection task, while the summaries improve the presentation of event content.
The invention comprises the following steps:
S1: Preprocess the input social media text, deleting unneeded information and segmenting the text into words.
The specific sub-process is as follows:
S11: Denote the input social media text set as $D = \{d_1, d_2, \dots, d_{|D|}\}$; for each text in $D$, obtain its comments, and denote the resulting comment text set as $C = \{c_1, c_2, \dots, c_{|C|}\}$, giving $|D|$ social media texts and $|C|$ comment texts in total; use regular expressions to delete short links and irrelevant "@user" mentions from the texts;
S12: Use a word segmentation tool to segment each text into words and delete low-frequency words, obtaining the corresponding word-sequence sets $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,m_1}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,m_{|D|}})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,k_1}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,k_{|C|}})\}$$

where $w$ denotes a word in a sentence; $m$ and $k$ denote the lengths of the texts in $D'$ and $C'$ respectively, their subscripts giving the text number (e.g., $m_1$ and $k_1$ are the lengths of the first texts in $D'$ and $C'$); and $w_{i,j}$ denotes the $j$-th word of the $i$-th text, so $w_{1,m_1}$ and $w_{1,k_1}$ denote the $m_1$-th and $k_1$-th words, i.e. the last words, of the first texts in $D'$ and $C'$ respectively.
S2: and (3) carrying out vectorization representation on each word w in the input D 'and C' by using a BERT model as an encoder, and mining semantic and structural information of the text. The specific sub-process is as follows:
S21: Determine the length $n$ of the texts in the $D'$ and $C'$ sets;
S22: Add marker symbols at the beginning and end of every text in $D'$ and $C'$; texts longer than $n$ keep only their first $n$ words, and texts shorter than $n$ are padded with a supplementary mark so that all texts have length $n$, giving the updated $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,n})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,n})\}$$
S23: Use the BERT model to mine the semantic and structural information of the texts and obtain vectorized representations of the input $D'$ and $C'$: each text in $D'$ yields a corresponding vector $\bar{d}_i$, giving the vector set $\bar{D} = \{\bar{d}_1, \bar{d}_2, \dots, \bar{d}_{|D|}\}$, where the subscript is the number of the text; likewise, each text in $C'$ yields a corresponding vector $\bar{c}_i$, giving the vector set $\bar{C} = \{\bar{c}_1, \bar{c}_2, \dots, \bar{c}_{|C|}\}$.
S3: Based on the vectorized text sets $\bar{D}$ and $\bar{C}$, obtain the event vectors using a convolutional neural network combined with a memory network, and complete the training of the model. The specific sub-process is as follows:
S31: Input $\bar{C}$ into the convolutional neural network to obtain the global key-event feature $\bar{v}$. The convolution formula is

$$v_i = f(w \cdot \bar{c}_{i:i+h-1} + b)$$

where $w$ is the weight matrix, $h$ the size of the convolution kernel, $b$ the bias and $f$ the activation function; $v_i$ is the event feature obtained by the convolution, and max-pooling over $v = [v_1, v_2, \dots]$ yields $\bar{v}$;
S32: Take the text vectors in $\bar{D}$ as external information and, combined with the key-event information $\bar{v}$ from the comments, input them into the memory network; learning yields the event representation matrix $E = [e_1, e_2, \dots, e_k]$, where $e$ is an event vector and $k$ is a hyper-parameter fixed in advance, meaning that $k$ events are to be mined;
S33: Finally, input the text vectors of $\bar{D}$ into a decoder with the GRU as its basic unit to restore the input sequence $D'$, and complete the training of the event detection model with the associated preset loss function.
S4: With the trained event detection model, compute similarities between the event representation matrix $E$ and the text representation vectors in $\bar{D}$, complete event detection and event summarization, and display the events. The specific sub-process is as follows:
S41: From the event representation matrix $E$, compute the similarity between each text vector in $\bar{D}$ and each event vector and normalize it:

$$\alpha_{i,j} = \frac{\exp(\bar{d}_i \cdot e_j)}{\sum_{j'=1}^{k} \exp(\bar{d}_i \cdot e_{j'})}$$

where $\alpha_{i,j}$ denotes the similarity between $\bar{d}_i$ and $e_j$;
S42: Realize event detection from the similarities $\alpha$; specifically, cluster the texts according to $\alpha$, each category representing one event, e.g. the $i$-th event $S_i$ can be expressed as $S_i = \{d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D|\}$, where $r$ is the number of the text and $i$ the event number, $1 \le i \le k$.
S43: according to each event set SiThe text with the maximum similarity is selected as the abstract of the event content.
The event detection and summarization method based on a pre-trained language model can mine the key events within a topic and clarify the topic's development context. The invention vectorizes the text with the BERT model to mine the semantic and structural information of the input, completes event detection and summarization with the subsequent convolutional network and memory network, and improves both the detection and the representation of events. The model is unsupervised and requires no additional data annotation.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a flowchart of an algorithm for vectorizing text using the BERT model;
FIG. 3 is a schematic diagram of a BERT model;
FIG. 4 is a flow diagram of an event detection model;
FIG. 5 is a schematic diagram of an event detection model;
FIG. 6 is a flowchart of event detection and summary.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such examples, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
As described above, the event detection and summarization method based on the pre-trained language model provided by the invention can improve the accuracy and recall rate of event detection and the representation effect of event content.
As shown in fig. 1, the embodiment of the present invention takes a collected microblog data set as an example. The data set contains 21300 microblog texts on an aviation accident topic, denoted D, and 10000 screened popular comment texts corresponding to those texts, denoted C. As shown in table 1, two microblog texts are selected for display; the comment texts are similar in form to the microblog texts.
TABLE 1 Microblog text examples [table image not reproduced]
S1: preprocessing an input microblog text set D and a comment set C, wherein the specific sub-processes are as follows:
s11: text filtering
The microblog text contains a wide range of contents and various information, unnecessary information in D and C is deleted by using a regular expression, such as the text of @ other users contained in the text sample 1, data with the type of "@ xx" is identified and deleted by using the regular expression "(;
s12: text word segmentation
And performing word segmentation on the text by using a jieba word segmentation tool, selecting an accurate mode in the word segmentation mode, then counting the word frequency of all words, deleting the words with the word frequency less than 5, and finally reserving 4310 words to obtain the word sequence sets D 'and C' of the preprocessed input text. The text samples of table 1 were processed as shown in table 2.
TABLE 2 word sequences after preprocessing
Figure BDA0002578877740000051
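As a concrete illustration of S11 and S12, the sketch below cleans and segments texts with jieba; the regular expressions and helper names are assumptions, since the patent does not disclose its exact patterns.

```python
import re
from collections import Counter

import jieba  # the word segmentation tool named in the embodiment

def clean_text(text: str) -> str:
    """S11: delete short links and "@xx" mentions with regular expressions.
    These patterns are illustrative assumptions."""
    text = re.sub(r"https?://\S+", "", text)  # short links
    text = re.sub(r"@\S+", "", text)          # "@xx" user mentions
    return text.strip()

def tokenize_corpus(texts, min_freq=5):
    """S12: segment with jieba in accurate mode (cut_all=False) and delete
    words whose corpus frequency is below min_freq (5 in the embodiment)."""
    segmented = [list(jieba.cut(clean_text(t), cut_all=False)) for t in texts]
    freq = Counter(w for seq in segmented for w in seq)
    return [[w for w in seq if freq[w] >= min_freq] for seq in segmented]

# D_prime = tokenize_corpus(microblog_texts); C_prime = tokenize_corpus(comment_texts)
```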
Referring to figs. 2 and 3, S2 of the invention vectorizes the input text with the pre-trained BERT model. This stage takes the processed word-sequence sets D′ and C′ as input and, after processing by the BERT model, obtains the text vector sets $\bar{D}$ and $\bar{C}$. The specific sub-process is as follows:
S21: Determining the text length
Neural network training requires a fixed input length. All microblog lengths are sorted and the third quartile is taken as the maximum length, which for the data set used in this example gives n = 51;
S22: Fixing the text length
Model training requires a uniform input size: texts longer than n are truncated to their first n words, texts shorter than n are padded with the special "PAD" symbol, and the markers "START" and "END" are added at the beginning and end of each text; the updated texts are recorded as D′ and C′;
S23: BERT model vectorization
Input D′ and C′ into the pre-trained BERT model and take BERT's [CLS] output vector as the text representation (the model structure is shown in fig. 3). After vectorization, the new text vector sets $\bar{D}$ and $\bar{C}$ are obtained, with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 the number of microblog texts, and 10000 the number of comment texts;
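A minimal sketch of this vectorization step using the Hugging Face transformers library; the bert-base-chinese checkpoint and function names are assumptions, the patent specifying only a pre-trained BERT whose [CLS] output is taken as the 768-dimensional text vector.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only says "pre-trained BERT model".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def encode(texts, n=51):
    """Encode texts to their [CLS] vectors (S21-S23).

    n = 51 is the third-quartile length from the embodiment; here the
    tokenizer's [CLS]/[SEP]/[PAD] tokens stand in for the START/END/PAD
    marks of S22."""
    batch = tokenizer(list(texts), padding="max_length", truncation=True,
                      max_length=n, return_tensors="pt")
    out = bert(**batch)
    return out.last_hidden_state[:, 0, :]  # (batch, 768) [CLS] vectors

# d_bar = encode(microblog_texts); c_bar = encode(comment_texts)
```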
referring to fig. 4 and 5, S3 of the present invention learns the event representation vector using neural network training, because the comments often relate to key events in the topic, and therefore the information of the key events in the comments is extracted using a convolutional network
Figure BDA0002578877740000058
And then combining with the learning of a memory network to obtain an event vector matrix E. The specific sub-process is as follows:
S31: Extracting event features with the convolutional network
Key information in the comments is extracted with a convolutional network. Three groups of convolutions are set up, each with 128 convolution kernels, the corresponding kernel widths being 3, 4 and 5; the kernel length of 768 matches the output dimension of the BERT model, and the sliding stride is 1. The outputs of the three groups are denoted $V_1, V_2, V_3$, with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively. Then, from the convolution outputs, $V_1, V_2, V_3$ are max-pooled by row to obtain three 128-dimensional vectors, which are spliced into one 384-dimensional vector; this finally extracts the feature vector $\bar{v}$ containing the global key-event information.
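In PyTorch this convolutional stage can be sketched as follows; the class and argument names are assumptions, while the kernel widths (3, 4, 5), the 128 channels per group, the stride of 1 and the 384-dimensional pooled output follow the embodiment.

```python
import torch
import torch.nn as nn

class KeyEventCNN(nn.Module):
    """S31 sketch: three convolution groups slide over the sequence of 768-d
    comment vectors; row-wise max pooling and concatenation give the 384-d
    global key-event feature."""
    def __init__(self, dim=768, channels=128, widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, channels, kernel_size=h, stride=1) for h in widths)

    def forward(self, c_bar):              # c_bar: (batch, |C|, 768)
        x = c_bar.transpose(1, 2)          # (batch, 768, |C|) for Conv1d
        # each conv output: (batch, 128, |C|-h+1), e.g. 128 x 9998 for h=3
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)    # v_bar: (batch, 384)

# v_bar = KeyEventCNN()(c_bar.unsqueeze(0)).squeeze(0)  # from the 10000 x 768 set
```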
S32: memory network fetch event representation matrix E
Obtaining global key event characteristics
Figure BDA0002578877740000061
Then, the number k of events to be mined needs to be determined in advance. Firstly, to
Figure BDA0002578877740000062
Carrying out k-time linear change to obtain event features of different semantic spaces
Figure BDA0002578877740000063
Wherein
Figure BDA0002578877740000064
WiParameters to be learned for the event detection model, WiDimension 384 x 384. Then, as shown in FIG. 5, will
Figure BDA0002578877740000065
As a query vector for the memory network,
Figure BDA0002578877740000066
as external information, e is calculated according to the following formulaiThe vectorization of (c) represents:
Figure BDA0002578877740000067
Figure BDA0002578877740000068
Figure BDA0002578877740000069
Figure BDA00025788777400000610
Figure BDA00025788777400000611
wherein alpha ismThe m-th text is displayed with attention to the characteristics of the event, and the input is obtained according to the attention
Figure BDA00025788777400000612
The key event information t contained iniFinally, a gate control value beta of the information is calculated by using a full-connection network MLP, the full-connection network comprises three layers, the first two layers comprise 128 neurons, the last neuron comprises one neuron, and an event vector e is obtained according to betai,WK,WV,WEAll the events are obtained by training an event detection model, and the dimensions are 384 × 768, 768 × 384 and 384 × 768 respectively;
according to the above calculation method, k event vectors can be obtained, and the k event vectors are spliced to obtain a final event representation matrix E ═ E1,e2,…,ek]In this example, we take k to 20, so the dimension of matrix E is 20 × 768;
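Since the formulas above are themselves a reconstruction (the original equations are printed as images), the following PyTorch sketch should be read the same way: the attention over the microblog vectors, the three-layer gate MLP and the mixing of $t_i$ with the query are assumptions consistent with the stated roles and dimensions of $W_i$, $W_K$, $W_V$ and $W_E$.

```python
import torch
import torch.nn as nn

class EventMemory(nn.Module):
    """S32 sketch: k queries v_i = W_i v attend over the microblog vectors
    (the external memory); a gate beta mixes the attended information t_i
    with the query before projecting to the 768-d event vector e_i."""
    def __init__(self, k=20, q_dim=384, d_dim=768):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, q_dim, q_dim))  # W_i: 384 x 384
        self.W_K = nn.Linear(d_dim, q_dim, bias=False)       # key map (384 x 768 role)
        self.W_V = nn.Linear(d_dim, q_dim, bias=False)       # value map
        self.W_E = nn.Linear(q_dim, d_dim, bias=False)       # event projection
        self.gate = nn.Sequential(                           # 3-layer MLP for beta
            nn.Linear(2 * q_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, v_bar, d_bar):       # v_bar: (384,), d_bar: (|D|, 768)
        events = []
        for W_i in self.W:                 # one query per semantic space
            q = W_i @ v_bar                                   # v_bar_i
            alpha = torch.softmax(self.W_K(d_bar) @ q, dim=0) # attention alpha_m
            t = alpha @ self.W_V(d_bar)                       # t_i: (384,)
            beta = self.gate(torch.cat([q, t]))               # gate value
            events.append(self.W_E(beta * t + (1 - beta) * q))
        return torch.stack(events)         # E: (k, 768)
```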
S33: The decoder restores the text set D′
Finally, a decoder restores the input D′. The decoder is based on the GRU structure, its output dimension of 4310 matches the total number of words, and the word output at the current time step is used as the input word of the next time step. The neural network needs a loss function to learn, shown below:

$$L_a = -\sum_{w} q(w) \log p(w)$$

$$L_o = \| E E^{\top} - I \|$$

$$L = 0.5 \, L_a + 0.5 \, L_o$$

where $L$ is the final loss function, composed of the two parts $L_a$ and $L_o$: $L_a$ is the loss function of the decoder, ensuring the decoder's accuracy, with $w$ a word generated by the decoder, $p(w)$ the probability of the word generated by the decoder, and $q(w)$ the actual word probability in the text; $L_o$ is a loss function on the event vectors that ensures that different events are uncorrelated.
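The combined loss can be sketched directly; row-normalizing E before forming E E^T - I is an assumption, since the patent does not state how the norm is taken.

```python
import torch
import torch.nn.functional as F

def training_loss(logits, target_ids, E):
    """S33 sketch: L = 0.5*L_a + 0.5*L_o.
    logits: (seq_len, 4310) decoder outputs; target_ids: (seq_len,) true words;
    E: (k, 768) event matrix."""
    L_a = F.cross_entropy(logits, target_ids)  # decoder reconstruction loss
    E_n = F.normalize(E, dim=1)                # assumed row normalization
    L_o = torch.norm(E_n @ E_n.T - torch.eye(E.size(0)))  # event independence
    return 0.5 * L_a + 0.5 * L_o
```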
Referring to fig. 6, S4 of the invention displays the events. First, similarities are computed between the event matrix E learned by the event detection model and the microblog vector set $\bar{D}$; each microblog is assigned to the event cluster of maximum similarity, completing event detection; finally, for each event cluster, the microblog with the maximum similarity is selected as a summary displaying the content of the event.
The specific sub-process is as follows:
S41: Similarity calculation
From the learned event matrix E, the similarity between each microblog and each event vector is computed; it is the basis of the subsequent event detection and event summarization. The similarity is computed as

$$s_{i,j} = \bar{d}_i \cdot e_j, \qquad \alpha_{i,j} = \frac{\exp(s_{i,j})}{\sum_{j'=1}^{k} \exp(s_{i,j'})}$$

where $\alpha_{i,j}$ denotes the similarity between microblog $d_i$ and event $e_j$;
S42: Event detection
The greater the similarity, the higher the probability that the microblog belongs to the event, and each microblog is assigned to the event cluster of maximum probability. The event set $S_i$ is calculated as shown below; in this example 20 events are to be mined, so i ranges from 1 to 20:

$$S_i = \{\, d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D| \,\}$$

where r is the number of the text.
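A sketch of S41 and S42 given the microblog vectors and the event matrix from the previous steps; the softmax normalization matches the similarity formula as reconstructed above.

```python
import torch

def detect_events(d_bar, E):
    """S41-S42 sketch: alpha[r, j] is the normalized similarity between
    microblog r and event j; each microblog joins its argmax event."""
    alpha = torch.softmax(d_bar @ E.T, dim=1)  # (|D|, k)
    assignment = alpha.argmax(dim=1)           # event index per microblog
    return alpha, assignment
```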
Finally, microblog sample 1 in Table 1 is assigned to event 15, whose texts concern the black box being found and delivered to France for analysis, and sample 2 is assigned to event 8, whose texts concern the drop in Boeing's share price.
S43: event summarization
For each set SiAnd a representative microblog is selected as an abstract of the event, so that the expression effect of the content of the event is improved. Remember hiAs a set of events SiThe abstract sentence of (1) is labeled, then hiThe calculation formula of (a) is as follows:
hi=argmaxkk,i),s.t.dk∈Si
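Continuing the sketch, the summary of each event cluster is its member microblog with the highest similarity to the event vector; the function and variable names are assumptions.

```python
def summarize_events(alpha, assignment, texts, k=20):
    """S43 sketch: h_i = argmax over d_k in S_i of alpha[k, i]."""
    summaries = {}
    for i in range(k):
        members = (assignment == i).nonzero(as_tuple=True)[0]
        if len(members) == 0:
            continue                      # event i attracted no text
        best = members[alpha[members, i].argmax()]
        summaries[i] = texts[int(best)]
    return summaries
```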
since there are many events, it is not convenient to show them all, so the above mentioned summaries of event 8 and event 15 are chosen to be shown, as shown in table 3.
Table 3 event summary example
Figure BDA0002578877740000081
In summary, the invention provides an event detection and summarization method based on a pre-trained language model. The above description is intended only to help understand the method of the invention and its core idea; meanwhile, for a person skilled in the art, the specific embodiments and scope of application may change according to the idea of the invention, so the content of this specification should not be construed as a limitation of the invention. Any modification, equivalent replacement or improvement made without departing from the spirit and scope of the invention shall fall within its protection scope, and the appended claims are intended to cover all such changes and modifications that fall within their scope and bounds or the equivalents thereof.

Claims (8)

1. An event detection and summarization method based on a pre-trained language model, characterized by comprising the following steps:
S1: preprocessing the input social media text, deleting unneeded information in the text and segmenting the text into words, wherein the specific sub-process comprises the following steps:
S11: denoting the input social media text set as $D = \{d_1, d_2, \dots, d_{|D|}\}$, acquiring the corresponding comments for each text in D and recording the obtained comment text set as $C = \{c_1, c_2, \dots, c_{|C|}\}$, there being $|D|$ social media texts and $|C|$ comment texts in total, and deleting short links and irrelevant "@user" information from the texts using regular expressions;
S12: segmenting the texts with a word segmentation tool and deleting low-frequency words to obtain the corresponding word-sequence sets $D'$ and $C'$:

$$D' = \{(w_{1,1}, \dots, w_{1,m_1}), \dots, (w_{|D|,1}, \dots, w_{|D|,m_{|D|}})\}$$

$$C' = \{(w_{1,1}, \dots, w_{1,k_1}), \dots, (w_{|C|,1}, \dots, w_{|C|,k_{|C|}})\}$$

where $w$ denotes a word in a sentence; $m$ and $k$ denote the lengths of the texts in $D'$ and $C'$ respectively, their subscripts denoting the text number, $m_1$ and $k_1$ being the lengths of the first texts in $D'$ and $C'$; and $w_{i,j}$ denotes the $j$-th word of the $i$-th text, so that $w_{1,m_1}$ and $w_{1,k_1}$ are the $m_1$-th and $k_1$-th words, i.e. the last words of the first texts in $D'$ and $C'$;
S2: using the BERT model as an encoder to vectorize each word $w$ in the input $D'$ and $C'$ and mine the semantic and structural information of the text, wherein the specific sub-process is as follows:
S21: determining the length $n$ of the texts in the $D'$ and $C'$ sets;
S22: adding marker symbols at the beginning and end of all texts in $D'$ and $C'$, keeping only the first $n$ words of texts longer than $n$ and padding texts shorter than $n$ with a supplementary mark so that they meet the length $n$, obtaining the updated $D'$ and $C'$:

$$D' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|D|,1}, w_{|D|,2}, \dots, w_{|D|,n})\}$$

$$C' = \{(w_{1,1}, w_{1,2}, \dots, w_{1,n}), \dots, (w_{|C|,1}, w_{|C|,2}, \dots, w_{|C|,n})\};$$
S23: mining the semantic and structural information of the texts with the BERT model to obtain vectorized representations of the input $D'$ and $C'$: each text in $D'$ yields a corresponding vector $\bar{d}_i$, giving the vector set $\bar{D} = \{\bar{d}_1, \bar{d}_2, \dots, \bar{d}_{|D|}\}$, the subscript being the number of the text; likewise each text in $C'$ yields a corresponding vector $\bar{c}_i$, giving the vector set $\bar{C} = \{\bar{c}_1, \bar{c}_2, \dots, \bar{c}_{|C|}\}$;
S3: based on the vectorized text sets $\bar{D}$ and $\bar{C}$, the event detection model obtains the event vectors using a convolutional neural network combined with a memory network, trained with a loss function, wherein the specific sub-process is as follows:
S31: inputting $\bar{C}$ into the convolutional neural network to obtain the global key-event feature $\bar{v}$, the convolution formula being

$$v_i = f(w \cdot \bar{c}_{i:i+h-1} + b)$$

where $w$ is the weight matrix, $h$ the size of the convolution kernel, $b$ the bias and $f$ the activation function, $v_i$ being the event feature obtained by the convolution, max-pooling over the $v_i$ yielding $\bar{v}$;
S32: taking the text representation vectors in $\bar{D}$ as external information and, combined with the key-event information $\bar{v}$ in the comments, inputting them into the memory network, learning the event representation matrix $E = [e_1, e_2, \dots, e_k]$, where $e$ is an event vector and $k$ a predetermined hyper-parameter denoting that $k$ events are to be mined;
S33: finally, inputting the text vectors of $\bar{D}$ into a decoder with the GRU as its basic unit, restoring the input sequence $D'$, and completing the training of the event detection model with the associated preset loss function;
s4: representing matrices E and E based on events according to the trained event detection model
Figure FDA0002578877730000025
The text in (1) represents vectors to calculate similarity, completes event detection and event summarization, and displays events, and the specific sub-processes are as follows:
S41: from the event representation matrix $E$, computing the similarity between each text vector in $\bar{D}$ and each event vector and normalizing it:

$$\alpha_{i,j} = \frac{\exp(\bar{d}_i \cdot e_j)}{\sum_{j'=1}^{k} \exp(\bar{d}_i \cdot e_{j'})}$$

where $\alpha_{i,j}$ denotes the similarity between the social media text $\bar{d}_i$ and $e_j$;
S42: realizing event detection from the similarities $\alpha$, specifically clustering the texts according to $\alpha$, each category representing one event, e.g. the $i$-th event $S_i$ can be expressed as $S_i = \{d_r \mid i = \arg\max_k(\alpha_{r,k}),\ 1 \le r \le |D|\}$, where $r$ is the number of the social media text and $i$ the event number, $1 \le i \le k$;
S43: according to each event set SiThe text with the maximum similarity is selected as the abstract of the event content.
2. The event detection and summarization method based on a pre-trained language model as claimed in claim 1, wherein in step S11, |D| = 21300 and |C| = 10000.
3. The event detection and summarization method based on a pre-trained language model as claimed in claim 2, wherein the length n of the texts in the D′ and C′ sets determined in step S21 is 51; step S22 specifically truncates texts longer than n to their first n words, pads texts shorter than n with the special "PAD" symbol, and adds the markers "START" and "END" at the beginning and end of each text, the updated texts being recorded as D′ and C′; and step S23 takes the [CLS] output of BERT as the text representation vector, vectorization of the texts yielding the new text vector sets $\bar{D}$ and $\bar{C}$ with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 the number of social media texts and 10000 the number of comment texts.
4. The event detection and summarization method based on a pre-trained language model as claimed in claim 3, wherein step S31 specifically extracts the key information in the comments with a convolutional network: three groups of convolutions are set, each group having 128 convolution kernels with corresponding widths of 3, 4 and 5, the kernel length of 768 matching the output dimension of the BERT model and the sliding stride being 1; the outputs of the three groups are recorded as $V_1, V_2, V_3$ with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively; then, from the convolution outputs, $V_1, V_2, V_3$ are max-pooled by row to obtain three 128-dimensional vectors, which are spliced into one 384-dimensional vector, finally extracting the global key-event feature $\bar{v}$.
5. The event detection and summarization method based on a pre-trained language model as claimed in claim 4, wherein in step S32, after the global key-event feature $\bar{v}$ is obtained, the number k of events to mine is determined in advance; first, k linear transformations are applied to $\bar{v}$ to obtain the event features of different semantic spaces, $\bar{v}_i = W_i \bar{v}$, where the $W_i$ are parameters to be learned by the event detection model, of dimension 384 × 384; then $\bar{v}_i$ is taken as the query vector of the memory network and $\bar{D}$ as the external information, and the vectorized representation of $e_i$ is computed as

$$g_m = \bar{v}_i^{\top} (W_K \bar{d}_m)$$

$$\alpha_m = \frac{\exp(g_m)}{\sum_{m'=1}^{|D|} \exp(g_{m'})}$$

$$t_i = \sum_{m=1}^{|D|} \alpha_m \, (W_V^{\top} \bar{d}_m)$$

$$\beta = \mathrm{MLP}([\bar{v}_i ; t_i])$$

$$e_i = W_E^{\top} \big( \beta \, t_i + (1 - \beta) \, \bar{v}_i \big)$$

where $\alpha_m$ is the attention of the m-th text to the event feature, from which the key-event information $t_i$ contained in the input $\bar{D}$ is obtained; finally a fully connected network MLP computes the gate value $\beta$ of this information, the network having three layers, the first two with 128 neurons each and the last with a single neuron, and the event vector $e_i$ is obtained according to $\beta$; $W_K$, $W_V$ and $W_E$ are all obtained by training the event detection model, with dimensions 384 × 768, 768 × 384 and 384 × 768 respectively; following this calculation, k event vectors are obtained and spliced into the final event representation matrix $E = [e_1, e_2, \dots, e_k]$, with k = 20, so the dimension of the matrix E is 20 × 768.
6. The event detection and summarization method based on a pre-trained language model as claimed in claim 5, wherein step S33 specifically restores the input D′ with a decoder based on the GRU structure, whose output dimension of 4310 matches the total number of words, the word output at the current time step being used as the input word of the next time step; the event detection model needs a loss function to learn, as follows:

$$L_a = -\sum_{w} q(w) \log p(w)$$

$$L_o = \| E E^{\top} - I \|$$

$$L = 0.5 \, L_a + 0.5 \, L_o$$

where $L$ is the final loss function, composed of the two parts $L_a$ and $L_o$: $L_a$ is the loss function of the decoder, ensuring its accuracy, with $w$ a decoder-generated word, $p(w)$ the probability of the word generated by the decoder and $q(w)$ the actual word probability in the text; $L_o$ is a loss function on the event vectors that ensures that different events are uncorrelated.
7. The event detection and summarization method based on a pre-trained language model according to claim 6, wherein in step S42 the greater the similarity, the higher the probability of the event to which the social media text belongs; each social media text is assigned to the event cluster of maximum event probability, and 20 events are set to be mined, so i ranges from 1 to 20.
8. The event detection and summarization method based on a pre-trained language model as claimed in claim 7, wherein step S43 specifically selects, for each set $S_i$, a representative social media text as the summary of the event, improving the presentation of the event content; with $h_i$ recorded as the label of the summary sentence of event set $S_i$, $h_i$ is calculated as

$$h_i = \arg\max_k(\alpha_{k,i}) \quad \text{s.t. } d_k \in S_i$$
CN202010661898.2A 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model Active CN111966917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661898.2A CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010661898.2A CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Publications (2)

Publication Number Publication Date
CN111966917A true CN111966917A (en) 2020-11-20
CN111966917B CN111966917B (en) 2022-05-03

Family

ID=73361680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661898.2A Active CN111966917B (en) 2020-07-10 2020-07-10 Event detection and summarization method based on pre-training language model

Country Status (1)

Country Link
CN (1) CN111966917B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116054A1 (en) * 2013-12-02 2017-04-27 Qbase, LLC Event detection through text analysis using dynamic self evolving/learning module
CN110147452A (en) * 2019-05-17 2019-08-20 北京理工大学 A kind of coarseness sentiment analysis method based on level BERT neural network
CN111026861A (en) * 2019-12-10 2020-04-17 腾讯科技(深圳)有限公司 Text abstract generation method, text abstract training method, text abstract generation device, text abstract training device, text abstract equipment and text abstract training medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUANDAN CHEN et al.: "An Encoder-Memory-Decoder Framework for Sub-Event Detection in Social Media", 2018 Association for Computing Machinery *
施喆尔 et al.: "Event detection based on language models and recurrent convolutional neural networks" [基于语言模型及循环卷积神经网络的事件检测], Journal of Xiamen University (Natural Science Edition) [厦门大学学报(自然科学版)] *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597366B (en) * 2020-11-25 2022-03-18 中国电子科技网络信息安全有限公司 Encoder-Decoder-based event extraction method
CN112597366A (en) * 2020-11-25 2021-04-02 中国电子科技网络信息安全有限公司 Encoder-Decoder-based event extraction method
CN112836486A (en) * 2020-12-09 2021-05-25 天津大学 Group hidden-in-field analysis method based on word vectors and Bert
CN112836486B (en) * 2020-12-09 2022-06-03 天津大学 Group hidden-in-place analysis method based on word vectors and Bert
CN112528650A (en) * 2020-12-18 2021-03-19 恩亿科(北京)数据科技有限公司 Method, system and computer equipment for pretraining Bert model
CN112528650B (en) * 2020-12-18 2024-04-02 恩亿科(北京)数据科技有限公司 Bert model pre-training method, system and computer equipment
CN112597269A (en) * 2020-12-25 2021-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Stream data event text topic and detection system
CN112699675A (en) * 2020-12-30 2021-04-23 平安科技(深圳)有限公司 Text processing method, device and equipment and computer readable storage medium
CN112699675B (en) * 2020-12-30 2023-09-12 平安科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium
CN112949318A (en) * 2021-03-03 2021-06-11 电子科技大学 Text position detection method based on text and user representation learning
CN112949318B (en) * 2021-03-03 2022-03-25 电子科技大学 Text position detection method based on text and user representation learning
CN113254636A (en) * 2021-04-27 2021-08-13 上海大学 Remote supervision entity relationship classification method based on example weight dispersion
CN112966115A (en) * 2021-05-18 2021-06-15 东南大学 Active learning event extraction method based on memory loss prediction and delay training
CN113434662B (en) * 2021-06-24 2022-06-24 平安国际智慧城市科技股份有限公司 Text abstract generating method, device, equipment and storage medium
CN113434662A (en) * 2021-06-24 2021-09-24 平安国际智慧城市科技股份有限公司 Text abstract generation method, device, equipment and storage medium
CN113434632A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Text completion method, device, equipment and storage medium based on language model
CN113806528A (en) * 2021-07-07 2021-12-17 哈尔滨工业大学(威海) Topic detection method and device based on BERT model and storage medium
CN113688230A (en) * 2021-07-21 2021-11-23 武汉众智数字技术有限公司 Text abstract generation method and system
CN114357022A (en) * 2021-12-23 2022-04-15 北京中视广信科技有限公司 Media content association mining method based on event relation discovery
CN114357022B (en) * 2021-12-23 2024-05-07 北京中视广信科技有限公司 Media content association mining method based on event relation discovery
CN116050383A (en) * 2023-03-29 2023-05-02 珠海金智维信息科技有限公司 Financial product sales link flyer call detection method and system
CN117670571A (en) * 2024-01-30 2024-03-08 昆明理工大学 Incremental social media event detection method based on heterogeneous message graph relation embedding
CN117670571B (en) * 2024-01-30 2024-04-19 昆明理工大学 Incremental social media event detection method based on heterogeneous message graph relation embedding

Also Published As

Publication number Publication date
CN111966917B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN108733742B (en) Global normalized reader system and method
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
CN107808011B (en) Information classification extraction method and device, computer equipment and storage medium
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN111291188B (en) Intelligent information extraction method and system
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
CN110263325A (en) Chinese automatic word-cut
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN112287197A (en) Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases
CN115659947A (en) Multi-item selection answering method and system based on machine reading understanding and text summarization
CN113987175A (en) Text multi-label classification method based on enhanced representation of medical topic word list
CN110674293B (en) Text classification method based on semantic migration
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Xu et al. Research on Depression Tendency Detection Based on Image and Text Fusion
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
CN116049349A (en) Small sample intention recognition method based on multi-level attention and hierarchical category characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant