CN111966917A - Event detection and summarization method based on pre-training language model - Google Patents
- Publication number
- CN111966917A (application CN202010661898.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- event
- events
- event detection
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an event detection and summarization method based on a pre-trained language model. Built for social media platforms, the method detects key events within hot topics, improves event detection performance, and uses event summaries to better represent event content. The method comprises S1: preprocessing the text; S2: vectorizing the text; S3: training an event detection model; and S4: displaying the mined events. The invention uses a pre-trained language model to mine the semantic and structural information of the input text, improving its representation, and completes the event detection and summarization tasks with subsequent neural networks, improving the precision and recall of event detection and the semantic representation of event content.
Description
Technical Field
The invention relates to the field of data mining and natural language processing, in particular to an event detection and summarization method based on a pre-training language model.
Background
With the development of the internet, social media has become part of daily life. People discuss trending topics on social media platforms to keep up with social developments, and these platforms have become an important source of information for the public. As application requirements in internet public opinion analysis and information security continue to grow, it is increasingly important to understand topics at a finer granularity, at a deeper level, and from more angles.
A topic is composed of a set of related events, and a series of related events drives a topic's development and change. Faced with massive amounts of information, extracting the related events contained in hot topics helps reveal how a topic develops and lets people follow its trajectory. Effectively mining the topic events contained in text has therefore become a serious challenge.
Event detection is essentially a clustering process that groups texts into clusters, with each cluster representing an event. Event detection algorithms fall broadly into two categories. Document-based methods detect events by clustering documents according to semantic distance, for example computing text similarity with a TF-IDF vector space model and then clustering the text stream with the Single-Pass algorithm to detect occurring events. Feature-based methods study word distributions and discover event keywords, for example mining event keywords with topic models and related refinements while soft-clustering documents according to the probability of the event each document belongs to.
Both approaches operate only at the word level and cannot mine the deeper information in a document, so event detection suffers. Document-based methods rely on word-level similarity comparison, cannot handle near-synonyms and synonyms, underuse the topical semantic information implicit in a document, and cannot balance the lexical and semantic information of the document. Feature-based methods depend on feature selection; most social media texts are short, so word co-occurrence is sparse, which degrades topic models. In addition, both kinds of methods represent event content with keywords, whose semantics are vague and prone to ambiguity.
Disclosure of Invention
At present, pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) have effectively improved results on related tasks in natural language processing, and neural networks can effectively model texts and process their semantic and structural information. Therefore, addressing the problems of current methods, the invention uses a pre-trained language model to mine the semantic information of documents and completes event detection and event summarization with subsequent neural networks, improving event detection and representation.
The invention provides an event detection and summarization method based on a pre-trained language model. The method improves event detection and summarization by mining the semantic and structural information of text: a pre-trained language model encodes the input text, capturing its semantic and structural information; a subsequent neural network clusters the texts, detects the events within topics, and summarizes them at the same time. The invention achieves better precision and recall on the event detection task, and the summaries improve the representation of event content.
The invention comprises the following steps:
S1: preprocess the input social media text, deleting unneeded information in the text and segmenting it.
The specific sub-process is as follows:
S11: denote the input social media text set as D = {d_1, d_2, …, d_{|D|}}; for each text in D, acquire the corresponding comments and denote the resulting comment text set as C = {c_1, c_2, …, c_{|C|}}, giving |D| social media texts and |C| comment texts in total; use regular expressions to delete short links and irrelevant information such as @-mentions of other users from the texts;
S12: segment the texts with a word segmentation tool and delete low-frequency words to obtain the corresponding word sequence sets D′ = {(w_{1,1}, …, w_{1,m_1}), …, (w_{|D|,1}, …, w_{|D|,m_{|D|}})} and C′ = {(w_{1,1}, …, w_{1,k_1}), …, (w_{|C|,1}, …, w_{|C|,k_{|C|}})}, where w denotes a word in a sentence, m and k denote the lengths of the texts in D′ and C′ respectively with subscripts indexing the texts (e.g., m_1 and k_1 are the lengths of the first texts in D′ and C′), and w_{i,j} denotes the j-th word of the i-th text (e.g., w_{1,m_1} and w_{1,k_1} are the last words of the first texts in D′ and C′).
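An illustrative sketch of S11 and S12. The regular expressions, the whitespace tokenizer standing in for a real segmenter such as jieba, and the frequency threshold are all assumptions for the sake of a self-contained example, not the patent's exact choices:

```python
import re
from collections import Counter

def clean_text(text):
    # Delete short links and @-mentions of other users
    # (these regex patterns are illustrative assumptions)
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\S+", "", text)
    return text.strip()

def segment_and_filter(texts, min_freq=2):
    # Stand-in for a jieba-style segmenter: whitespace tokenization,
    # followed by deletion of low-frequency words
    token_lists = [clean_text(t).split() for t in texts]
    freq = Counter(w for toks in token_lists for w in toks)
    return [[w for w in toks if freq[w] >= min_freq] for toks in token_lists]

docs = ["crash site found http://t.cn/abc", "@user crash site confirmed"]
print(segment_and_filter(docs))  # → [['crash', 'site'], ['crash', 'site']]
```

The output keeps only the words that appear at least `min_freq` times across the corpus, mirroring the low-frequency-word deletion in S12.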
S2: vectorize each word w in the input D′ and C′ using the BERT model as an encoder, mining the semantic and structural information of the text. The specific sub-process is as follows:
S21: determine the text length n for the D′ and C′ sets;
S22: add marker symbols to the beginning and end of all texts in the D′ and C′ sets; texts longer than n keep only their first n words, and texts shorter than n are padded with a supplementary marker so that all texts have length n, giving the updated D′ and C′:
D′ = {(w_{1,1}, w_{1,2}, …, w_{1,n}), …, (w_{|D|,1}, w_{|D|,2}, …, w_{|D|,n})}
C′ = {(w_{1,1}, w_{1,2}, …, w_{1,n}), …, (w_{|C|,1}, w_{|C|,2}, …, w_{|C|,n})}
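A minimal sketch of the S22 length-fixing step. The patent states only that texts are truncated to their first n words or padded to length n and given begin/end markers; where exactly the markers sit relative to the padding is an assumption here:

```python
def fix_length(tokens, n, start="START", end="END", pad="PAD"):
    # Keep at most n content words, then add boundary markers and padding
    tokens = tokens[:n]
    padded = [start] + tokens + [end] + [pad] * (n - len(tokens))
    return padded  # always has length n + 2

print(fix_length(["a", "b"], 4))  # → ['START', 'a', 'b', 'END', 'PAD', 'PAD']
```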
S23: mine the semantic and structural information of the texts with the BERT model to obtain vectorized representations of the input D′ and C′: each text in D′ yields a corresponding vector d̄_i, giving the vector set D̄ = {d̄_1, …, d̄_{|D|}} (subscripts index the texts), and likewise each text in C′ yields a corresponding vector c̄_i, giving the vector set C̄ = {c̄_1, …, c̄_{|C|}}.
S3: based on the vectorized text sets D̄ and C̄, obtain the event vectors with a convolutional neural network combined with a memory network and complete the training of the model. The specific sub-process is as follows:
S31: feed C̄ into a convolutional neural network to obtain the global key-event feature g. The convolution formula is v_i = f(w · x_{i:i+h−1} + b), where w is the weight matrix, h is the size of the convolution kernel, b is the bias, f is the activation function, and v_i is the event feature obtained by the convolution;
S32: take the text vectors in D̄ as external information and, together with the key-event information g from the comments, feed them into a memory network; learning yields the event representation matrix E = [e_1, e_2, …, e_k], where each e is an event vector and k is a predetermined hyper-parameter giving the number of events to mine;
S33: finally, feed the text vectors in D̄ into a decoder with the GRU as its basic unit to reconstruct the input sequence D′, and complete the training of the event detection model with the associated preset loss function.
S4: using the trained event detection model, compute similarities between the event representation matrix E and the text vectors in D̄ to complete event detection and event summarization, and display the events. The specific sub-process is as follows:
S41: according to the event representation matrix E, compute the similarity between each text vector in D̄ and each event vector and normalize it to obtain α, where α_{i,j} denotes the similarity between d̄_i and e_j;
S42: realize event detection from the similarities α: cluster the texts by α, each category representing an event; for example the i-th event S_i can be represented as S_i = {d_r | i = argmax_k(α_{r,k}), 1 ≤ r ≤ |D|}, where r indexes the texts and i is the event number, 1 ≤ i ≤ k.
S43: within each event set S_i, select the text with the highest similarity as the summary of the event's content.
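Steps S41–S43 can be sketched as follows. Softmax normalization of dot-product similarities is an assumption consistent with the "normalized" similarity in S41, and the small vectors are toy stand-ins for the BERT text vectors and the learned event matrix E:

```python
import numpy as np

def detect_and_summarize(doc_vecs, event_matrix):
    # doc_vecs: |D| x d text vectors; event_matrix: k x d event vectors E
    scores = doc_vecs @ event_matrix.T                 # raw similarities
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # normalized
    assignment = alpha.argmax(axis=1)                  # event id per text (S42)
    summaries = {}
    for i in range(event_matrix.shape[0]):
        members = np.where(assignment == i)[0]
        if members.size:
            # summary = member text with the highest similarity to event i (S43)
            summaries[i] = members[alpha[members, i].argmax()]
    return assignment, summaries

D = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
E = np.array([[1.0, 0.0], [0.0, 1.0]])
assignment, summaries = detect_and_summarize(D, E)
print(assignment, summaries)
```

Here the first two texts cluster into event 0 and the third into event 1, and each cluster's most similar text is chosen as its summary.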
The event detection and summarization method based on a pre-trained language model can mine the key events within a topic and reveal the topic's development. The invention vectorizes text with the BERT model, mines the semantic and structural information of the input text, completes event detection and summarization with the subsequent convolutional and memory networks, and improves event detection and representation. The model is unsupervised and requires no additional data annotation.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a flowchart of an algorithm for vectorizing text using the BERT model;
FIG. 3 is a schematic diagram of a BERT model;
FIG. 4 is a flow diagram of an event detection model;
FIG. 5 is a schematic diagram of an event detection model;
FIG. 6 is a flowchart of event detection and summarization.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such examples, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
As described above, the event detection and summarization method based on the pre-trained language model provided by the invention can improve the accuracy and recall rate of event detection and the representation effect of event content.
As shown in fig. 1, the embodiment of the invention takes a collected microblog data set as an example. The data set contains 21300 microblog texts on an airline accident topic, denoted D, and 10000 screened popular comment texts corresponding to those texts, denoted C. As shown in table 1, two microblog texts are selected for display; the comment texts are similar in form to the microblog texts.
TABLE 1 microblog text example
S1: preprocessing an input microblog text set D and a comment set C, wherein the specific sub-processes are as follows:
S11: text filtering

Microblog texts contain a wide range of content and varied information. Unneeded information in D and C is deleted with regular expressions; for example, text sample 1 contains @-mentions of other users, and data of the form "@xx" is identified and deleted with a regular expression;
S12: text word segmentation

The texts are segmented with the jieba word segmentation tool in accurate mode; the frequencies of all words are then counted, words occurring fewer than 5 times are deleted, and 4310 words are finally retained, giving the word sequence sets D′ and C′ of the preprocessed input text. The processed text samples of table 1 are shown in table 2.
TABLE 2 word sequences after preprocessing
Referring to fig. 2 and 3, S2 of the invention vectorizes the input text with a pre-trained BERT model. This stage takes the processed word sequence sets D′ and C′ as input and, after processing by the BERT model, yields the text vector sets D̄ and C̄. The specific sub-process is as follows:
S21: determining the text length

Neural network training requires a fixed input length. All microblog lengths are sorted and the third quartile is taken as the maximum length; for the data set used in this example, the maximum length is n = 51;
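The third-quartile length choice can be sketched with numpy. The length values below are made-up examples, not the patent's data:

```python
import numpy as np

# Hypothetical microblog lengths in words; the real data set has 21300 texts
lengths = np.array([12, 20, 30, 45, 51, 51, 60, 80])
n = int(np.percentile(lengths, 75))  # third quartile as the fixed input length
print(n)  # → 53
```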
S22: fixing the text length

Model inputs must have a uniform size: texts longer than n keep only their first n words, texts shorter than n are padded with the special "PAD" marker, and the markers "START" and "END" are added at the beginning and end of each text; the updated sets are recorded as D′ and C′;
S23: BERT model vectorization

D′ and C′ are fed into the pre-trained BERT model, and BERT's [CLS] output vector is taken as the text representation; the model structure is shown in fig. 3. Vectorizing the texts yields the new text vector sets D̄ and C̄, with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 corresponds to the number of microblog texts, and 10000 to the number of comment texts;
referring to fig. 4 and 5, S3 of the present invention learns the event representation vector using neural network training, because the comments often relate to key events in the topic, and therefore the information of the key events in the comments is extracted using a convolutional networkAnd then combining with the learning of a memory network to obtain an event vector matrix E. The specific sub-process is as follows:
S31: convolutional network extracts event features

Key information in the comments is extracted with a convolutional network. Three groups of convolutions are set, each group with 128 convolution kernels of widths 3, 4 and 5 respectively; the kernel length of 768 matches the output dimension of the BERT model, and the sliding stride is 1. The outputs of the three groups are denoted V_1, V_2, V_3, with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively. Max pooling is then applied to V_1, V_2, V_3 row by row, giving three 128-dimensional vectors, which are concatenated into one 384-dimensional vector; the result is the feature vector g containing the global key-event information.
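A shape-level sketch of S31. Random weights stand in for the trained kernels, tanh is an assumed activation, and the matrix-multiply formulation is equivalent to width-h convolutions with stride 1 over the sequence of comment vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_maxpool(X, widths=(3, 4, 5), n_kernels=128, dim=768):
    # X: num_comments x dim matrix of comment text vectors
    feats = []
    for h in widths:
        W = rng.standard_normal((n_kernels, h * dim)) * 0.01
        # All width-h windows, stride 1: (len(X)-h+1) windows of h*dim values
        windows = np.stack([X[i:i + h].ravel() for i in range(len(X) - h + 1)])
        V = np.tanh(windows @ W.T)      # feature map, (len(X)-h+1) x n_kernels
        feats.append(V.max(axis=0))     # max-pool over positions, per kernel
    return np.concatenate(feats)        # 3 groups * 128 kernels = 384-dim g

X = rng.standard_normal((10, 768))      # toy stand-in for 10000 comment vectors
g = conv_maxpool(X)
print(g.shape)  # (384,)
```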
S32: memory network obtains the event representation matrix E

After the global key-event feature g is obtained, the number k of events to mine must be determined in advance. First, g undergoes k linear transformations to obtain event features in different semantic spaces, q_i = W_i g, where the W_i are parameters of the event detection model to be learned, each of dimension 384 × 384. Then, as shown in fig. 5, q_i serves as the query vector of the memory network and the text vectors in D̄ as its external information, from which the vectorized representation e_i is computed: α_m is the attention of the m-th text to the event feature, and from the attention weights the key-event information t_i contained in the input is obtained; finally a gating value β is computed with a fully connected network MLP of three layers, the first two containing 128 neurons and the last a single neuron, and the event vector e_i is obtained from β. The matrices W_K, W_V, W_E are all learned in training the event detection model, with dimensions 384 × 768, 768 × 384 and 384 × 768 respectively.

Following this procedure, k event vectors are obtained and concatenated into the final event representation matrix E = [e_1, e_2, …, e_k]. In this example k = 20, so the dimension of matrix E is 20 × 768.
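The memory-network step can be sketched as follows. The exact attention and gating formulas are not reproduced in this text, so this is only a plausible reconstruction consistent with the stated dimensions (W_K: 384×768, W_V: 768×384, W_E: 384×768); the mean-based scalar gate is a stand-in for the three-layer MLP:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def event_vector(q_i, X, W_K, W_V, W_E):
    # q_i: 384-dim query (one linear transform of the global feature g);
    # X: m x 768 text vectors serving as the external memory
    alpha = softmax(X @ (W_K.T @ q_i))   # attention of each text to the event
    t_i = (alpha @ X) @ W_V              # 384-dim key-event information
    # Scalar gate in (0, 1); the patent uses a small MLP here
    beta = 1.0 / (1.0 + np.exp(-t_i.mean()))
    return (beta * t_i + (1 - beta) * q_i) @ W_E  # 768-dim event vector

m = 50
X = rng.standard_normal((m, 768))
q_i = rng.standard_normal(384)
W_K = rng.standard_normal((384, 768)) * 0.01
W_V = rng.standard_normal((768, 384)) * 0.01
W_E = rng.standard_normal((384, 768)) * 0.01
e_i = event_vector(q_i, X, W_K, W_V, W_E)
print(e_i.shape)  # (768,)
```

Repeating this for each of the k queries q_i and stacking the results gives the 20 × 768 event matrix E of the example.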
S33: decoder reconstructs the text set D′

Finally, a decoder reconstructs the input D′. The decoder is based on the GRU structure; its output dimension of 4310 matches the total number of words, and the output word at the current step is fed as the input word at the next step. The neural network is trained with the following loss function:
L_a = −Σ_w q(w) log p(w)

L_o = ||EE^T − I||

L = 0.5 · L_a + 0.5 · L_o

where L is the final loss, composed of the two parts L_a and L_o. L_a is the loss of the decoder, ensuring its accuracy: w denotes a decoder-generated word, p(w) the decoder's generated word probability, and q(w) the actual word probability of the text. L_o is the loss on the event vectors, ensuring that different events remain uncorrelated (I is the identity matrix).
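The loss terms can be computed directly. L_a would come from the GRU decoder's word-level cross-entropy; here it is passed in as a number so the sketch stays self-contained:

```python
import numpy as np

def orthogonality_loss(E):
    # L_o = ||E E^T - I||: penalizes overlap between event vectors
    k = E.shape[0]
    return np.linalg.norm(E @ E.T - np.eye(k))

def total_loss(L_a, E):
    # L = 0.5 * L_a + 0.5 * L_o, the equal weighting used above
    return 0.5 * L_a + 0.5 * orthogonality_loss(E)

E = np.eye(3)  # perfectly orthonormal event vectors
print(orthogonality_loss(E))  # 0.0
```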
Referring to fig. 6, S4 of the invention displays the events. First, the similarity between the event matrix E learned by the event detection model and the microblog vector set D̄ is computed; each microblog is assigned to the event cluster of maximal similarity, completing event detection; finally, for each event cluster, the microblog with the highest similarity is selected as a summary to display the event's content.
The specific sub-process is as follows:
S41: similarity calculation

From the learned event matrix E, the similarity between each microblog and each event vector is computed; it is the basis of the subsequent event detection and event summarization. The similarity is calculated as

α_{i,j} = exp(d̄_i · e_j) / Σ_{l=1}^{k} exp(d̄_i · e_l)

where α_{i,j} denotes the similarity between microblog d_i and event e_j;
S42: event detection

The greater the similarity, the higher the probability that the microblog belongs to the event; each microblog is assigned to the event cluster of maximal event probability. The event set S_i is computed as shown below; in this example 20 events are set to be mined, so i ranges from 1 to 20:

S_i = {d_r | i = argmax_k(α_{r,k}), 1 ≤ r ≤ |D|},

where r is the index of the text.
In the end, microblog sample 1 of table 1 falls into event 15, whose texts concern the black box being recovered and sent to France for analysis, and sample 2 falls into event 8, whose texts concern the drop in Boeing's share price.
S43: event summarization

For each set S_i, a representative microblog is selected as the summary of the event, improving the presentation of the event's content. Let h_i be the index of the summary sentence of event set S_i; then h_i is calculated as

h_i = argmax_k(α_{k,i}), s.t. d_k ∈ S_i.
since there are many events, it is not convenient to show them all, so the above mentioned summaries of event 8 and event 15 are chosen to be shown, as shown in table 3.
Table 3 event summary example
In summary, the invention provides an event detection and summarization method based on a pre-trained language model. The description above is intended only to help understand the method of the invention and its core idea; those skilled in the art may vary the specific embodiments and the scope of application according to the idea of the invention, and the content of this specification should not be construed as limiting the invention. Any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention falls within its protection scope, and the appended claims are intended to cover all such changes and modifications and their equivalents.
Claims (8)
1. An event detection and summarization method based on a pre-trained language model, characterized by comprising the following steps:
S1: preprocess the input social media text, deleting unneeded information in the text and segmenting it, with the following sub-process:
S11: denote the input social media text set as D = {d_1, d_2, …, d_{|D|}}; for each text in D, acquire the corresponding comments and denote the resulting comment text set as C = {c_1, c_2, …, c_{|C|}}, giving |D| social media texts and |C| comment texts in total; delete short links and irrelevant information such as @-mentions of other users from the texts with regular expressions;
S12: segment the texts with a word segmentation tool and delete low-frequency words to obtain the corresponding word sequence sets D′ = {(w_{1,1}, …, w_{1,m_1}), …, (w_{|D|,1}, …, w_{|D|,m_{|D|}})} and C′ = {(w_{1,1}, …, w_{1,k_1}), …, (w_{|C|,1}, …, w_{|C|,k_{|C|}})}, where w denotes a word in a sentence, m and k denote the lengths of the texts in D′ and C′ respectively with subscripts indexing the texts, m_1 and k_1 are the lengths of the first texts in D′ and C′, and w_{i,j} denotes the j-th word of the i-th text, so that w_{1,m_1} and w_{1,k_1} are the last words of the first texts in D′ and C′;
S2: vectorize each word w in the input D′ and C′ using the BERT model as an encoder and mine the semantic and structural information of the texts, with the following sub-process:
S21: determine the text length n for the D′ and C′ sets;
S22: add marker symbols to the beginning and end of all texts in the D′ and C′ sets, keeping only the first n words of texts longer than n and padding texts shorter than n with a supplementary marker so that all texts have length n, obtaining the updated
D′ = {(w_{1,1}, w_{1,2}, …, w_{1,n}), …, (w_{|D|,1}, w_{|D|,2}, …, w_{|D|,n})}
C′ = {(w_{1,1}, w_{1,2}, …, w_{1,n}), …, (w_{|C|,1}, w_{|C|,2}, …, w_{|C|,n})};
S23: mine the semantic and structural information of the texts with the BERT model to obtain vectorized representations of the input D′ and C′: each text in D′ yields a corresponding vector d̄_i, giving the vector set D̄ = {d̄_1, …, d̄_{|D|}} (subscripts index the texts), and likewise each text in C′ yields a corresponding vector c̄_i, giving the vector set C̄ = {c̄_1, …, c̄_{|C|}};
S3: based on the vectorized text sets D̄ and C̄, the event detection model obtains event vectors with a convolutional neural network combined with a memory network and is trained with a loss function, with the following sub-process:
S31: feed C̄ into a convolutional neural network to obtain the global key-event feature g; the convolution formula is v_i = f(w · x_{i:i+h−1} + b), where w is the weight matrix, h is the size of the convolution kernel, b is the bias, f is the activation function, and v_i is the event feature obtained by the convolution;
S32: take the text representation vectors in D̄ as external information and, together with the key-event information g from the comments, feed them into the memory network; learning yields the event representation matrix E = [e_1, e_2, …, e_k], where each e is an event vector and k is a predetermined hyper-parameter giving the number of events to mine;
S33: finally, feed the text vectors in D̄ into a decoder with the GRU as its basic unit to reconstruct the input sequence D′, and complete the training of the event detection model with the associated preset loss function;
S4: using the trained event detection model, compute similarities between the event representation matrix E and the text representation vectors in D̄, complete event detection and event summarization, and display the events, with the following sub-process:
S41: according to the event representation matrix E, compute the similarity between each text vector in D̄ and each event vector and normalize it, where α_{i,j} denotes the similarity between the social media text vector d̄_i and event vector e_j;
S42: realize event detection from the similarities α: cluster the texts according to α, each category representing an event; for example the i-th event S_i can be represented as S_i = {d_r | i = argmax_k(α_{r,k}), 1 ≤ r ≤ |D|}, where r indexes the social media texts and i is the event number, 1 ≤ i ≤ k;
S43: within each event set S_i, select the text with the highest similarity as the summary of the event's content.
2. The event detection and summarization method based on a pre-trained language model according to claim 1, wherein in step S11, |D| = 21300 and |C| = 10000.
3. The event detection and summarization method based on a pre-trained language model according to claim 2, wherein the text length n determined in step S21 for the D′ and C′ sets is 51; step S22 specifically truncates texts longer than n to their first n words, pads texts shorter than n with the special "PAD" marker, and adds the markers "START" and "END" at the beginning and end of each text, recording the updated texts as D′ and C′; step S23 takes BERT's [CLS] output vector as the text representation, and vectorizing the texts yields the new text vector sets D̄ and C̄ with dimensions 21300 × 768 and 10000 × 768 respectively, where 768 is the output dimension of the BERT model, 21300 the number of social media texts, and 10000 the number of comment texts.
4. The event detection and summarization method based on a pre-trained language model according to claim 3, wherein step S31 specifically extracts the key information in the comments with a convolutional network: three groups of convolutions are set, each group with 128 convolution kernels of widths 3, 4 and 5 respectively, the kernel length of 768 matching the output dimension of the BERT model, with a sliding stride of 1; the outputs of the three groups are denoted V_1, V_2, V_3, with dimensions 128 × 9998, 128 × 9997 and 128 × 9996 respectively; max pooling is then applied to V_1, V_2, V_3 row by row to obtain three 128-dimensional vectors, which are concatenated into one 384-dimensional vector, finally yielding the global key-event feature g.
5. The event detection and summarization method based on a pre-trained language model according to claim 4, wherein in step S32, after the global key-event feature g is obtained, the number k of events to mine is determined in advance; first, g undergoes k linear transformations to obtain event features in different semantic spaces, q_i = W_i g, where the W_i are parameters of the event detection model to be learned, each of dimension 384 × 384; then q_i serves as the query vector of the memory network and the text vectors in D̄ as its external information, from which the vectorized representation e_i is computed: α_m is the attention of the m-th text to the event feature, from which the key-event information t_i contained in the input is obtained; finally a gating value β is computed with a fully connected network MLP of three layers, the first two containing 128 neurons and the last a single neuron, and the event vector e_i is obtained from β; W_K, W_V, W_E are all obtained by training the event detection model, with dimensions 384 × 768, 768 × 384 and 384 × 768 respectively; following this calculation, k event vectors are obtained and concatenated into the final event representation matrix E = [e_1, e_2, …, e_k]; with k = 20, the dimension of matrix E is 20 × 768.
6. The method for event detection and summarization based on a pre-trained language model as claimed in claim 5, wherein step S33 specifically comprises using a decoder to reconstruct the input D'. The decoder is based on a GRU structure; its output dimension is 4310, matching the total number of words, and the output word at the current time step is used as the input word at the next time step. The event detection model is trained with the following loss function:
L_o = ||EE^T - I||
L = 0.5 * L_a + 0.5 * L_o
wherein L denotes the final loss function, composed of the two parts L_a and L_o. L_a is the loss function for the decoder, which ensures the decoder's accuracy; w denotes a word generated by the decoder, p(w) the probability of the decoder generating w, and q(w) the actual word probability in the text. L_o is the loss function on the event vectors, which ensures irrelevancy between different events.
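The combined loss can be computed directly from the two formulas above. A minimal sketch, assuming L_a is a cross-entropy between the actual and generated word distributions (the claim describes it only in words) and taking ||·|| as the Frobenius norm, which the claim does not specify:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Decoder loss L_a: cross-entropy of generated word
    probabilities p against actual word probabilities q."""
    return -np.sum(q * np.log(p + eps))

def orthogonality_loss(E):
    """L_o = ||E E^T - I||: pushes the k event vectors toward
    mutual orthogonality so events stay uncorrelated."""
    k = E.shape[0]
    return np.linalg.norm(E @ E.T - np.eye(k))   # Frobenius norm (assumed)

E = np.eye(3, 5)                 # 3 orthonormal toy "event vectors": L_o = 0
p = np.array([0.7, 0.2, 0.1])    # decoder distribution over a 3-word toy vocab
q = np.array([1.0, 0.0, 0.0])    # actual next word is word 0
L = 0.5 * cross_entropy(p, q) + 0.5 * orthogonality_loss(E)
print(L)                         # here only the decoder term contributes
```

With orthonormal event vectors the L_o term vanishes, illustrating why minimizing L drives different events apart.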
7. The event detection and summarization method based on a pre-trained language model as claimed in claim 6, wherein in step S42 the greater the similarity, the higher the probability that the social media text belongs to the corresponding event; each social media text is assigned to the event cluster with the highest probability. Since 20 events are set to be mined, i ranges from 1 to 20.
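The cluster assignment of claim 7 reduces to an argmax over the similarity scores. A small sketch with a hypothetical similarity matrix (3 texts, 3 events instead of 20, purely for illustration):

```python
import numpy as np

def assign_events(sim):
    """sim[k, i]: similarity of text d_k to event e_i. Each text
    joins the event cluster with the highest similarity, hence
    the highest membership probability."""
    return sim.argmax(axis=1)

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.1],
                [0.1, 0.2, 0.6]])
print(assign_events(sim))   # [0 1 2]: each text joins its best event
```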
8. The pre-trained language model based event detection and summarization method of claim 7, wherein step S43 specifically comprises: for each event set S_i, selecting one representative social media text as the summary of that event, improving how the event's content is presented; h_i denotes the index of the summary sentence for event set S_i, and h_i is calculated according to the following formula:
h_i = argmax_k(α_(k,i)), s.t. d_k ∈ S_i.
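The formula above is a constrained argmax over the attention weights. A minimal sketch, with a toy attention matrix and hypothetical cluster memberships:

```python
import numpy as np

def summary_index(alpha, i, members):
    """h_i = argmax over k in S_i of alpha[k, i]: the member text
    attending most strongly to event i becomes its summary."""
    members = np.asarray(members)
    return int(members[alpha[members, i].argmax()])

alpha = np.array([[0.9, 0.1],    # alpha[k, i]: attention of text k to event i
                  [0.4, 0.3],
                  [0.2, 0.7]])
print(summary_index(alpha, 0, [0, 1]))  # texts 0 and 1 belong to event 0
print(summary_index(alpha, 1, [2]))     # text 2 belongs to event 1
```

The constraint d_k ∈ S_i is enforced by restricting the argmax to the indices of the event's own cluster before selecting the maximum.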
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010661898.2A CN111966917B (en) | 2020-07-10 | 2020-07-10 | Event detection and summarization method based on pre-training language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111966917A true CN111966917A (en) | 2020-11-20 |
CN111966917B CN111966917B (en) | 2022-05-03 |
Family
ID=73361680
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010661898.2A Active CN111966917B (en) | 2020-07-10 | 2020-07-10 | Event detection and summarization method based on pre-training language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111966917B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170116054A1 (en) * | 2013-12-02 | 2017-04-27 | Qbase, LLC | Event detection through text analysis using dynamic self evolving/learning module |
CN110147452A (en) * | 2019-05-17 | 2019-08-20 | 北京理工大学 | A kind of coarseness sentiment analysis method based on level BERT neural network |
CN111026861A (en) * | 2019-12-10 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Text abstract generation method, text abstract training method, text abstract generation device, text abstract training device, text abstract equipment and text abstract training medium |
Non-Patent Citations (2)
Title |
---|
GUANDAN CHEN et al.: "An Encoder-Memory-Decoder Framework for Sub-Event Detection in Social Media", 2018 Association for Computing Machinery |
SHI Zhe'er et al.: "Event detection based on language models and recurrent convolutional neural networks", Journal of Xiamen University (Natural Science Edition) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597366B (en) * | 2020-11-25 | 2022-03-18 | 中国电子科技网络信息安全有限公司 | Encoder-Decoder-based event extraction method |
CN112597366A (en) * | 2020-11-25 | 2021-04-02 | 中国电子科技网络信息安全有限公司 | Encoder-Decoder-based event extraction method |
CN112836486A (en) * | 2020-12-09 | 2021-05-25 | 天津大学 | Group hidden-in-field analysis method based on word vectors and Bert |
CN112836486B (en) * | 2020-12-09 | 2022-06-03 | 天津大学 | Group hidden-in-place analysis method based on word vectors and Bert |
CN112528650A (en) * | 2020-12-18 | 2021-03-19 | 恩亿科(北京)数据科技有限公司 | Method, system and computer equipment for pretraining Bert model |
CN112528650B (en) * | 2020-12-18 | 2024-04-02 | 恩亿科(北京)数据科技有限公司 | Bert model pre-training method, system and computer equipment |
CN112597269A (en) * | 2020-12-25 | 2021-04-02 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Stream data event text topic and detection system |
CN112699675A (en) * | 2020-12-30 | 2021-04-23 | 平安科技(深圳)有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN112699675B (en) * | 2020-12-30 | 2023-09-12 | 平安科技(深圳)有限公司 | Text processing method, device, equipment and computer readable storage medium |
CN112949318A (en) * | 2021-03-03 | 2021-06-11 | 电子科技大学 | Text position detection method based on text and user representation learning |
CN112949318B (en) * | 2021-03-03 | 2022-03-25 | 电子科技大学 | Text position detection method based on text and user representation learning |
CN113254636A (en) * | 2021-04-27 | 2021-08-13 | 上海大学 | Remote supervision entity relationship classification method based on example weight dispersion |
CN112966115A (en) * | 2021-05-18 | 2021-06-15 | 东南大学 | Active learning event extraction method based on memory loss prediction and delay training |
CN113434662B (en) * | 2021-06-24 | 2022-06-24 | 平安国际智慧城市科技股份有限公司 | Text abstract generating method, device, equipment and storage medium |
CN113434662A (en) * | 2021-06-24 | 2021-09-24 | 平安国际智慧城市科技股份有限公司 | Text abstract generation method, device, equipment and storage medium |
CN113434632A (en) * | 2021-06-25 | 2021-09-24 | 平安科技(深圳)有限公司 | Text completion method, device, equipment and storage medium based on language model |
CN113806528A (en) * | 2021-07-07 | 2021-12-17 | 哈尔滨工业大学(威海) | Topic detection method and device based on BERT model and storage medium |
CN113688230A (en) * | 2021-07-21 | 2021-11-23 | 武汉众智数字技术有限公司 | Text abstract generation method and system |
CN114357022A (en) * | 2021-12-23 | 2022-04-15 | 北京中视广信科技有限公司 | Media content association mining method based on event relation discovery |
CN114357022B (en) * | 2021-12-23 | 2024-05-07 | 北京中视广信科技有限公司 | Media content association mining method based on event relation discovery |
CN116050383A (en) * | 2023-03-29 | 2023-05-02 | 珠海金智维信息科技有限公司 | Financial product sales link flyer call detection method and system |
CN117670571A (en) * | 2024-01-30 | 2024-03-08 | 昆明理工大学 | Incremental social media event detection method based on heterogeneous message graph relation embedding |
CN117670571B (en) * | 2024-01-30 | 2024-04-19 | 昆明理工大学 | Incremental social media event detection method based on heterogeneous message graph relation embedding |
Also Published As
Publication number | Publication date |
---|---|
CN111966917B (en) | 2022-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111966917B (en) | Event detection and summarization method based on pre-training language model | |
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN108733742B (en) | Global normalized reader system and method | |
CN107729309B (en) | Deep learning-based Chinese semantic analysis method and device | |
CN107808011B (en) | Information classification extraction method and device, computer equipment and storage medium | |
US20120253792A1 (en) | Sentiment Classification Based on Supervised Latent N-Gram Analysis | |
CN111401061A (en) | Method for identifying news opinion involved in case based on BERT and BiLSTM-Attention
CN111291188B (en) | Intelligent information extraction method and system | |
CN109492105B (en) | Text emotion classification method based on multi-feature ensemble learning | |
CN112256866B (en) | Text fine-grained emotion analysis algorithm based on deep learning | |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications | |
CN110263325A (en) | Chinese automatic word-cut | |
Xing et al. | A convolutional neural network for aspect-level sentiment classification | |
CN112395421B (en) | Course label generation method and device, computer equipment and medium | |
CN112905736A (en) | Unsupervised text emotion analysis method based on quantum theory | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
CN112287197A (en) | Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases | |
CN115659947A (en) | Multi-item selection answering method and system based on machine reading understanding and text summarization | |
CN113987175A (en) | Text multi-label classification method based on enhanced representation of medical topic word list | |
CN110674293B (en) | Text classification method based on semantic migration | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
Xu et al. | Research on Depression Tendency Detection Based on Image and Text Fusion | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
CN115906824A (en) | Text fine-grained emotion analysis method, system, medium and computing equipment | |
CN116049349A (en) | Small sample intention recognition method based on multi-level attention and hierarchical category characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |