CN115269833A - Event information extraction method and system based on deep semantics and multitask learning - Google Patents
Info
- Publication number
- CN115269833A (application CN202210760202.0A)
- Authority
- CN
- China
- Prior art keywords
- event
- text
- emergency
- classification
- module
- Prior art date
- 2022-06-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an event information extraction method and system based on deep semantics and multitask learning, belonging to the field of text information extraction. To overcome the low accuracy and low recall of existing event information extraction technology, the invention uses a pre-trained language model to build vector representations of an article at several granularities (document level, passage level, sentence level, and word level) and obtains the main information of an event by performing event classification, event argument extraction, and keyword extraction in sequence. The method achieves very high accuracy on all three tasks: event classification, event argument extraction, and keyword extraction.
Description
Technical Field
The invention belongs to the field of text information extraction and relates to techniques for extracting keywords, event classes, and event arguments from document-level data, and in particular to an event information extraction method and system based on deep semantic representation and multi-task learning.
Background
Text classification is widely used in natural language processing, for example in identifying and filtering harmful or spam content, sequence labeling, sentiment classification, multiple-choice question answering, and topic classification. A text classification task generally proceeds in three steps: 1. text preprocessing; 2. text representation and feature selection; 3. classification. Event classification typically requires assigning an event to one or several categories within a given taxonomy. By the number of target categories, event classification divides into binary and multi-class problems; by model type, classifiers divide into shallow learning models and deep learning models.
Deep learning methods learn text representations or features from data automatically and outperform traditional statistics-based machine learning models. Representative methods include:
a. Recursive neural network (ReNN) based methods: by predicting the label probability distribution of each input sentence, vector representations of multi-word phrases can be learned;
b. Multi-layer perceptron (MLP) based methods: for example, the widely used Paragraph Vector (paragraph-vec) model represents each paragraph as a vector, which can serve both as a store of the paragraph's topic and as input for a downstream classifier's predictions.
c. Recurrent neural network (RNN) based methods: RNNs can capture long-range dependencies in text. An RNN language model learns historical information, taking into account the positional relations among all the words relevant to the classification task. Each input word is represented by a vector obtained with word-embedding techniques, and the word vectors are fed into the RNN unit one by one. The output of the RNN unit has the same dimension as the input vector and is fed into the next hidden layer. The RNN shares its parameters across time steps, so every input vector is weighted identically. Once the entire input text has passed through the RNN, the label is predicted from the final output of the hidden layer.
d. Convolutional neural network (CNN) based methods: CNNs were designed for image classification, where convolutional filters extract image features. Unlike RNNs, CNNs can apply convolution kernels of different sizes to a text sequence and thereby model features of different granularities, so CNNs are used for many NLP tasks, including text classification. For text classification, the text must first be represented as a vector grid analogous to an image: the word vectors of the input text are stacked into a matrix, which is fed into a convolutional layer containing several filters of different sizes; pooling and fully connected layers then turn the convolution outputs into a final vector representation of the text and produce the predicted label. A representative model is TextCNN, proposed by Y. Kim (2014), which uses static word vectors and learns only the parameters inside the convolutional network (a minimal sketch of this architecture appears after this list of methods).
e. Attention-based methods: CNNs and RNNs achieve good results on text classification, but they are poorly interpretable, especially when classification errors must be explained, because neural network parameters are not human-readable. More interpretable attention-based methods have therefore been applied to text classification: the attention mechanism makes visible which parts of the text most influenced the classification.
f. Transformer-based methods: pre-trained language model technology has developed rapidly in recent years. Pre-trained language models are generally learned with unsupervised objectives; by constructing different pre-training tasks, the models learn global semantic representations more effectively and markedly improve NLP tasks, including classification. The typical pre-trained models share a common recipe: unsupervised pre-training on a large-scale corpus to learn linguistic knowledge, then adapting the large model to the downstream classification task, which greatly improves classification performance.
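Returning to method d above, the following is a minimal TextCNN sketch in PyTorch: frozen (static) word vectors, parallel convolutions over the embedded sequence, max-pooling over time, and a fully connected classifier. All dimensions, the vocabulary size, and the class count are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=4,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        # Static word vectors: frozen, so only the conv/linear layers train.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.requires_grad = False
        # One convolution per kernel size, each modeling a different
        # n-gram granularity of the text.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed, seq)
        pooled = [torch.max(torch.relu(conv(x)), dim=2).values
                  for conv in self.convs]               # max-pool over time
        return self.fc(torch.cat(pooled, dim=1))        # (batch, num_classes)

model = TextCNN()
logits = model(torch.randint(0, 10000, (2, 50)))        # two 50-token texts
```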
The keyword (topic word) extraction task is to extract from an article several words that express its content and semantics. For Chinese keyword extraction, no large-scale, high-quality Chinese topic-word corpus currently exists. The traditional TF-IDF (Term Frequency-Inverse Document Frequency) method can recover some keywords from a text collection, but its extraction precision and recall are low; keywords extracted with the topic model LDA (Latent Dirichlet Allocation) are likewise of poor quality. Moreover, because the invention constructs its own small dataset, traditional methods would acquire stronger biases on such a small dataset and degrade extraction quality. The invention therefore recasts topic-word extraction as a similarity computation problem: vector representations of words are obtained from a pre-trained language model, whose strengths yield more accurate and richer word semantics, and the topic words that best express the meaning of the article are extracted from it.
The event extraction task uses natural language processing to identify events of specific categories in unstructured text, determine and extract the relevant information, and present the events to the user in structured form. Event extraction generally divides into two subtasks: first, event detection and classification, i.e., judging the category of an event; second, event element extraction, i.e., extracting and classifying the key elements (arguments) involved in the event. Most existing event extraction work targets the public ACE2005 and KBP2015 corpora. ACE2005 defines 8 event categories (life, movement, conflict, contact, etc.) and 33 subcategories (be-born, marry, injure, transport, attack, etc.); for argument classification, events of different categories involve different argument categories, 35 in total (agent, person, time, place, etc.).
By method, existing event extraction work divides into pattern-based and machine-learning-based approaches; by subtask coupling, into pipeline learning and joint learning; and by the level of text addressed, into sentence-level and document-level event extraction.
Traditional event extraction methods are mostly pattern-based, with patterns split into syntactic patterns and semantic patterns. The central idea of syntactic patterns is to extract events through the grammatical components of sentences; typically, the syntactic relations between trigger words and event arguments drive the extraction. Semantic patterns identify events mainly through the semantic relations between events and arguments, for example completing event argument detection via the description relations of an ontology knowledge base. Pattern-based methods are well interpretable, but they require extensive expert knowledge and transfer poorly across domains.
With growing computing power and the appearance of high-quality datasets, representation learning with deep neural networks has become the mainstream approach. As language modeling has drawn increasing attention in recent years, pre-trained language models have been used to boost downstream tasks, and models such as BERT and ELMo work well on event extraction.
Multi-task learning is a deep learning method for several related but distinct tasks: during learning, different tasks share part of the model's parameters, which reduces the risk of overfitting, enhances generalization, and improves performance on several target tasks simultaneously. What the tasks share is a shallow representation of the input; by parameter sharing scheme, multi-task learning divides into hard sharing and soft sharing mechanisms.
Hard parameter sharing means the bottom feature-extraction parameters of the several task models are fully shared, while the top parameters directly connected to each task remain independent across models; soft parameter sharing means the bottom parameters of the task models are neither fully shared nor fully independent, but are fused by some method before being fed into the top models, achieving the goal of multi-task learning. A minimal sketch of hard sharing follows.
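The following is a minimal hard-parameter-sharing sketch in PyTorch: two task heads sit on one fully shared bottom encoder. The encoder here is a stand-in (a simple bag of embeddings) assumed for illustration; in the invention the shared bottom is BERT. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SharedBottomModel(nn.Module):
    def __init__(self, vocab_size=10000, hidden=128,
                 n_event_classes=4, n_tag_classes=5):
        super().__init__()
        # Fully shared bottom: a shallow representation of the input.
        self.shared = nn.Sequential(nn.Embedding(vocab_size, hidden),
                                    nn.Dropout(0.1))
        # Top-layer parameters directly connected to each task stay
        # independent per task.
        self.event_head = nn.Linear(hidden, n_event_classes)  # classification
        self.tag_head = nn.Linear(hidden, n_tag_classes)      # sequence labels

    def forward(self, token_ids):
        h = self.shared(token_ids)      # (batch, seq, hidden), shared by all
        event_logits = self.event_head(h.mean(dim=1))  # one label per text
        tag_logits = self.tag_head(h)                  # one label per token
        return event_logits, tag_logits

model = SharedBottomModel()
event_logits, tag_logits = model(torch.randint(0, 10000, (2, 30)))
```

Training both heads through the shared bottom in the same backward pass is what lets the tasks regularize each other.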
Disclosure of Invention
To overcome the low accuracy and low recall of existing event information extraction technology (covering the event classification, event extraction, and keyword extraction tasks), the invention aims to provide an event information extraction method and system based on deep semantic representation and a multi-task learning model.
To this end, the invention adopts the following technical scheme:
An event information extraction method based on deep semantics and multitask learning, comprising the following steps:
constructing an event classification module, which segments the emergency text into paragraphs, obtains a vector representation of each paragraph with the pre-trained language model BERT, fuses the paragraph representations, and obtains the event category through a linear classifier;
constructing an event argument extraction module, which uses BERT to obtain a label sequence over the characters of the emergency text, models the relations between labels with a conditional random field (CRF), decodes the entity labels with the Viterbi algorithm, and classifies the entities with a linear classifier to obtain the corresponding event arguments and argument categories;
constructing a keyword extraction module, which obtains candidate topic words from the emergency text, obtains vector representations of the sentences and candidate topic words through BERT, clusters the vector representations into several semantic centers, finds the candidate topic words whose vectors lie closest to the semantic centers, and extracts keywords according to the similarity between these candidates and the emergency text;
training the event classification module, event argument extraction module, and keyword extraction module on emergency texts manually annotated with the corresponding event categories, argument categories, and keywords; after training, the three modules process the emergency text to be analyzed, yielding the event category, event arguments, and keywords as the extracted event information.
Further, the emergency text is data-cleaned in advance before being input into the three modules.
Further, in the event classification module, the last hidden vector of each paragraph's initial [CLS] token is obtained as the paragraph's vector representation.
Further, in the event classification module, the paragraph vector representations are averaged and fused into a single vector representation.
Further, in the event classification module, a classification result is obtained for each paragraph through the linear classifier; the results of all paragraphs are then tallied and voted, and the class with the most votes is taken as the event category of the emergency text.
Further, in the keyword extraction module, stop words in the input emergency text are removed by regular expressions during data preprocessing.
Further, in the keyword extraction module, if the emergency text exceeds the maximum input length of BERT, the text is split into several passages or sentences.
Further, in the keyword extraction module, clustering uses the K-means clustering algorithm.
Further, in the keyword extraction module, the similarity between a candidate topic word and the emergency text is the cosine similarity between their vector representations, evaluated by computing the cosine of the angle between the two vectors.
Further, when the three modules are trained, loss functions for event classification, event argument extraction, and event keyword extraction are computed separately; a hard parameter sharing mechanism shares the parameters of the bottom BERT model, and the total loss function is computed with adjustable weights over the three losses.
An event information extraction system based on deep semantics and multitask learning, comprising:
an event classification module, which segments the emergency text into paragraphs, obtains a vector representation of each paragraph with the pre-trained language model BERT, fuses the paragraph representations, and obtains the event category through a linear classifier;
an event argument extraction module, which uses BERT to obtain a label sequence over the input text sequence of the emergency text, models the relations between labels with a conditional random field (CRF), obtains the entity labels with the Viterbi algorithm, and classifies the entities with a linear classifier to obtain the corresponding event arguments and argument categories;
a keyword extraction module, which obtains candidate topic words from the emergency text, obtains vector representations of the sentences and candidate topic words through BERT, clusters the vector representations into several semantic centers, finds the candidate topic words whose vectors lie closest to the semantic centers, and extracts keywords according to the similarity between these candidates and the emergency text;
wherein the event classification module, event argument extraction module, and keyword extraction module are trained on emergency texts manually annotated with the corresponding event categories, argument categories, and keywords, and after training they process the emergency text to be analyzed, yielding the event category, event arguments, and keywords as the extracted event information.
The method uses a pre-trained language model to build vector representations of an article at several granularities (document level, passage level, sentence level, and word level) and obtains the main information of an event by performing event classification, event argument extraction, and keyword extraction in sequence. The embodiments verify that the method achieves very high accuracy on event classification, event argument extraction, and keyword extraction, demonstrating the effectiveness of the invention's technical scheme.
Drawings
Fig. 1 is a schematic flow chart of an event information extraction method in an embodiment of the present invention.
FIG. 2 is a schematic diagram of the multi-task learning process during the module training phase.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
This embodiment discloses an event information extraction method based on deep semantics and multi-task learning, which, as shown in Fig. 1, comprises the following steps:
1) First, collect report texts on emergencies from online news or social media and clean them, removing mojibake, illegal characters, repeated characters, irrelevant non-Chinese content, and irrelevant page content (e.g., advertisements) from the raw text;
2) For event classification, segment the full text containing the event into paragraphs, with paragraph lengths adapted to the model's training batch size and the machine's GPU memory; obtain the vector representation of each paragraph with BERT, and integrate the representations of all paragraphs before feeding the classification layer that classifies the emergency;
3) For event argument extraction, obtain vector representations of the text's characters from BERT and model the relations between different event argument labels with a conditional random field: attach a CRF layer after BERT to classify the characters' labels, decode with the Viterbi algorithm, and extract the event arguments (persons, times, places, trigger words, etc.) in the text;
4) Extract keywords by similarity computation. Taking the vector representation of the article as a whole as the semantic center, compute the similarity between the BERT vector representations of the article's words and this overall semantic center, select the several words with the highest similarity according to a threshold, and add them to the article's keyword list;
5) Attach the event classification module, event argument extraction module, and keyword extraction module after the pre-trained model, operating on the text's vector representations, for multi-task learning. The event classification module judges the event type of the emergency text (e.g., natural disaster, social security, liability accident, or public health); the event argument extraction module extracts the arguments in the text (persons, times, places, trigger words); and the keyword extraction module extracts the text's topic words into a keyword list, yielding the overall analysis of the event.
The method is specifically described as follows:
1) The method uses BERT and a linear classifier to perform four-way event classification.
2) The method extracts event arguments from the text using BERT, a linear classifier, and Viterbi decoding.
3) The method obtains the text's keywords using the K-means clustering algorithm and cosine similarity computation.
4) The method uses multi-task learning with a hard parameter sharing mechanism to learn the above 3 tasks jointly and strengthen the modules' performance on all 3 tasks.
5) To verify the event analysis effect of the method, this embodiment constructs a dataset covering 4 classes of emergencies, detailed in Table 1 below:
Table 1. Dataset statistics
Category | Total | Natural disaster | Social security | Liability accident | Public health
---|---|---|---|---|---
Count | 1132 | 415 | 348 | 200 | 169
The method targets information extraction for emergencies. Its 3 information extraction subtasks are carried out by an event information extraction system based on deep semantics and multi-task learning, which comprises three corresponding modules for event classification, event argument extraction, and keyword extraction; during training, the three modules use multi-task learning to strengthen the model's overall performance. The system also comprises a preprocessing module that cleans and preprocesses the collected emergency text.
Module 1: the event classification module. For the multi-class event classification task, BERT converts the words of the text into vector representations, and the segment-initial [CLS] symbol yields the overall vector representation of the event text; because this vector integrates the semantic information of the whole article, it can serve as input to a downstream classifier and improve classification accuracy. In this dataset, article lengths range from around 50 characters to over 1000, spanning two orders of magnitude. Since BERT limits its input to 512 tokens, long texts cannot be processed directly. Because the classification this module performs is a topic classification problem, the long-range context of an article has little influence on its category; classification depends mainly on whether certain events and keywords belonging to a category appear, so the inventors judge that segmenting an article and then aggregating the segment representations does not appreciably hurt BERT's performance. So that the module can handle texts of any length, the whole article is first split into small paragraphs of about 50 characters; each paragraph is fed to the model, and the last hidden vector of its initial [CLS] token is taken as the paragraph's vector representation. The paragraph vectors are then averaged and fused into a single vector, which serves as the vector representation of the whole article and is fed into the model's downstream classifier to obtain the classification result. A sketch of this procedure follows.
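The following sketch shows the segment-and-aggregate classification just described, using the Hugging Face transformers library. The checkpoint name "bert-base-chinese", the per-segment max length, and the four-class head are assumptions for illustration; the patent specifies only BERT and ~50-character paragraphs.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed model
bert = BertModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(bert.config.hidden_size, 4)  # 4 emergency classes

def classify_article(text, seg_len=50):
    # 1. Split the article into ~50-character paragraphs, each well under
    #    BERT's 512-token input limit.
    segments = [text[i:i + seg_len] for i in range(0, len(text), seg_len)]
    enc = tokenizer(segments, padding=True, truncation=True,
                    max_length=64, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    # 2. Last hidden vector of each segment's initial [CLS] token.
    cls_vectors = out.last_hidden_state[:, 0, :]      # (n_segments, hidden)
    # 3. Average the segment vectors into one article representation.
    article_vec = cls_vectors.mean(dim=0, keepdim=True)
    # 4. Linear classifier over the fused representation.
    return classifier(article_vec).argmax(dim=-1).item()
```

The per-paragraph voting variant described above would instead apply the classifier to each row of cls_vectors and take the majority class.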
This key technique effectively solves the representation of texts of different lengths through BERT-based segment-and-aggregate representation, handling topic classification of texts of arbitrary length without any truncation; and applying BERT transfers the linguistic representation knowledge of large-scale unlabeled corpora to the small-scale labeled data, achieving a good classification effect on a small dataset, as shown in Table 2 below:
Table 2. Text event classification test results
Category | All data | Natural disaster | Social security | Liability accident | Public health
---|---|---|---|---|---
Accuracy | 86.31% | 88.24% | 73.68% | 83.33% | 100.00%
Recall | 85.94% | 93.75% | 87.50% | 62.50% | 100.00%
Module 2: the event argument extraction module. Vector representations of the characters are first obtained with BERT; because an entity usually comprises two or more characters, the relations between an entity's characters must also be established, which this module does by modeling the relations between the different labels with a conditional random field. Here, "characters" means the characters of the original text. As for the labels, in this embodiment the model assigns each character of the original text a label under the BIESO scheme, where B denotes the beginning of an entity, I the middle of an entity, E the end of an entity, S a single-character entity, and O everything else. For example, for the sentence "小明看了奥运会" ("Xiaoming watched the Olympic Games"), the label sequence that should be generated is "B E S O B I E": "小明" (B E) is a person argument, "看" (S) is the event trigger word, "奥运会" (B I E) is an event argument, and "了" (O) is other. Legal contiguous label runs such as "小明" (B E), "看" (S), and "奥运会" (B I E) are called entities; a subsequent classifier then identifies each entity's argument category, such as "person", "time", or "place", and classifies entities the event does not need as "other". These labels obey internal rules: by the label definitions, BIE, BIIE, BE, O, and S are legal label runs, while BBE, IE, and the like are illegal, yet the labels a neural network generates do not necessarily respect these rules. A CRF network and the Viterbi algorithm are therefore used to decode the optimal label sequence, so that the labels conform to both the semantics and the labeling rules. A sketch of reading entity spans out of a decoded label sequence follows.
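The following is a small sketch of extracting entity spans from a legal BIESO label sequence, run on the example sentence above; the extracted spans would then go to the argument classifier. The helper name is illustrative.

```python
def bieso_spans(chars, labels):
    """Yield the text of every legal B..E or S entity span."""
    entity = []
    for ch, tag in zip(chars, labels):
        if tag == "S":                      # single-character entity
            yield ch
            entity = []
        elif tag == "B":                    # entity begins
            entity = [ch]
        elif tag == "I" and entity:         # entity continues
            entity.append(ch)
        elif tag == "E" and entity:         # entity ends
            entity.append(ch)
            yield "".join(entity)
            entity = []
        else:                               # "O" or an illegal transition
            entity = []

chars = list("小明看了奥运会")
labels = ["B", "E", "S", "O", "B", "I", "E"]
print(list(bieso_spans(chars, labels)))     # ['小明', '看', '奥运会']
```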
The conditional random field is formulated as follows:

$$\mathrm{score}(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where score(X, y) is the total path score of label sequence y given text sequence X, $A_{y_i, y_{i+1}}$ is the transition score from state (i.e., label) $y_i$ to state $y_{i+1}$, $P_{i, y_i}$ is the generation score of the state at position i, and n is the number of states. The text sequence X is the original text, e.g., X = "小明看了奥运会" ("Xiaoming watched the Olympic Games"), and y is the label sequence the neural network (BERT) assigns to X, e.g., "B E S O B I E". The label changes "B → E → S → O → B → I → E" along sequence y are modeled as a state transition process: the generation of a state is driven mainly by the semantics recognized by the upstream neural network, which supplies the generation probability, while state transitions are driven mainly by the labeling rules the CRF model captures, which supply the transition probabilities; both contribute to the overall score of a labeled sequence. Because the label sequence proposed by the neural network may be wrong, path scores are assigned to the different candidate sequences and the correct label sequence is computed by Viterbi decoding. Regarding transition scores: under BIESO, for example, transitioning state B to state I or E is reasonable, whereas transitioning B to O or S is illegal, so the corresponding transition score will be low.
The label sequence is decoded with the Viterbi algorithm, a dynamic programming algorithm that finds the path most likely to have generated the observed sequence, yielding the final label sequence. Once the correct label sequence is obtained, a classifier classifies the entity labels into argument categories. A minimal decoding sketch follows.
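The following is a minimal Viterbi decoding sketch consistent with the score(X, y) formula above: emissions[i][t] plays the role of the generation score $P_{i,t}$ and trans[s][t] the transition score $A_{s,t}$; both arrays are assumed to come from the BERT and CRF layers.

```python
import numpy as np

def viterbi(emissions, trans):
    """emissions: (n, t) generation scores; trans: (t, t) transition scores."""
    n, t = emissions.shape
    score = emissions[0].copy()            # best path score ending in each tag
    back = np.zeros((n, t), dtype=int)     # backpointers
    for i in range(1, n):
        # Score of extending every previous tag s to every current tag.
        total = score[:, None] + trans + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Follow backpointers from the best final tag.
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]                      # optimal tag index sequence
```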
On the test set, event argument extraction achieves an accuracy above 90%.
Module 3: the keyword extraction module. Natural Chinese text contains many redundant stop words, usually of two kinds, function words and certain lexical words (e.g., "in general", "because"); these appear very commonly in text, carry no real meaning compared with other words, and are not the topic words this module aims to extract, and too many stop words lower topic-word extraction accuracy. This embodiment therefore first performs data preprocessing, removing stop words with regular expressions to improve topic-word extraction accuracy, and then builds a candidate topic-word list through text feature extraction. BERT is then used to obtain semantic representations of the sentences and the candidate topic-word texts, yielding semantic similarity and paraphrase identification between different topic words. For longer articles, a complete semantic vector representation cannot be obtained from BERT directly (BERT limits its input length). As in the event classification module, this module splits the text into several passages or sentences, obtains a vector representation of each, clusters the vectors with the K-means algorithm into K semantic centers, and then runs the following similarity computation on the several candidate topic words whose vector representations lie closest to the K semantic centers.
To compute the similarity between a candidate topic word and the original text, this embodiment uses the cosine similarity between vectors, evaluating similarity as the cosine of the angle between the two vectors, and finally obtains the topic words, i.e., keywords, that best express the text's content. The formula is:

$$\mathrm{similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

where similarity(u, v) denotes the cosine similarity of vectors u and v. A sketch of the clustering and similarity ranking follows.
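The following sketch combines the K-means clustering and cosine ranking just described, assuming the passage vectors and candidate-word vectors have already been produced by BERT as in the classification module sketch; the function names, cluster count, and top-n cut-off are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine(u, v):
    # similarity(u, v) = u.v / (|u||v|), as in the formula above
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def extract_keywords(segment_vecs, candidates, candidate_vecs, k=3, top_n=5):
    # Cluster the passage/sentence representations into k semantic centers.
    centers = KMeans(n_clusters=k, n_init=10).fit(segment_vecs).cluster_centers_
    # Score each candidate topic word by its best similarity to any center,
    # then keep the highest-scoring candidates as the keywords.
    scored = sorted(((max(cosine(vec, c) for c in centers), word)
                     for word, vec in zip(candidates, candidate_vecs)),
                    reverse=True)
    return [word for _, word in scored[:top_n]]
```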
On the test set, event keyword extraction achieves an accuracy above 80%.
Multi-task learning: as shown in Fig. 2, the three modules above are trained with multi-task learning, and the loss function of each task is designed as a cross-entropy:

$$J_{class} = -\sum_{i=1}^{M} y_i \log_2 \hat{y}_i, \qquad J_{arg} = -\sum_{j=1}^{N} y_j \log_2 \hat{y}_j, \qquad J_{key} = -\sum_{k=1}^{K} y_k \log_2 \hat{y}_k$$

where $J_{class}$, $J_{arg}$, and $J_{key}$ denote the loss functions of the 3 tasks of event classification, event argument extraction, and event keyword extraction; M, N, and K denote the total numbers of event categories, arguments, and keywords; $y$ denotes the true labels of the data and $\hat{y}$ the labels the modules predict (the logarithms above are base 2). A hard parameter sharing mechanism lets the 3 tasks share the parameters of the bottom BERT model, and the total loss function is designed as:

$$J = \lambda J_{class} + \beta J_{arg} + (1 - \lambda - \beta) J_{key}$$
where J is the total loss function, and λ and β are hyperparameters weighing each task's loss, tuned on the validation set. A sketch of computing this combined loss follows.
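The following sketch computes the total loss $J = \lambda J_{class} + \beta J_{arg} + (1-\lambda-\beta) J_{key}$ in PyTorch, under the assumption (consistent with the formulas above) that all three per-task losses are cross-entropies over logits from the shared BERT bottom; the λ and β values shown are illustrative, not tuned values from the patent.

```python
import torch.nn.functional as F

def total_loss(class_logits, class_gold,   # (batch, M) and (batch,)
               arg_logits, arg_gold,       # (batch, seq, N) and (batch, seq)
               key_logits, key_gold,       # (batch, K) and (batch,)
               lam=0.4, beta=0.4):
    j_class = F.cross_entropy(class_logits, class_gold)
    # Flatten the sequence dimension for the per-character argument labels.
    j_arg = F.cross_entropy(arg_logits.flatten(0, 1), arg_gold.flatten())
    j_key = F.cross_entropy(key_logits, key_gold)
    # One backward pass through this weighted sum updates the shared BERT
    # parameters for all three tasks at once (hard parameter sharing).
    return lam * j_class + beta * j_arg + (1 - lam - beta) * j_key
```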
The embodiments above demonstrate the effectiveness of the method on the emergency information extraction task, with good results on all three tasks: emergency classification, event argument extraction, and keyword extraction.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. An event information extraction method based on deep semantics and multitask learning, characterized by comprising the following steps:
constructing an event classification module, which segments the emergency text into paragraphs, obtains a vector representation of each paragraph with the pre-trained language model BERT, fuses the paragraph representations, and obtains the event category through a linear classifier;
constructing an event argument extraction module, which uses BERT to obtain a label sequence over the characters of the emergency text, models the relations between labels with a conditional random field (CRF), decodes the entity labels with the Viterbi algorithm, and classifies the entities with a linear classifier to obtain the corresponding event arguments and argument categories;
constructing a keyword extraction module, which obtains candidate topic words from the emergency text, obtains vector representations of the sentences and candidate topic words through BERT, clusters the vector representations into several semantic centers, finds the candidate topic words whose vectors lie closest to the semantic centers, and extracts keywords according to the similarity between these candidates and the emergency text;
training the event classification module, event argument extraction module, and keyword extraction module on emergency texts manually annotated with the corresponding event categories, argument categories, and keywords; after training, the three modules process the emergency text to be analyzed, yielding the event category, event arguments, and keywords as the extracted event information.
2. The method of claim 1, wherein the emergency text is data-cleaned in advance before being input into the three modules.
3. The method of claim 1, wherein in the event classification module, the last hidden vector of each paragraph's initial [CLS] token is obtained as the paragraph's vector representation, and the paragraph vector representations are averaged and fused into a single vector representation.
4. The method of claim 1, wherein in the event classification module, a classification result is obtained for each paragraph through the linear classifier, the classification results of all paragraphs are then tallied and voted, and the class with the most votes is taken as the event category of the emergency text.
5. The method of claim 1, wherein in the keyword extraction module, stop words in the input emergency text are removed by regular expressions during data preprocessing.
6. The method of claim 1, wherein in the keyword extraction module, clustering uses the K-means clustering algorithm.
7. The method of claim 1, wherein in the keyword extraction module, if the emergency text exceeds the maximum input length of BERT, the text is split into several passages or sentences.
8. The method of claim 1, wherein in the keyword extraction module, the similarity between a candidate topic word and the emergency text is the cosine similarity between their vector representations, evaluated by computing the cosine of the angle between the two vectors.
9. The method of claim 1, wherein when the three modules are trained, loss functions for event classification, event argument extraction, and event keyword extraction are computed separately, a hard parameter sharing mechanism shares the parameters of the bottom BERT model, and the total loss function is computed with adjustable weights over the three losses.
10. An event information extraction system based on deep semantics and multitask learning, characterized by comprising:
an event classification module, which segments the emergency text into paragraphs, obtains a vector representation of each paragraph with the pre-trained language model BERT, fuses the paragraph representations, and obtains the event category through a linear classifier;
an event argument extraction module, which uses BERT to obtain a label sequence over the input text sequence of the emergency text, models the relations between labels with a conditional random field (CRF), obtains the entity labels with the Viterbi algorithm, and classifies the entities with a linear classifier to obtain the corresponding event arguments and argument categories;
a keyword extraction module, which obtains candidate topic words from the emergency text, obtains vector representations of the sentences and candidate topic words through BERT, clusters the vector representations into several semantic centers, finds the candidate topic words whose vectors lie closest to the semantic centers, and extracts keywords according to the similarity between these candidates and the emergency text;
wherein the event classification module, event argument extraction module, and keyword extraction module are trained on emergency texts manually annotated with the corresponding event categories, argument categories, and keywords, and after training they process the emergency text to be analyzed, yielding the event category, event arguments, and keywords as the extracted event information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210760202.0A CN115269833B (en) | 2022-06-29 | 2022-06-29 | Event information extraction method and system based on deep semantics and multi-task learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210760202.0A CN115269833B (en) | 2022-06-29 | 2022-06-29 | Event information extraction method and system based on deep semantics and multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115269833A (en) | 2022-11-01
CN115269833B CN115269833B (en) | 2024-08-16 |
Family
ID=83763404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210760202.0A Active CN115269833B (en) | 2022-06-29 | 2022-06-29 | Event information extraction method and system based on deep semantics and multi-task learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115269833B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361037A (en) * | 2014-10-29 | 2015-02-18 | 国家计算机网络与信息安全管理中心 | Microblog classifying method and device |
CN110059181A (en) * | 2019-03-18 | 2019-07-26 | 中国科学院自动化研究所 | Short text stamp methods, system, device towards extensive classification system |
CN111797241A (en) * | 2020-06-17 | 2020-10-20 | 北京北大软件工程股份有限公司 | Event argument extraction method and device based on reinforcement learning |
CN113779227A (en) * | 2021-11-12 | 2021-12-10 | 成都数之联科技有限公司 | Case fact extraction method, system, device and medium |
CN114661881A (en) * | 2022-03-30 | 2022-06-24 | 中国科学院空天信息创新研究院 | Event extraction method, device and equipment based on question-answering mode |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115983255B (en) * | 2023-03-21 | 2023-06-02 | 深圳市万物云科技有限公司 | Emergency management method, device, computer equipment and storage medium |
CN117390131A (en) * | 2023-07-04 | 2024-01-12 | 无锡学院 | Text emotion classification method for multiple fields |
Also Published As
Publication number | Publication date |
---|---|
CN115269833B (en) | 2024-08-16 |
Legal Events
Date | Code | Title | Description
---|---|---|---
2022-11-01 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
2024-08-16 | GR01 | Patent grant | |