CN113901813A - Event extraction method based on topic features and implicit sentence structure - Google Patents

Event extraction method based on topic features and implicit sentence structure

Info

Publication number
CN113901813A
CN113901813A
Authority
CN
China
Prior art keywords
word
event
sentence
sequence
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111178364.5A
Other languages
Chinese (zh)
Inventor
黄婉华
漆桂林
高桓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202111178364.5A priority Critical patent/CN113901813A/en
Publication of CN113901813A publication Critical patent/CN113901813A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

The invention discloses an event extraction method based on topic features and implicit sentence structure, which presents unstructured text containing event information in a structured form and has wide application in automatic summarization, automatic question answering, information retrieval, and other fields. The method first introduces document-level topic information into a sentence-level event extraction model, obtaining the topic features of each document by combining BERT and LDA. Second, it extracts the syntactic information hidden in BERT word embeddings and jointly models this extraction process with event extraction, thereby introducing important syntactic information for event extraction while avoiding the problem of error accumulation. Finally, using a sequence labeling method based on Bi-LSTM and a cascaded CRF, the model extracts multiple trigger words in a single sentence and extracts the element roles an entity plays in multiple events.

Description

Event extraction method based on topic features and implicit sentence structure
Technical Field
The invention belongs to the field of information extraction, and relates to an event extraction method based on topic features and an implicit sentence structure.
Background
With the development and popularization of the internet, millions of data sources are published every day in the form of news articles, blogs, papers, and the like, and more and more empirical knowledge is stored in documents. Because traditional ways of storing knowledge make retrieval inefficient, how to manage and exploit these data has gradually become a core problem in natural language processing. Research has shown that structured storage can effectively improve people's ability to retrieve and collect empirical knowledge. For machines to better understand human language, the techniques for automatically organizing and processing data studied under the information extraction task have become indispensable. The basic goal of information extraction is to automatically extract information from unstructured or semi-structured machine-readable documents and other electronic sources and store it in a structured form, enabling the organization, management, and analysis of the large amount of textual information on the internet.
Event extraction is one of the core tasks of information extraction. Its main aim is to extract structured event information from unstructured text, and it plays an important role in information retrieval and in the construction of event graphs. Existing event extraction methods can be broadly divided into pipeline methods and joint methods. Pipeline methods suffer from error accumulation, so most recent work adopts joint methods. However, most sentence-level joint methods lack the overall information of the text and therefore cannot handle the ambiguity of trigger words well, while document-level joint methods suffer from complicated modeling. In addition, because the relationships between event trigger words and event elements within a sentence are close, the event extraction task depends on syntactic features; yet only a few methods introduce syntactic information into event extraction, and their reliance on pre-trained parsing tools still causes error accumulation. Finally, in the relevant data sets and in real-world applications it is quite common for one sentence to contain multiple events or for event elements to overlap, but most methods consider only a single event and a single element role, losing a large amount of event information.
To address these problems, the invention proposes a joint event extraction method based on topic features and implicit sentence structure. The method first introduces document-level topic information into a sentence-level event extraction model by combining BERT and LDA, alleviating the ambiguity of trigger words. Second, it extracts the syntactic information hidden in BERT word embeddings and jointly models this extraction process with event extraction, which introduces important syntactic information while avoiding error accumulation. Finally, the model can extract multiple trigger words in a single sentence and extract the element roles an entity plays in multiple events, solving the problems of multiple events and overlapping event elements. With the advantages of topic features, implicit syntactic features, and joint modeling, the constructed event extraction method avoids error accumulation while introducing topic features and implicit sentence structure information, can effectively improve the quality of event extraction, and is of great research significance.
Disclosure of Invention
The invention provides a joint event extraction method. For the ambiguity problem of trigger words, semantic structure information of the trigger word is obtained from the representation of its sentence on one hand, and a topic distribution representation is obtained through topic modeling on the other, introducing the overall context information of the document into event extraction to disambiguate trigger words. For the error accumulation that introducing syntactic features may cause, a method is studied for extracting the sentence structure information hidden in BERT word embeddings and building a joint model with event extraction, so that syntactic information is introduced while the influence of error accumulation is avoided. For the problems of multiple events and overlapping event elements, the model of the invention can identify multiple events in a single sentence and determine the element role a candidate entity plays in multiple events. Together these methods address the above challenges and enhance the effectiveness of event extraction.
The invention uses the pre-trained language model BERT to extract implicit sentence structure features and applies them in joint extraction with the sub-tasks of event extraction. First, the sentence structure information implied in the BERT output is extracted; next, event trigger words are extracted in cascade with a CRF model; then, the implicit sentence structure information is introduced into event element extraction with a Bi-LSTM model; finally, a loss function for joint training is defined and all tasks are optimized jointly to learn the optimal parameters of the model.
An event extraction method based on topic features and implicit sentence structure comprises the following steps:
1) data processing and topic feature extraction: reconstruct the original data set into a format suitable for the model of the invention, extract the topic features of each sample document in the read data set, and then segment the sample documents into sentences with the sentence segmentation tool in the NLTK package to obtain sample sentences;
2) implicit sentence structure extraction: for each sample sentence, first use the language model BERT to obtain the word embeddings in the sentence as its contextual features; then, using a masking mechanism, compute from the word embedding sequence the degree of mutual influence between all components of the sentence, as the implicit sentence structure features for the subsequent joint event extraction method;
3) event trigger word extraction based on a cascaded CRF: a cascaded sequence labeling method decomposes the extraction task into two sub-tasks, boundary labeling and type judgment;
4) event element extraction with syntactic information fused into a Bi-LSTM: the data of the influence matrix are introduced into the forward and backward recursions, and corresponding links are established between the current word node and strongly related word nodes, so that the syntactic information can be transmitted between LSTM nodes and is finally fused into the vector representations of words;
5) joint training: compute the losses of the event trigger word extraction module and the event element extraction module with a cross-entropy loss function, and train event trigger word extraction and event element extraction jointly to avoid error accumulation; so that the loss terms of the two sub-tasks converge simultaneously, the final loss is represented by the sum of the losses of the two sub-tasks.
In a preferred embodiment of the topic feature extraction of the invention, in step 1) the topic features are extracted as follows:
1-1) obtain a context representation carrying contextual semantic information for each document using a Sentence-Transformer oriented to long-sentence encoding;
1-2) then obtain the topic distribution information of each document using the topic model LDA;
1-3) train an auto-encoder with the two vectors to fuse them, taking the output of the encoder as the topic feature of each document.
In a preferred embodiment of the implicit sentence structure extraction of the invention, in step 2) the implicit sentence structure is extracted as follows:
2-1) replace any word x_i of the input sequence with the masking character [MASK] to obtain a new input sequence; input this sequence into BERT to obtain the result h_i, and take h_i as the representation of x_i;
2-2) to obtain the influence of another component x_j of the sentence on x_i, further replace x_j in the input sequence with [MASK] as well, then input the sequence into BERT to obtain a new representation H_ij of x_i;
2-3) compute the distance f(x_i, x_j) between H_ij and h_i in semantic space using the Euclidean distance, finally obtaining the pairwise influence degree matrix F between the components of the sentence; the matrix F is the implicit sentence structure information and can represent the degree of mutual influence between any two sentence components.
in the preferred embodiment of the event trigger extraction of the present invention, in the step 3), the event trigger is extracted according to the following specific steps:
3-1) using a BERT model to perform word segmentation and vectorization on the input sequence, aligning the input sequence with the original label sequence, removing special representation of the BERT such as 'CLS', 'SEP', and using the aligned sequence as the input of CRF.
3-2) performing sequence labeling on the word embedding sequence obtained by using BERT, and labeling whether the words in the input sequence are the beginning ('B') or the internal part ('I') of the trigger word by using CRF only when introducing a BIO labeling method into the task of the chapterOr is independent of the trigger word ("O"). Then the input sequence is labeled by a CRF model to obtain a labeled sequence Ci=[c1,...,ci,…,cn]Wherein c isi∈{B,I,O};
3-3) obtaining a CRF labeling sequence Ci=[c1,...,ci,…,cn]Then for c thereiniWord w for e { B, I }iOr phrase gi=[wp,...,wq]The word w is found from the results of BERTiOr phrase giVector representation of (1), wherein the phrase gi=[wp,...,wq]And taking the average value of word embedding of each word in the phrase as a vector representation of the phrase. The resulting vector is then fed to a fully-connected neural network to make a determination of the particular event type for the word or phrase.
In a preferred embodiment of the event element extraction of the invention, in step 4) the event elements are extracted according to the following specific steps:
4-1) after tokenizing and vectorizing the input sequence with the BERT model, align the sequence with the original tag sequence and remove BERT's special tokens such as "[CLS]" and "[SEP]".
4-2) for the input at the current time step, look up in the syntactic influence matrix the degree to which the other components of the corresponding sentence influence the current input, and add the syntactic influence matrix into the computation of the node; the same computation is applied in the backward LSTM pass, fusing the contextual syntactic influence information into the vector representation of the whole sentence.
4-3) through the forward and backward computations, a new vector representation sequence and a representation of the whole sentence are obtained. For any pair of a candidate event trigger word and a candidate event element entity, the corresponding word vectors are found in the new representation sequence, concatenated with the event type, and input into a fully-connected classifier to classify the element role.
The joint event extraction method provided by the invention computes the losses of the event trigger word extraction module and the event element extraction module respectively with a cross-entropy loss function and trains event trigger word extraction and event element extraction jointly to avoid the problem of error accumulation; so that the loss terms of the two sub-tasks converge simultaneously, the final loss is represented by the sum of the losses of the two sub-tasks. Meanwhile, appropriate penalty factors γ_t and γ_a are introduced into the loss function used in joint training and tuned to obtain the most suitable loss function. The loss of the final joint model is:
L = γ_t^k · L_t + γ_a^(1−k) · L_a
where the first term represents the loss of the event trigger word extraction module and the second term the loss of the event element extraction module (the specific parameter meanings are explained in the corresponding sections); γ_t and γ_a correspond respectively to the two main error cases, event trigger word extraction errors and event element extraction errors: if the event trigger word is in error, i.e. k = 1, the loss of the event trigger word extraction module is multiplied by the penalty factor γ_t; if only the event element role classification is wrong, i.e. k = 0, the loss of the event element extraction module is multiplied by the penalty factor γ_a. The parameters of the joint loss function are learned with the AdamW optimizer.
Compared with the prior art, the invention has the following advantages:
1) Compared with most current joint event extraction methods, the joint method based on topic features and implicit sentence structure addresses three challenges faced by event extraction tasks. For the ambiguity of event trigger words, the BERT vector representation carrying sentence context semantics is combined with the LDA representation carrying topic distribution information to obtain the topic representation of the document, which is introduced as a feature into the event extraction modeling process and disambiguates trigger words to a certain extent.
2) For the error accumulation that syntactic analysis, an upstream task quite important for event extraction, may cause, the invention extracts the syntactic information hidden in the BERT word embedding results and trains this process jointly with the two event extraction sub-tasks under joint optimization, introducing syntactic information while avoiding the problem of error accumulation.
3) Meanwhile, the two methods allow the model to label multiple event trigger words in one sentence, the trigger words being assumed by default to belong to different events, which answers the challenge of the multi-event problem. In addition, each entity in the candidate entity set of a sample is matched pairwise with the candidate trigger words and the relationship (element role) between them is then determined; that is, the model allows one entity to serve as an event element in multiple events, solving the problem of overlapping event elements. Experiments prove that the method effectively solves these three problems, outperforms other methods in recall, precision, and F1 score, and can construct an efficient, high-performance joint event extraction model.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a general framework of the present invention;
FIG. 3 is a complete flow chart of the event extraction algorithm of the present invention.
Detailed Description
In order to enhance the understanding and appreciation of the present invention, the following detailed description of the invention is provided in connection with the examples. Example 1: referring to fig. 1-3, an event extraction method based on topic features and implicit sentence structures includes the following 5 steps:
Step 1): first preprocess the data, extracting the topic features during preprocessing; then process the sample data into sentence-level form and extract the contextual features of each sentence. The specific steps are as follows:
(1) Document topic feature extraction
For all documents in the data set, obtain the document context features S = [s_1, s_2, ..., s_n] based on Sentence-Transformers and the LDA topic distribution features L = [l_1, l_2, ..., l_n]. Because the dimension of a topic distribution vector l_i is the preset number of topics while the document context feature vector s_i has up to 768 dimensions, directly concatenating the two would lose the document topic distribution features; the model therefore needs to fully fuse the document information from these two different angles without losing the topic distribution information. To this end, the invention uses an auto-encoder to fuse the two feature vectors effectively: for each document D_i, the context vector representation and the topic distribution vector representation are concatenated with an importance factor γ to obtain a high-dimensional vector representation v_i = [s_i, γ·l_i], and this high-dimensional vector v_i is then used to train an auto-encoder that reduces its dimensionality so as to fuse the information of the topic distribution feature l_i and the context feature s_i.
Through self-supervised learning on the original high-dimensional concatenated vectors v_i, the auto-encoder trains an encoder that maps v_i to a low-dimensional latent-space representation and a decoder that maps the low-dimensional vector back to the high-dimensional concatenated vector v_i, where v_i = [s_i, γ·l_i] and γ is the importance factor.
Finally, the trained encoder is used to obtain the final topic feature vector representation T_i of the i-th document:
T_i = σ(W_e([s_i, γ·l_i]) + b_e)
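As an illustration of this fusion, the following is a minimal Python sketch of training such an auto-encoder on the concatenated vectors; the layer sizes, the single linear encoder and decoder, the sigmoid activation, and the dummy inputs are assumptions of the sketch, not values fixed by the disclosure.

```python
# Hypothetical sketch of the fusion auto-encoder; dimensions and activations
# are illustrative assumptions.
import torch
import torch.nn as nn

class FusionAutoEncoder(nn.Module):
    def __init__(self, context_dim=768, topic_dim=50, latent_dim=128):
        super().__init__()
        in_dim = context_dim + topic_dim
        self.encoder = nn.Linear(in_dim, latent_dim)      # W_e, b_e
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, s_i, l_i, gamma=0.5):
        v_i = torch.cat([s_i, gamma * l_i], dim=-1)       # v_i = [s_i, gamma * l_i]
        t_i = torch.sigmoid(self.encoder(v_i))            # T_i = sigma(W_e(v_i) + b_e)
        return t_i, self.decoder(t_i), v_i

model = FusionAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
s = torch.randn(32, 768)      # Sentence-Transformer document vectors (dummy)
l = torch.rand(32, 50)        # LDA topic distributions (dummy)
for _ in range(100):          # self-supervised reconstruction training
    t_i, recon, v_i = model(s, l)
    loss = mse(recon, v_i)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, only the encoder output t_i is kept, serving as the topic feature T_i of each document.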
(2) Sentence context feature extraction
For sentence context features, the invention uses BERT to obtain the word embedding information of the input sequence. BERT is a Transformer-based multi-layer bidirectional language representation model that derives deep representations with contextual information by learning the left and right context of each word.
Specifically, BERT consists of N identical Transformer encoder modules. Denoting a Transformer encoder module as Trans(x), the encoding operations are:
h_0 = S·W_s + W_p
h_α = Trans(h_{α−1}), α ∈ [1, N]
where S is the one-hot encoding of each word in the input sentence, W_s is the word embedding matrix, W_p is the position embedding matrix indexed by the position p of the current word in the input sequence, h_α is the hidden state vector representing the context representation of the input sentence at layer α, and N is the number of Transformer encoder modules. Considering the effective position encoding length of BERT and the size of the model actually trained, the invention sets the maximum sequence length to maxLength = 200.
For each input sentence W = [w_1, w_2, ..., w_n], BERT encoding yields the context features H_i = [h_1, h_2, ..., h_n].
After the context features H_i = [h_1, h_2, ..., h_n] of a sample sentence and the topic features T_i of its document are obtained, and because both are high-dimensional vectors whose direct concatenation would burden subsequent modules, the invention reduces the dimension of the concatenated features through a fully-connected neural network:
x_j = σ(W_f([h_j, T_i]) + b_f)
This yields the final feature representation X = [x_1, x_2, ..., x_n] of the sentence, which is fed into the subsequent modules for the event extraction tasks.
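A minimal sketch of this encoding and dimension-reduction step, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the 128-dimensional topic vector (matching the latent size of the previous sketch) and the 256-dimensional output are illustrative choices:

```python
# Hypothetical sketch: encode a sentence with BERT, attach the document topic
# feature T_i to every word vector, and reduce the dimension with a
# fully-connected layer, i.e. x_j = sigma(W_f([h_j, T_i]) + b_f).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

reduce = nn.Sequential(nn.Linear(768 + 128, 256), nn.Sigmoid())  # W_f, b_f

enc = tokenizer("An explosion occurred near the station.",
                return_tensors="pt", max_length=200,
                truncation=True, padding="max_length")   # maxLength = 200
with torch.no_grad():
    H = bert(**enc).last_hidden_state                    # [1, 200, 768]: h_1 .. h_n

T_i = torch.randn(1, 128)                                # topic feature from the auto-encoder
T_rep = T_i.unsqueeze(1).expand(-1, H.size(1), -1)       # repeat T_i for every word
X = reduce(torch.cat([H, T_rep], dim=-1))                # final features x_1 .. x_n
```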
Step 2) extract the sentence structure information hidden in the word embedding sequence of each sentence:
(1) Replace any word x_i of the input sequence W = [x_1, ..., x_i, ..., x_n] with the masking character [MASK] to obtain a new input sequence W' = [x_1, ..., [MASK], ..., x_n]; input this sequence into BERT to obtain the result h_i, and take h_i as the representation of x_i.
(2) To obtain the influence of another component x_j on x_i, further replace x_j in W' = [x_1, ..., [MASK], ..., x_n] with [MASK] as well, then input the sequence into BERT to obtain a new representation H_ij of x_i.
(3) Compute the value of f(x_i, x_j), which describes how the representation of x_i by BERT changes when the context word x_j is absent from the sentence. The invention computes the distance between H_ij and h_i in semantic space to characterize this influence, using the Euclidean distance:
f(x_i, x_j) = ||H_ij − h_i||_2
Because of the particularity of BERT's tokenization mechanism, some words may be split into several sub-words; when performing the masking operation, the mask is therefore applied to all of BERT's sub-word sequences of a word or text span as a unit. Meanwhile, considering the gold entity set given by ACE05, this section represents a sentence as a sequence W = [x_1, ..., x_i, ..., x_n] of entity text spans, where x_i = [w_p, ..., w_q] denotes the i-th entity text span consisting of the p-th through q-th words. When computing the influence degree of a multi-word entity mention, the method uses the head-word label given for multi-word entities in the ACE2005 data set and introduces an importance factor k: after the influence degree of each word in the multi-word entity is computed separately, the influence degree of the head word is multiplied by k, and the average influence degree over all words in the span is then taken as the overall influence degree of the span.
For any pair of text spans <x_i, x_j> in a sentence, repeating the above steps and computing f(x_i, x_j) yields an N×N influence matrix F ∈ R^{N×N} with F_ij = f(x_i, x_j), where N is the length of the input sequence W = [x_1, ..., x_i, ..., x_n]. The matrix F is the extracted sentence structure information: it represents the degree of mutual influence between any two sentence components and thus illustrates the association relationships among them. The specific algorithm flow is as follows:
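The original presents the algorithm flow as a figure; the following is a minimal Python sketch of the masking procedure, assuming the Hugging Face transformers library with a fast tokenizer, whole words standing in for sentence components, mean-pooled sub-word vectors, and the span merging and head-word factor k omitted:

```python
# Hypothetical sketch of the influence-matrix construction via double masking.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode(words):
    """Return one vector per word, mean-pooling BERT sub-word vectors."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc).last_hidden_state[0]
    ids = enc.word_ids(0)                     # map sub-word positions to words
    return torch.stack([
        out[[k for k, wid in enumerate(ids) if wid == w]].mean(dim=0)
        for w in range(len(words))
    ])

def influence_matrix(words):
    n = len(words)
    F = torch.zeros(n, n)
    for i in range(n):
        masked_i = words[:i] + ["[MASK]"] + words[i + 1:]
        h_i = encode(masked_i)[i]             # representation of x_i (x_i masked)
        for j in range(n):
            if j == i:
                continue
            masked_ij = list(masked_i)
            masked_ij[j] = "[MASK]"           # additionally mask x_j
            H_ij = encode(masked_ij)[i]       # new representation of x_i
            F[i, j] = torch.dist(H_ij, h_i)   # f(x_i, x_j) = ||H_ij - h_i||_2
    return F

F = influence_matrix(["He", "fired", "the", "manager"])
```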
Step 3) annotate the event trigger word sequence with the cascaded CRF:
(1) An input sequence W = [w_1, ..., w_i, ..., w_n] is tokenized and vectorized by the BERT model into H_i = [h_1, ..., h_i, ..., h_n]; this sequence is aligned with the original tag sequence, which includes removing BERT's special tokens such as "[CLS]" and "[SEP]", and the aligned sequence serves as the input of the CRF.
(2) For sequence labeling of the word embedding sequence obtained from BERT, the BIO labeling scheme is introduced for this task, and the CRF is used only to label each word of the input sequence as the beginning of a trigger word ("B"), an internal part of a trigger word ("I"), or independent of any trigger word ("O"). Thus H_i = [h_1, h_2, ..., h_n], after labeling by the CRF model, yields the label sequence C_i = [c_1, ..., c_i, ..., c_n] with c_i ∈ {B, I, O}.
(3) After the CRF label sequence C_i = [c_1, ..., c_i, ..., c_n] is obtained, for each word w_i or phrase g_i = [w_p, ..., w_q] whose label c_i ∈ {B, I}, the vector representation of the word w_i or phrase g_i is looked up in the BERT output; the vector representation of a phrase g_i = [w_p, ..., w_q] is the average of the word embeddings of the words it contains. The resulting vector is then fed to a fully-connected neural network to judge the specific event type of the word or phrase:
p_i = softmax(W_t · g_i + b_t)
This finally yields the trigger word label sequence Y^t of the sentence. The event trigger word extraction module applies the following cross-entropy loss:
L_t = −Σ_{i=1}^{N} y_i · log(p_i)
where N is the length of the input sequence W, y_i is the event type label of the i-th word in W, and p_i denotes the predicted event type distribution of the i-th word as an event trigger.
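A hypothetical sketch of the two-stage cascade follows, assuming the third-party pytorch-crf package (not named in the patent) for the boundary tagger and a plain linear layer for the type judgment; the tag ids, the 33 event types (the ACE05 subtype count), and the dummy inputs are illustrative:

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed dependency)

B, I, O = 0, 1, 2                                # BIO tag ids (illustrative)
NUM_EVENT_TYPES = 33                             # ACE05 subtype count, illustrative

crf = CRF(num_tags=3, batch_first=True)          # stage 1: boundary labeling
emit = nn.Linear(768, 3)                         # emission scores from BERT vectors
type_clf = nn.Linear(768, NUM_EVENT_TYPES)       # stage 2: event type judgment

H = torch.randn(1, 10, 768)                      # BERT word embeddings (dummy)
tags = crf.decode(emit(H))[0]                    # BIO sequence, e.g. [O, B, I, ...]

# Group B/I labels into trigger spans.
spans, cur = [], None
for pos, tag in enumerate(tags):
    if tag == B:                                 # a new trigger begins
        if cur is not None:
            spans.append(cur)
        cur = [pos]
    elif tag == I and cur is not None:           # the current trigger continues
        cur.append(pos)
    else:                                        # "O" closes any open span
        if cur is not None:
            spans.append(cur)
        cur = None
if cur is not None:
    spans.append(cur)

# Stage 2: average the word vectors of each span and judge its event type.
for span in spans:
    g = H[0, span].mean(dim=0)                   # phrase vector = mean of words
    event_type = type_clf(g).softmax(-1).argmax().item()
```

During training, the CRF negative log-likelihood `-crf(emit(H), gold_tags)` and the cross-entropy of `type_clf` would play the roles of the boundary and type losses respectively.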
Step 4) introduce the implicit sentence structure information with a Bi-LSTM to extract event elements:
A Bi-LSTM network is used, with the data of the influence matrix introduced into the forward and backward recursions so that corresponding links are established between the current word node and strongly related word nodes; the syntactic information can thus be transmitted between LSTM nodes and is finally fused into the vector representations of words. The event element extraction module proceeds in three main steps:
(1) An input sequence W = [w_1, w_2, ..., w_n] is tokenized and vectorized by the BERT model, then aligned with the original tag sequence, including removal of BERT's special tokens "[CLS]" and "[SEP]", to obtain H = [h_1, h_2, ..., h_n].
(2) At time step t, the forward LSTM unit computes as follows: for the current input h_t, look up in the syntactic influence matrix F the degree to which the other components of the corresponding sentence influence h_t. Since the influence matrix F describes the influence degrees among the components of the sentence, when constructing F the words inside a multi-word entity span are merged into one sentence component according to the entity set labels given by the data set, and all words inside a multi-word entity in the input word vector sequence H = [h_1, h_2, ..., h_n] use the influence-matrix data of the entity span they belong to. Meanwhile, the invention sets a threshold π: only when another component h_j occurs before time step t and its influence degree F_jt on h_t exceeds the threshold π is the information of h_j introduced into the computation of h_t, through a fully-connected network that is introduced so as not to disturb the LSTM computation and that fuses the information of h_t and h_j:
d_t = σ(W_d([h_t, h_j]) + b_d)
where the fused vector d_t carries the information of both h_t and h_j into the LSTM computation. Applying the same computation in the backward LSTM pass fuses the contextual syntactic influence information into the vector representation of the whole sentence.
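The source text does not reproduce the exact fusion formula, so the following sketch is one illustrative reading of the mechanism, under the assumption that each earlier component whose influence on the current input exceeds π is fused into that input through a fully-connected gate before the LSTM cell runs; all names and sizes are assumptions:

```python
# Hypothetical sketch of the syntax-aware forward LSTM pass.
import torch
import torch.nn as nn

class SyntaxLSTM(nn.Module):
    def __init__(self, dim=256, pi=0.5):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())  # d_t
        self.pi = pi

    def forward(self, H, F):
        """H: [n, dim] word vectors; F: [n, n] influence matrix."""
        n, dim = H.shape
        h = torch.zeros(dim)
        c = torch.zeros(dim)
        outs = []
        for t in range(n):
            x_t = H[t]
            for j in range(t):                    # components before step t
                if F[j, t] > self.pi:             # strongly related component
                    x_t = self.fuse(torch.cat([x_t, H[j]]))
            h, c = self.cell(x_t.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h[0], c[0]
            outs.append(h)
        return torch.stack(outs)                  # forward half of the Bi-LSTM

H = torch.randn(6, 256)                           # fused word features (dummy)
F = torch.rand(6, 6)                              # influence matrix from step 2
forward_states = SyntaxLSTM()(H, F)               # run again on the reversed input
                                                  # for the backward half, then concat
```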
(3) Through the forward and backward computations, a new vector representation sequence H^LSTM = [h_1^LSTM, ..., h_n^LSTM] and a representation o^LSTM of the entire sentence are obtained. For any pair <trigger_i, entity_j> composed of a candidate event trigger word and a candidate event element entity, the corresponding word vectors h_i^t and h_j^e are found in the new representation sequence H^LSTM; if the trigger word or the entity spans multiple words, the average of the vectors of all words in the span is taken as the overall representation h_i or h_j. The two vectors are concatenated together with the event type and input into a fully-connected classifier to classify the element role:
p_ij^a = softmax(W_a([h_i^t, type_i, h_j^e]) + b_a)
where p_ij^a denotes the distribution over the element roles that the j-th entity plays in the event represented by the i-th trigger word, h_i^t is the vector representation of the i-th trigger word predicted by the event trigger word extraction module, type_i is the event type corresponding to that trigger word, and h_j^e is the vector representation of the j-th entity in the entity sequence.
After the final event element role label sequence Y^a is obtained, the loss function of the event element extraction module again adopts the cross-entropy loss:
L_a = −Σ_{i=1}^{M} y_i^a · log(p_i^a)
where M is the number of <trigger word, entity> pairs; y_i^a is the element role played by the entity in the i-th <trigger word, entity> pair; p_i^a denotes the predicted distribution over element roles for the entity in the i-th <trigger word, entity> pair.
Step 5) event extraction joint modeling method:
the model classifies the pairwise matching of event trigger words and event elements, and multiple events can be used for the same event<Event trigger word tiEvent type eiEvent element aiElement role ri>And (4) representing by a quadruple. There may be multiple cases if there is an error in a quad, and since the classification of event elements in the previous work is generally not good enough on the ACE05 event extraction dataset, this section mainly discusses the case of event element role error. There may be two cases of event element role errors: an event element extraction module obtains wrong global information in the joint modeling process of shared information for event trigger word detection errors or event type discrimination errors; the other is that the event trigger word extraction is correct, and the event element extraction is wrong, which is also divided into two cases: if the event element role r is not contained in the event element role set predefined by the event type e, the event element extraction module still cannot well judge the element type under the condition of giving prior of the event type; if the event element role r is contained in the event element role set predefined by the event type e but is not the correct role corresponding to the current event element, it is indicated that the event element extraction module can effectively utilize prior information brought by the event trigger word and the event type, but the role type cannot be correctly determined under the condition that the number of the element roles is reduced. For model optimization, solving the above three conditions can bring more model lifting, so the loss generated by the above conditions should be increased, and the model can be trained better.
For the above cases, introducing appropriate penalty factor gamma for the loss function used in the joint trainingtAnd gammaaThe most suitable loss function is obtained through adjustment, and the loss of the final combined model is as follows:
Figure BDA0003296278900000111
wherein the first item represents the loss of the event trigger word extraction module, and the second item represents the eventThe loss of the element extraction module, and the meaning of the specific parameters refers to the corresponding chapters; gamma raytAnd gammaaThe two main error cases are respectively corresponded to: if the event trigger word has an error, i.e. k equals 1, the loss of the event trigger word extraction module is multiplied by a penalty factor γtIf the event element role classification is wrong only, i.e., k is 0, the loss of the event element extraction module is multiplied by a penalty coefficient γa
Parameters were learned using an AdamW optimizer for the loss function of the joint model.
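Under the reading above, the joint objective can be sketched as follows; the concrete γ values, the dummy losses, and the learning rate are illustrative assumptions:

```python
# Sketch of the joint objective L = gamma_t**k * L_t + gamma_a**(1-k) * L_a.
import torch

def joint_loss(l_t, l_a, k, gamma_t=2.0, gamma_a=1.5):
    """k = 1: the trigger word was wrong; k = 0: only the element role was wrong."""
    return (gamma_t ** k) * l_t + (gamma_a ** (1 - k)) * l_a

w = torch.nn.Parameter(torch.randn(4))        # stand-in for the model parameters
opt = torch.optim.AdamW([w], lr=2e-5)         # AdamW, as stated in the text
l_t = (w[:2] ** 2).sum()                      # dummy trigger-extraction loss
l_a = (w[2:] ** 2).sum()                      # dummy element-extraction loss
opt.zero_grad()
joint_loss(l_t, l_a, k=1).backward()          # trigger error: amplify the trigger loss
opt.step()
```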
It should be noted that the above embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent substitutions or modifications made on the basis of the above technical solutions fall within the scope of the present invention.

Claims (5)

1. An event extraction method based on topic features and implicit sentence structures is characterized by comprising the following steps:
1) data processing and topic feature extraction: reconstruct the original data set into JSON format, extract the topic features of each sample document in the read data set, and then segment the sample documents into sentences with the sentence segmentation tool in the NLTK package to obtain sample sentences;
2) implicit sentence structure extraction: for each sample sentence, first use the language model BERT to obtain the word embeddings in the sentence as its contextual features; then, using a masking mechanism, compute from the word embedding sequence the degree of mutual influence between all components of the sentence, as the implicit sentence structure features for the subsequent joint event extraction method;
3) an event trigger word extraction module based on a cascaded CRF (conditional random field) adopts a cascaded sequence labeling method to decompose the extraction task into two sub-tasks, boundary labeling and type judgment: the boundary of an event trigger word is labeled first, and the corresponding event type is then judged;
4) an event element extraction module fusing syntactic information into a Bi-LSTM introduces the data of the influence matrix into the forward and backward recursions and establishes corresponding links between the current word node and strongly related word nodes, so that the syntactic information can be transmitted between LSTM nodes and is finally fused into the vector representations of words;
5) joint training: compute the losses of the event trigger word extraction module and the event element extraction module with a cross-entropy loss function, and train event trigger word extraction and event element extraction jointly to avoid the problem of error accumulation; so that the loss terms of the two sub-tasks converge simultaneously, the final loss is represented by the sum of the losses of the two sub-tasks.
2. The event extraction method based on topic features and implicit sentence structure according to claim 1, wherein in step 1) the topic features are extracted as follows:
1-1) obtain a context representation carrying contextual semantic information for each document using a Sentence-Transformer oriented to long-sentence encoding;
1-2) then obtain the topic distribution information of each document using the topic model LDA;
1-3) train an auto-encoder with the two vectors to fuse them, taking the output of the encoder as the topic feature of each document.
3. The event extraction method based on topic features and implicit sentence structure according to claim 1, wherein in step 2) the implicit sentence structure is extracted as follows:
2-1) replace any word x_i of the input sequence with the masking character [MASK] to obtain a new input sequence; input this sequence into BERT to obtain the result h_i, and take h_i as the representation of x_i;
2-2) to obtain the influence of another component x_j of the sentence on x_i, further replace x_j in the input sequence with [MASK] as well, then input the sequence into BERT to obtain a new representation H_ij of x_i;
2-3) compute the distance f(x_i, x_j) between H_ij and h_i in semantic space using the Euclidean distance, finally obtaining the pairwise influence degree matrix F between the components of the sentence; the matrix F is the implicit sentence structure information and can represent the degree of mutual influence between any two sentence components.
4. The event extraction method based on topic features and implicit sentence structure according to claim 1, wherein step 3) comprises the following specific steps:
3-1) tokenize and vectorize the input sequence with the BERT model, align it with the original tag sequence, remove BERT's special tokens such as "[CLS]" and "[SEP]", and take the aligned sequence as the input of the CRF;
3-2) perform sequence labeling on the word embedding sequence obtained from BERT; with the BIO labeling scheme introduced for this task, the CRF only labels each word of the input sequence as the beginning of a trigger word ("B"), an internal part of a trigger word ("I"), or unrelated to any trigger word ("O"), so that after labeling by the CRF model the input sequence yields the label sequence C_i = [c_1, ..., c_i, ..., c_n], where c_i ∈ {B, I, O};
3-3) after obtaining the CRF label sequence C_i = [c_1, ..., c_i, ..., c_n], for each word w_i or phrase g_i = [w_p, ..., w_q] whose label c_i ∈ {B, I}, look up the vector representation of the word w_i or phrase g_i in the BERT output, the vector representation of a phrase g_i = [w_p, ..., w_q] being the average of the word embeddings of the words it contains; then feed the resulting vector to a fully-connected neural network to judge the specific event type of the word or phrase.
5. The event extraction method based on topic features and implicit sentence structure according to claim 1, wherein the event elements in step 4) are extracted according to the following specific steps:
4-1) after tokenizing and vectorizing the input sequence with the BERT model, align the sequence with the original tag sequence and remove BERT's special tokens such as "[CLS]" and "[SEP]";
4-2) for the input at the current time step, look up in the syntactic influence matrix the degree to which the other components of the corresponding sentence influence the current input, and add the syntactic influence matrix into the computation of the node; apply the same computation in the backward LSTM pass to fuse the contextual syntactic influence information into the vector representation of the whole sentence;
4-3) obtain a new vector representation sequence and a representation of the whole sentence through the forward and backward computations; for any pair of a candidate event trigger word and a candidate event element entity, find the corresponding word vectors in the new representation sequence, concatenate them with the event type, and input the result into a fully-connected classifier to classify the element role.
CN202111178364.5A 2021-10-09 2021-10-09 Event extraction method based on topic features and implicit sentence structure Pending CN113901813A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111178364.5A CN113901813A (en) 2021-10-09 2021-10-09 Event extraction method based on topic features and implicit sentence structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111178364.5A CN113901813A (en) 2021-10-09 2021-10-09 Event extraction method based on topic features and implicit sentence structure

Publications (1)

Publication Number Publication Date
CN113901813A true CN113901813A (en) 2022-01-07

Family

ID=79190805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111178364.5A Pending CN113901813A (en) 2021-10-09 2021-10-09 Event extraction method based on topic features and implicit sentence structure

Country Status (1)

Country Link
CN (1) CN113901813A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860002A (en) * 2022-12-27 2023-03-28 中国人民解放军国防科技大学 Combat task generation method and system based on event extraction
CN115860002B (en) * 2022-12-27 2024-04-05 中国人民解放军国防科技大学 Combat task generation method and system based on event extraction
CN117828075A (en) * 2023-12-14 2024-04-05 北京市农林科学院信息技术研究中心 Agricultural condition data classification method, agricultural condition data classification device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination