CN113282714B - Event detection method based on differential word vector representation - Google Patents

Event detection method based on differential word vector representation

Info

Publication number
CN113282714B
CN113282714B
Authority
CN
China
Prior art keywords
word
representing
module
event
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110726463.6A
Other languages
Chinese (zh)
Other versions
CN113282714A (en)
Inventor
唐九阳
廖劲智
赵翔
李欣奕
谭真
陈盈果
黄魁华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110726463.6A priority Critical patent/CN113282714B/en
Publication of CN113282714A publication Critical patent/CN113282714A/en
Application granted granted Critical
Publication of CN113282714B publication Critical patent/CN113282714B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344 — Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F16/3346 — Information retrieval of unstructured textual data; query execution using a probabilistic model
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F40/126 — Handling natural language data; use of codes for handling textual entities; character encoding
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event detection method based on discriminative word vector representation, comprising the following steps: constructing a discriminative word vector representation model that comprises an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module maps each word in a sentence to a representation in a high-dimensional vector space, the Gaussian kernel function module increases the distinguishability between the representations of the words that make up a trigger word and those of the other words outside the trigger word, and the adversarial learning module improves the generalized recognition of positive trigger-word samples; the trained discriminative word vector representation model then predicts, for every event type one by one, whether each word is the start position or the end position of a trigger word of that type, and all possible trigger words are output by combining the predicted start and end positions.

Description

Event detection method based on differential word vector representation
Technical Field
The invention relates to the technical field of event detection in natural language processing, and in particular to an event detection method based on discriminative word vector representation.
Background
Retrieving and extracting event instances from text plays a key role in natural-language-related tasks such as automatic question answering and dialog systems, and the first task to be solved there is event detection. Event Detection (ED) addresses two sub-problems: 1) trigger identification, where a trigger is a word or phrase that refers to a particular event in the text, including but not limited to a single verb, noun, or phrase; 2) trigger classification, i.e., determining the category of the event from the trigger word and its surrounding text.
ED has attracted extensive attention from researchers because it benefits many downstream applications in natural language processing, such as question answering, spatio-temporal event information retrieval, and machine reading comprehension. Specifically, existing methods incorporate feature-engineering techniques to construct features manually; adopt data-augmentation techniques to enlarge the training data and alleviate data scarcity; or, building on recent advances in neural networks, introduce latent word representations to perform ED more effectively.
Between the two subtasks of ED, the result of trigger identification is the basis for trigger classification. However, correctly identifying triggers is not easy: data scarcity has become a non-negligible problem in ED, and it requires the model to determine the textual boundary of a trigger in the sentence more accurately. If the model does not focus on the word representation, the semantic information contained in the word vectors may be too ambiguous, which in turn makes detecting trigger boundaries a difficult challenge. In this case, if the model is too "cautious", it tends to make only confident predictions, possibly ignoring some triggers and thus missing events; if the model is too "aggressive", it introduces considerable prediction noise, which also increases the difficulty of detecting trigger boundaries. This embodiment defines this as the trigger fragment detection problem in event detection.
This problem severely affects the performance of ED. First, existing ED methods generate many false negatives, with precision much higher than recall. Second, error analysis shows that more than 83% of the errors can be attributed to this problem. For example, PLMEE, a representative state-of-the-art (SOTA) method, not only mispredicts the number of trigger words but also confuses the boundaries of certain triggers (e.g., "penalty" versus "death penalty"). Furthermore, current ED methods ignore the trigger fragment detection problem and lack dedicated treatment of it when identifying event triggers.
Disclosure of Invention
The present invention is directed to solving at least the above problems of the prior art. Therefore, the invention provides an event detection method based on discriminative word vector representation. The method learns a discriminative word vector representation (DER) from text; with DER the model is expected to accurately identify each trigger word and correctly mark its span. To achieve this goal, the method builds a new framework on the classical neural information-extraction solution that exploits two promising techniques: 1) Gaussian kernel function encoding, which enlarges the difference between the representations of the words inside a trigger and the other words in the sentence, and 2) an adversarial learning strategy, which promotes the generalized recognition of positive trigger-word samples.
A method of event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions;
Further, in the encoding module, a BERT-based language representation model is used as the encoder. BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation. Meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge; using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] ··· [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
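For illustration only (not part of the claimed method), a minimal Python sketch of this input construction is given below; it assumes the HuggingFace transformers tokenizer, and the upper-type strings are merely illustrative (the ACE 2005 upper event types).

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

upper_types = ["Life", "Movement", "Transaction", "Business",
               "Conflict", "Contact", "Personnel", "Justice"]  # m = 8

def build_input(sentence: str):
    # [CLS] sentence [SEP] UT_1 [SEP] ... [SEP] UT_m [SEP]
    text = "[CLS] " + sentence + " [SEP] " + " [SEP] ".join(upper_types) + " [SEP]"
    return tokenizer(text, add_special_tokens=False, return_tensors="pt")

inputs = build_input("An American tank fired on the Palestine Hotel.")
with torch.no_grad():
    E = encoder(**inputs).last_hidden_state  # (1, l, d) contextual word vectors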
Furthermore, the Gaussian kernel function module adopts the following steps:
After passing through the encoding module, a representation E ∈ R^{l×d} of the words is obtained, where R denotes the real space, d the dimension, and l the sequence length of the input text. The whole Gaussian kernel mapping process consists of two parts, an average word vector representation and a kernel function, namely:
p(X) = N(X | mean(E), K_EE)
X = f(E)
where p denotes the prior probability, which follows a Gaussian distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the fully connected network mapping the word vector representations, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector.
Interpolation is used to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order, giving the interpolated sequence I = (I_1, I_2, ..., I_m), with m the number of samples selected for interpolation. When the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution; at this point:
p(U) = N(mean(I), K_II)
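For illustration only, a short Python sketch of the kernel matrix and of the interpolation-based sampling is given below; the convex-combination sampling rule is an assumption, since the text only states that word vectors are drawn out of sequence order.

import torch

def gaussian_kernel(A, B, gamma=0.5):
    # [K_AB]_ij = exp(-gamma * ||A_i - B_j||^2) for word-vector matrices A, B
    return torch.exp(-gamma * torch.cdist(A, B, p=2) ** 2)

def interpolate_samples(E, num_samples=16):
    # Draw word vectors out of sequence order as random convex combinations of
    # pairs of encoded words, yielding the interpolated sequence I of shape (m, d).
    l = E.size(0)
    idx_a = torch.randint(0, l, (num_samples,))
    idx_b = torch.randint(0, l, (num_samples,))
    lam = torch.rand(num_samples, 1)
    return lam * E[idx_a] + (1 - lam) * E[idx_b]

E = torch.randn(20, 768)       # stand-in for the encoder output (l, d)
K_EE = gaussian_kernel(E, E)   # (l, l) kernel matrix
I = interpolate_samples(E)     # (m, d) interpolated samples
K_II = gaussian_kernel(I, I)   # (m, m) kernel matrix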
A posterior probability q is defined whose distribution is Gaussian with mean μ and variance σ²; it is computed by a neural network, specifically:
q(U|X) = N(U | μ, σ²)
The overall optimization objective combines the losses of the modules:
L = L_s^adv + L_e^adv + α·L_G
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and the hyper-parameter α ∈ (0,1) controls the order of magnitude of the loss term. The loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, and ‖ is simply the separator in the KL notation, with no further meaning. The aim is to make the two probability distributions involved as similar as possible; for either fixed q(x) or fixed p(x), the relative entropy satisfies:
KL[q ‖ p] = ∫ q(x) ln (q(x)/p(x)) dx ≥ 0
Other measures of the closeness of probability distributions exist besides the relative entropy, for example the Bhattacharyya distance:
D_B(p, q) = −ln ∫ √(p(x) q(x)) dx
The relative entropy is chosen here because it lends itself better to sampling-based computation in neural networks.
p(U|X) denotes the conditional probability, which also follows a Gaussian distribution and is computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose.
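As an aside to the choice between relative entropy and the Bhattacharyya distance above, both measures have closed forms for univariate Gaussians; a small illustrative sketch (not part of the patented method) is:

import math

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    # Relative entropy KL[q || p] between two univariate Gaussians (closed form)
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bhattacharyya_gaussian(mu_p, var_p, mu_q, var_q):
    # Bhattacharyya distance between two univariate Gaussians (closed form)
    var_avg = 0.5 * (var_p + var_q)
    return (0.125 * (mu_p - mu_q) ** 2 / var_avg
            + 0.5 * math.log(var_avg / math.sqrt(var_p * var_q)))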
furthermore, the countermeasure learning module adopts the following steps of countermeasure learning method:
firstly, a random disturbance generation mode in the countermeasure learning is constructed and expressed as follows:
Figure BDA00031377426700000510
wherein r is adv Representing the random perturbation of the final input, r represents the random perturbation,
Figure BDA0003137742670000061
representing a two-norm, epsilon represents a hyper-parameter,
Figure BDA0003137742670000062
representing the loss function and theta representing the parameter to be learned in the model.
The random perturbation is generated using linear approximation, and is expressed as:
r adv =-εg/||g||
Figure BDA0003137742670000063
wherein g represents a loss function
Figure BDA0003137742670000064
The gradient of E is represented for the input word vector,
Figure BDA0003137742670000065
representing gradient operations, f model operations, and y sample labels. Word vector representation at the encoded layer E ∈ R d Add random perturbation, expressed as:
E+r adv
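A compact sketch of this perturbation step, in the style of a fast-gradient method and following the sign convention of the formula above (variable names are illustrative), is:

import torch

def adversarial_perturbation(loss, E, epsilon=1.0):
    # r_adv = -epsilon * g / ||g||, where g is the gradient of the loss with
    # respect to the encoded word vectors E (E must require gradients).
    g = torch.autograd.grad(loss, E, retain_graph=True)[0]
    r_adv = -epsilon * g / (g.norm() + 1e-12)
    return E + r_adv   # perturbed representation fed through the classifiers again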
The resulting representation is added to the event extraction backbone framework. Specifically, each word is scored against n classes, where n is the number of event types; then, for each type to be predicted, every sentence is processed by two identical classifiers, one for the start position and one for the end position. The detailed operation of the classifiers for the i-th word is as follows:
P_i^s = sigmoid(W_l E_i + b_l)
P_i^e = sigmoid(W_r E_i + b_r)
where P_i^s is the probability, over all event types, that the i-th word of the sentence is recognized as the start position of a trigger word and classified accordingly, P_i^e is the corresponding probability for the end position, sigmoid is the non-linear activation function, W_l and W_r are trainable weights of the neural network, and b_l and b_r are bias terms.
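A minimal sketch of the two per-type classifiers, assuming encoded vectors of dimension d and n event types (module and variable names are illustrative), is:

import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    def __init__(self, d, n_types):
        super().__init__()
        self.start = nn.Linear(d, n_types)   # plays the role of W_l, b_l
        self.end = nn.Linear(d, n_types)     # plays the role of W_r, b_r

    def forward(self, E):
        # E: (l, d) encoded words -> per-word, per-type start/end probabilities
        p_start = torch.sigmoid(self.start(E))   # (l, n_types)
        p_end = torch.sigmoid(self.end(E))       # (l, n_types)
        return p_start, p_end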
The integrated loss value in adversarial learning is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts.
The loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
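Averaged over words and event types, this binary cross-entropy corresponds directly to the framework loss below (a sketch, with P and L as tensors of shape (|S|, n)):

import torch.nn.functional as F

def event_bce_loss(P, L):
    # P: predicted probabilities, L: gold 0/1 labels, both of shape (|S|, n);
    # the mean over both axes realises the 1/(|T||S|) normalisation above.
    return F.binary_cross_entropy(P, L.float(), reduction="mean")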
The resulting synthesized optimization loss function is:
L = L_s^adv + L_e^adv + α·L_G
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start and end positions, and α ∈ (0,1) is a hyper-parameter controlling the order of magnitude of the loss term.
Compared with the prior art, the method has the following advantages: a new learning framework, DER, is proposed for the ED problem, comprising two newly designed modules, a Gaussian kernel function module and an adversarial learning module, which improve the ability to distinguish words inside a trigger from words outside it; the method is the first to introduce a Gaussian kernel function into ED, and this idea is orthogonal to existing SOTA solutions for ED; extensive experiments on a standard dataset show that the DER model effectively alleviates the trigger fragment recognition problem.
Drawings
FIG. 1 is a diagram of an exemplary scenario in an embodiment of the present invention;
fig. 2 shows a schematic flow diagram of a framework of an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present embodiment follows the terminology of the Automatic Content Extraction (ACE) shared task. An event mention is a phrase or sentence that describes an event and comprises a trigger word and the corresponding argument elements; the trigger word is the word that most clearly expresses the event mention.
The standard ED task on ACE comprises event trigger identification and the corresponding type classification. Consider the example in FIG. 1: "death penalty" is an event trigger whose event type is composed of the upper type "Justice" and the subtype "Execute", forming the combined type "Justice:Execute". Thus, given this sentence, ED should predict: 1) "death penalty" is an event trigger with event type "Justice:Execute"; 2) "convicted" is an event trigger with event type "Justice:Convict".
Model framework
Fig. 2 is a schematic diagram of the overall DER framework.
In order to handle the trigger span detection problem properly, the present embodiment designs two new modules to upgrade the framework. In short, the DER model consists of three components: an encoding module, a Gaussian kernel function module and an adversarial learning module.
A method of event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions (a decoding sketch is given after this list);
step 6, detecting events in the text according to the predicted trigger words.
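For step 5, a possible decoding sketch is given below; the nearest-end pairing rule and the probability threshold are assumptions, since the method only states that predicted start and end positions are combined.

def decode_triggers(p_start, p_end, threshold=0.5):
    # p_start, p_end: (l, n) per-word, per-type probabilities (tensors or nested
    # lists); returns (start_index, end_index, type_index) trigger candidates.
    l, n = len(p_start), len(p_start[0])
    triggers = []
    for k in range(n):                                   # each event type
        starts = [i for i in range(l) if p_start[i][k] > threshold]
        ends = [i for i in range(l) if p_end[i][k] > threshold]
        for s in starts:
            later_ends = [e for e in ends if e >= s]
            if later_ends:
                triggers.append((s, min(later_ends), k))  # nearest matching end
    return triggers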
Specifically, in the encoding module, a BERT-based language representation model is used as the encoder. BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation. Meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge; using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] ··· [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
Specifically, in the Gaussian kernel function module:
After passing through the encoding module, a representation E ∈ R^{l×d} of the words is obtained, where R denotes the real space, d the dimension, and l the sequence length of the input text. The whole Gaussian kernel mapping process consists of two parts, an average word vector representation and a kernel function:
p(X) = N(X | mean(E), K_EE)    (1)
X = f(E)
where p denotes the prior probability, i.e., a Gaussian probability distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the mapping of the word vector representations by a fully connected network, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)    (2)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector.
Interpolation is used to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}    (3)
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order. When the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution; at this point:
p(U) = N(mean(I), K_II)    (4)
where K_II is a kernel matrix of the same form as K_EE.
Furthermore, the adversarial learning module adopts the following adversarial learning steps:
step 401, constructing the way the random perturbation is generated in adversarial learning, expressed as:
r_adv = argmin_{r, ‖r‖₂ ≤ ε} L(E + r; θ)    (5)
where r_adv denotes the adversarial perturbation finally added to the input, r denotes a random perturbation, ‖·‖₂ denotes the two-norm, ε denotes a hyper-parameter, L denotes the loss function, and θ denotes the parameters to be learned in the model;
step 402, generating the random perturbation using a linear approximation, expressed as:
r_adv = −ε g / ‖g‖    (6)
g = ∇_E L(f(E), y)    (7)
where g denotes the gradient of the loss function L(f(E), y) with respect to the input word vector E, ∇ denotes the gradient operation, f denotes the model operation, and y denotes the sample label;
step 403, adding the random perturbation to the word vector representation E ∈ R^d of the encoding layer, expressed as:
E + r_adv
step 404, using the randomly perturbed word vector representation as the input of the loss function.
Further, the synthesized loss function during model training is:
L = L_s^adv + L_e^adv + α·L_G    (8)
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and α ∈ (0,1) denotes a hyper-parameter controlling the order of magnitude of the loss term; the loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]    (9)
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, ‖ is simply the separator in the KL notation with no further meaning, and p(U|X) denotes the conditional probability, computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)    (10)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose;
q(U|X) is the posterior probability, which satisfies a Gaussian distribution; its calculation is based on a neural network, as follows:
q(U|X) = N(U | μ, σ²)    (11)
where μ and σ² denote the mean and variance output by the neural network, i.e., the parameters of the q distribution;
the loss value of the adversarial learning module is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)    (12)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts;
the loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]    (13)
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
Experimental setup
Dataset. This embodiment is evaluated on a standard benchmark, the 2005 Automatic Content Extraction dataset (ACE 2005). Statistics of this dataset are shown in Table 1; it is the most widely used dataset in event-related tasks and contains 599 documents. All events are annotated with 8 upper types and 33 subtypes, and this embodiment evaluates classification over the 33 combined types. Following previous studies, the 599 documents are split into 529 training documents, 30 validation documents and 40 test documents.
Table 1. ACE 2005 dataset statistics (provided as an image in the original publication).
Metrics. Evaluation follows the standard criteria for event detection, which cover two aspects: identification and classification. An event trigger is correctly identified if its span matches a gold trigger; it is correctly classified if both the trigger and its event type match the gold trigger and category. Micro-averaged precision (Pre), recall (Rec) and F1 score (F1) are reported for all evaluations.
Baselines. The following nine methods are used for comparison. (1) Feature-based methods, covering three representative approaches: Cross-Event utilizes document-level information with complex features, MaxEnt adopts only manually designed features, and Combined-PSL exploits global information with a probabilistic soft logic model. (2) Augmentation-based methods, covering two representative approaches: GMLATT employs a gated cross-lingual attention mechanism to exploit the complementary information of multilingual data, and AD-DMBERT uses an adversarial model to obtain more training data. (3) Neural-based methods, covering four representative approaches: DMCNN uses a CNN to extract features automatically, GCN-ED uses a graph convolutional network to capture syntactic information, JMEE exploits multiple sources of information for more accurate context modeling, and DISTILL introduces a delta-learning method to refine generalized knowledge.
Overall results
Table 2. Overall results on ACE 2005 (provided as an image in the original publication). Apart from the DER results, all results are taken from the original papers.
On all tasks, the DER model outperforms the comparison methods and achieves SOTA performance on all metrics, which demonstrates the superiority of the proposed modules and their effectiveness in alleviating the trigger fragment detection problem. In particular, on the type classification task, the DER model outperforms the augmentation-based SOTA method AD-DMBERT by 0.2% and the neural-based SOTA method DISTILL by 1.3% in terms of F1.
Among the augmentation-based methods, although performance in this branch is relatively good, the generated samples are inaccurate and unbalanced, which leads to overfitting: the model performs well only on samples already present in the training data but lacks generalization, resulting in high precision but low recall. Compared with the augmentation-based approaches, the DER model requires no additional or external data source. Among the representation-based methods, many complex structures are designed for the ED task, which also leads to overfitting; in contrast, the DER model does not rely on complex structures.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (5)

1. A method for event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions;
step 6, detecting events in the text according to the predicted trigger words.
2. The method of claim 1, wherein, in the encoding module, a BERT-based language representation model is used as the encoder; BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation; meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge, and, using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] … [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
3. The method of claim 2, wherein the Gaussian kernel function module comprises:
after the encoding module, obtaining a representation E ∈ R^{l×d} of the words, where R denotes the real space, d the dimension, and l the sequence length of the input text, the whole Gaussian kernel mapping process consisting of two parts, an average word vector representation and a kernel function:
p(X) = N(X | mean(E), K_EE)    (1)
X = f(E)
where p denotes the prior probability, i.e., a Gaussian probability distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the mapping of the word vector representations by a fully connected network, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)    (2)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector;
using interpolation to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}    (3)
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order; when the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution, at which point:
p(U) = N(mean(I), K_II)    (4)
where K_II is a kernel matrix of the same form as K_EE.
4. The event detection method based on discriminative word vector representation as claimed in claim 3, wherein the adversarial learning module adopts the following adversarial learning steps:
step 401, constructing the way the random perturbation is generated in adversarial learning, expressed as:
r_adv = argmin_{r, ‖r‖₂ ≤ ε} L(E + r; θ)    (5)
where r_adv denotes the adversarial perturbation finally added to the input, r denotes a random perturbation, ‖·‖₂ denotes the two-norm, ε denotes a hyper-parameter, L denotes the loss function, and θ denotes the parameters to be learned in the model;
step 402, generating the random perturbation using a linear approximation, expressed as:
r_adv = −ε g / ‖g‖    (6)
g = ∇_E L(f(E), y)    (7)
where g denotes the gradient of the loss function L(f(E), y) with respect to the input word vector E, ∇ denotes the gradient operation, f denotes the model operation, and y denotes the sample label;
step 403, adding the random perturbation to the word vector representation E ∈ R^d of the encoding layer, expressed as:
E + r_adv
step 404, using the randomly perturbed word vector representation as the input of the loss function.
5. The method of claim 3, wherein the synthesized loss function during model training is:
L = L_s^adv + L_e^adv + α·L_G    (8)
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and α ∈ (0,1) denotes a hyper-parameter controlling the order of magnitude of the loss term; the loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]    (9)
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, ‖ is simply the separator in the KL notation with no further meaning, and p(U|X) denotes the conditional probability, computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)    (10)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose;
q(U|X) is the posterior probability, which satisfies a Gaussian distribution; its calculation is based on a neural network, as follows:
q(U|X) = N(U | μ, σ²)    (11)
where μ and σ² denote the mean and variance output by the neural network, i.e., the parameters of the q distribution;
the loss value of the adversarial learning module is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)    (12)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts;
the loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]    (13)
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
CN202110726463.6A 2021-06-29 2021-06-29 Event detection method based on differential word vector representation Active CN113282714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110726463.6A CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110726463.6A CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Publications (2)

Publication Number Publication Date
CN113282714A CN113282714A (en) 2021-08-20
CN113282714B true CN113282714B (en) 2022-09-20

Family

ID=77286273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726463.6A Active CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Country Status (1)

Country Link
CN (1) CN113282714B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468333B (en) * 2021-09-02 2021-11-19 华东交通大学 Event detection method and system fusing hierarchical category information
CN113806490B (en) * 2021-09-27 2023-06-13 中国人民解放军国防科技大学 Text universal trigger generation system and method based on BERT sampling
CN114707517B (en) * 2022-04-01 2024-05-03 中国人民解放军国防科技大学 Target tracking method based on open source data event extraction

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965819B (en) * 2015-07-12 2017-12-26 大连理工大学 A kind of biomedical event trigger word recognition methods based on syntax term vector
CN112148832B (en) * 2019-06-26 2022-11-29 天津大学 Event detection method of dual self-attention network based on label perception
CN111767402B (en) * 2020-07-03 2022-04-05 北京邮电大学 Limited domain event detection method based on counterstudy

Also Published As

Publication number Publication date
CN113282714A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN110209823B (en) Multi-label text classification method and system
CN108984526B (en) Document theme vector extraction method based on deep learning
CN113282714B (en) Event detection method based on differential word vector representation
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN111460820A (en) Network space security domain named entity recognition method and device based on pre-training model BERT
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
Isa et al. Indobert for indonesian fake news detection
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN113268974A (en) Method, device and equipment for marking pronunciations of polyphones and storage medium
Prabhakar et al. Performance analysis of hybrid deep learning models with attention mechanism positioning and focal loss for text classification
CN114722818A (en) Named entity recognition model based on anti-migration learning
Tang et al. Interpretability rules: Jointly bootstrapping a neural relation extractorwith an explanation decoder
Tkachenko et al. Neural Morphological Tagging for Estonian.
CN114357166A (en) Text classification method based on deep learning
Devkota et al. Knowledge of the ancestors: Intelligent ontology-aware annotation of biological literature using semantic similarity
CN112528657A (en) Text intention recognition method and device based on bidirectional LSTM, server and medium
Prajapati et al. Automatic Question Tagging using Machine Learning and Deep learning Algorithms
Lee et al. A two-level recurrent neural network language model based on the continuous Bag-of-Words model for sentence classification
CN113987090B (en) Sentence-in-sentence entity relationship model training method and sentence-in-sentence entity relationship identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant