CN113282714B - Event detection method based on differential word vector representation - Google Patents

Event detection method based on differential word vector representation

Info

Publication number
CN113282714B
CN113282714B
Authority
CN
China
Prior art keywords
word
representing
module
event
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110726463.6A
Other languages
Chinese (zh)
Other versions
CN113282714A (en)
Inventor
唐九阳
廖劲智
赵翔
李欣奕
谭真
陈盈果
黄魁华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110726463.6A priority Critical patent/CN113282714B/en
Publication of CN113282714A publication Critical patent/CN113282714A/en
Application granted granted Critical
Publication of CN113282714B publication Critical patent/CN113282714B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3344 — Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F16/3346 — Information retrieval of unstructured textual data; query execution using a probabilistic model
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F40/126 — Handling natural language data; use of codes for handling textual entities; character encoding
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/04 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event detection method based on discriminative word vector representation, comprising the following steps: constructing a discriminative word vector representation model that comprises an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module maps each word in a sentence to a representation in a high-dimensional vector space, the Gaussian kernel function module increases the distinguishability between the representations of the words that make up a trigger word and those of the other words outside the trigger word, and the adversarial learning module improves the generalized recognition of positive trigger-word samples; the trained discriminative word vector representation model then predicts, for every event type one by one, whether each word is the start position or the end position of a trigger word of that type, and all possible trigger words are output by combining the predicted start and end positions.

Description

Event detection method based on differential word vector representation
Technical Field
The invention relates to the technical field of event detection in natural language processing, and in particular to an event detection method based on discriminative word vector representation.
Background
Retrieving and extracting event instances from text plays a key role in natural-language-related tasks such as automatic question answering and dialog systems, and the first task to be solved there is event detection. Event Detection (ED) addresses two sub-problems: 1) trigger identification, where a trigger is a word or phrase that refers to a particular event in the text, including but not limited to a single verb, noun, or phrase; 2) trigger classification, i.e., determining the category of the event from the trigger word and its surrounding text.
ED has attracted extensive attention from researchers because it benefits many downstream applications in natural language processing, such as question answering, spatio-temporal event information retrieval, and machine reading comprehension. Specifically, existing methods incorporate feature-engineering techniques to construct features manually; adopt data-augmentation techniques to enlarge the training data and alleviate data scarcity; or, building on recent advances in neural networks, introduce latent word representations to perform ED more effectively.
Between the two subtasks of ED, the result of trigger identification is the basis for trigger classification. However, correctly identifying triggers is not easy: data scarcity has become a non-negligible problem in ED, and it requires the model to determine the textual boundary of a trigger in the sentence more accurately. If the model does not focus on the word representation, the semantic information contained in the word vectors may be too ambiguous, which in turn makes detecting trigger boundaries a difficult challenge. In this case, if the model is too "cautious", it tends to make only confident predictions, possibly ignoring some triggers and thus missing events; if the model is too "aggressive", it introduces considerable prediction noise, which also increases the difficulty of detecting trigger boundaries. This embodiment defines this as the trigger fragment detection problem in event detection.
This problem severely affects the performance of ED. First, existing ED methods generate many false negatives, with precision much higher than recall. Second, error analysis shows that more than 83% of the errors can be attributed to this problem. For example, PLMEE, a representative state-of-the-art (SOTA) method, not only mispredicts the number of trigger words but also confuses the boundaries of certain triggers (e.g., "penalty" versus "death penalty"). Furthermore, current ED methods ignore the trigger fragment detection problem and lack dedicated treatment of it when identifying event triggers.
Disclosure of Invention
The present invention is directed to solving at least the above problems of the prior art. Therefore, the invention provides an event detection method based on discriminative word vector representation. The method learns a discriminative word vector representation (DER) from text; with DER the model is expected to accurately identify each trigger word and correctly mark its span. To achieve this goal, the method builds a new framework on the classical neural information-extraction solution that exploits two promising techniques: 1) Gaussian kernel function encoding, which enlarges the difference between the representations of the words inside a trigger and the other words in the sentence, and 2) an adversarial learning strategy, which promotes the generalized recognition of positive trigger-word samples.
A method of event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions;
Further, in the encoding module, a BERT-based language representation model is used as the encoder. BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation. Meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge; using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] ··· [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
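For illustration only (not part of the claimed method), a minimal Python sketch of this input construction is given below; it assumes the HuggingFace transformers tokenizer, and the upper-type strings are merely illustrative (the ACE 2005 upper event types).

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

upper_types = ["Life", "Movement", "Transaction", "Business",
               "Conflict", "Contact", "Personnel", "Justice"]  # m = 8

def build_input(sentence: str):
    # [CLS] sentence [SEP] UT_1 [SEP] ... [SEP] UT_m [SEP]
    text = "[CLS] " + sentence + " [SEP] " + " [SEP] ".join(upper_types) + " [SEP]"
    return tokenizer(text, add_special_tokens=False, return_tensors="pt")

inputs = build_input("An American tank fired on the Palestine Hotel.")
with torch.no_grad():
    E = encoder(**inputs).last_hidden_state  # (1, l, d) contextual word vectors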
Furthermore, the Gaussian kernel function module adopts the following steps:
After passing through the encoding module, a representation E ∈ R^{l×d} of the words is obtained, where R denotes the real space, d the dimension, and l the sequence length of the input text. The whole Gaussian kernel mapping process consists of two parts, an average word vector representation and a kernel function, namely:
p(X) = N(X | mean(E), K_EE)
X = f(E)
where p denotes the prior probability, which follows a Gaussian distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the fully connected network mapping the word vector representations, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector.
Interpolation is used to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order, giving the interpolated sequence I = (I_1, I_2, ..., I_m), with m the number of samples selected for interpolation. When the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution; at this point:
p(U) = N(mean(I), K_II)
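For illustration only, a short Python sketch of the kernel matrix and of the interpolation-based sampling is given below; the convex-combination sampling rule is an assumption, since the text only states that word vectors are drawn out of sequence order.

import torch

def gaussian_kernel(A, B, gamma=0.5):
    # [K_AB]_ij = exp(-gamma * ||A_i - B_j||^2) for word-vector matrices A, B
    return torch.exp(-gamma * torch.cdist(A, B, p=2) ** 2)

def interpolate_samples(E, num_samples=16):
    # Draw word vectors out of sequence order as random convex combinations of
    # pairs of encoded words, yielding the interpolated sequence I of shape (m, d).
    l = E.size(0)
    idx_a = torch.randint(0, l, (num_samples,))
    idx_b = torch.randint(0, l, (num_samples,))
    lam = torch.rand(num_samples, 1)
    return lam * E[idx_a] + (1 - lam) * E[idx_b]

E = torch.randn(20, 768)       # stand-in for the encoder output (l, d)
K_EE = gaussian_kernel(E, E)   # (l, l) kernel matrix
I = interpolate_samples(E)     # (m, d) interpolated samples
K_II = gaussian_kernel(I, I)   # (m, m) kernel matrix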
A posterior probability q is defined whose distribution is Gaussian with mean μ and variance σ²; it is computed by a neural network, specifically:
q(U|X) = N(U | μ, σ²)
The overall optimization objective combines the losses of the modules:
L = L_s^adv + L_e^adv + α·L_G
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and the hyper-parameter α ∈ (0,1) controls the order of magnitude of the loss term. The loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, and ‖ is simply the separator in the KL notation, with no further meaning. The aim is to make the two probability distributions involved as similar as possible; for either fixed q(x) or fixed p(x), the relative entropy satisfies:
KL[q ‖ p] = ∫ q(x) ln (q(x)/p(x)) dx ≥ 0
Other measures of the closeness of probability distributions exist besides the relative entropy, for example the Bhattacharyya distance:
D_B(p, q) = −ln ∫ √(p(x) q(x)) dx
The relative entropy is chosen here because it lends itself better to sampling-based computation in neural networks.
p(U|X) denotes the conditional probability, which also follows a Gaussian distribution and is computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose.
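As an aside to the choice between relative entropy and the Bhattacharyya distance above, both measures have closed forms for univariate Gaussians; a small illustrative sketch (not part of the patented method) is:

import math

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    # Relative entropy KL[q || p] between two univariate Gaussians (closed form)
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bhattacharyya_gaussian(mu_p, var_p, mu_q, var_q):
    # Bhattacharyya distance between two univariate Gaussians (closed form)
    var_avg = 0.5 * (var_p + var_q)
    return (0.125 * (mu_p - mu_q) ** 2 / var_avg
            + 0.5 * math.log(var_avg / math.sqrt(var_p * var_q)))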
furthermore, the countermeasure learning module adopts the following steps of countermeasure learning method:
firstly, a random disturbance generation mode in the countermeasure learning is constructed and expressed as follows:
Figure BDA00031377426700000510
wherein r is adv Representing the random perturbation of the final input, r represents the random perturbation,
Figure BDA0003137742670000061
representing a two-norm, epsilon represents a hyper-parameter,
Figure BDA0003137742670000062
representing the loss function and theta representing the parameter to be learned in the model.
The random perturbation is generated using linear approximation, and is expressed as:
r adv =-εg/||g||
Figure BDA0003137742670000063
wherein g represents a loss function
Figure BDA0003137742670000064
The gradient of E is represented for the input word vector,
Figure BDA0003137742670000065
representing gradient operations, f model operations, and y sample labels. Word vector representation at the encoded layer E ∈ R d Add random perturbation, expressed as:
E+r adv
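A compact sketch of this perturbation step, in the style of a fast-gradient method and following the sign convention of the formula above (variable names are illustrative), is:

import torch

def adversarial_perturbation(loss, E, epsilon=1.0):
    # r_adv = -epsilon * g / ||g||, where g is the gradient of the loss with
    # respect to the encoded word vectors E (E must require gradients).
    g = torch.autograd.grad(loss, E, retain_graph=True)[0]
    r_adv = -epsilon * g / (g.norm() + 1e-12)
    return E + r_adv   # perturbed representation fed through the classifiers again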
The resulting representation is added to the event extraction backbone framework. Specifically, each word is scored against n classes, where n is the number of event types; then, for each type to be predicted, every sentence is processed by two identical classifiers, one for the start position and one for the end position. The detailed operation of the classifiers for the i-th word is as follows:
P_i^s = sigmoid(W_l E_i + b_l)
P_i^e = sigmoid(W_r E_i + b_r)
where P_i^s is the probability, over all event types, that the i-th word of the sentence is recognized as the start position of a trigger word and classified accordingly, P_i^e is the corresponding probability for the end position, sigmoid is the non-linear activation function, W_l and W_r are trainable weights of the neural network, and b_l and b_r are bias terms.
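A minimal sketch of the two per-type classifiers, assuming encoded vectors of dimension d and n event types (module and variable names are illustrative), is:

import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    def __init__(self, d, n_types):
        super().__init__()
        self.start = nn.Linear(d, n_types)   # plays the role of W_l, b_l
        self.end = nn.Linear(d, n_types)     # plays the role of W_r, b_r

    def forward(self, E):
        # E: (l, d) encoded words -> per-word, per-type start/end probabilities
        p_start = torch.sigmoid(self.start(E))   # (l, n_types)
        p_end = torch.sigmoid(self.end(E))       # (l, n_types)
        return p_start, p_end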
The integrated loss value in adversarial learning is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts.
The loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
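Averaged over words and event types, this binary cross-entropy corresponds directly to the framework loss below (a sketch, with P and L as tensors of shape (|S|, n)):

import torch.nn.functional as F

def event_bce_loss(P, L):
    # P: predicted probabilities, L: gold 0/1 labels, both of shape (|S|, n);
    # the mean over both axes realises the 1/(|T||S|) normalisation above.
    return F.binary_cross_entropy(P, L.float(), reduction="mean")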
The resulting synthesized optimization loss function is:
L = L_s^adv + L_e^adv + α·L_G
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start and end positions, and α ∈ (0,1) is a hyper-parameter controlling the order of magnitude of the loss term.
Compared with the prior art, the method has the following advantages: a new learning framework, DER, is proposed for the ED problem, comprising two newly designed modules, a Gaussian kernel function module and an adversarial learning module, which improve the ability to distinguish words inside a trigger from words outside it; the method is the first to introduce a Gaussian kernel function into ED, and this idea is orthogonal to existing SOTA solutions for ED; extensive experiments on a standard dataset show that the DER model effectively alleviates the trigger fragment recognition problem.
Drawings
FIG. 1 is a diagram of an exemplary scenario in an embodiment of the present invention;
fig. 2 shows a schematic flow diagram of a framework of an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present embodiment follows the terminology of the Automatic Content Extraction (ACE) shared task. An event mention is a phrase or sentence that describes an event and comprises a trigger word and the corresponding argument elements; the trigger word is the word that most clearly expresses the event mention.
The standard ED task on ACE comprises event trigger identification and the corresponding type classification. Consider the example in FIG. 1: "death penalty" is an event trigger whose event type is composed of the upper type "Justice" and the subtype "Execute", forming the combined type "Justice:Execute". Thus, given this sentence, ED should predict: 1) "death penalty" is an event trigger with event type "Justice:Execute"; 2) "convicted" is an event trigger with event type "Justice:Convict".
Model framework
Fig. 2 is a schematic diagram of the overall DER framework.
In order to handle the trigger span detection problem properly, the present embodiment designs two new modules to upgrade the framework. In short, the DER model consists of three components: an encoding module, a Gaussian kernel function module and an adversarial learning module.
A method of event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions (a decoding sketch is given after this list);
step 6, detecting events in the text according to the predicted trigger words.
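For step 5, a possible decoding sketch is given below; the nearest-end pairing rule and the probability threshold are assumptions, since the method only states that predicted start and end positions are combined.

def decode_triggers(p_start, p_end, threshold=0.5):
    # p_start, p_end: (l, n) per-word, per-type probabilities (tensors or nested
    # lists); returns (start_index, end_index, type_index) trigger candidates.
    l, n = len(p_start), len(p_start[0])
    triggers = []
    for k in range(n):                                   # each event type
        starts = [i for i in range(l) if p_start[i][k] > threshold]
        ends = [i for i in range(l) if p_end[i][k] > threshold]
        for s in starts:
            later_ends = [e for e in ends if e >= s]
            if later_ends:
                triggers.append((s, min(later_ends), k))  # nearest matching end
    return triggers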
Specifically, in the encoding module, a BERT-based language representation model is used as the encoder. BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation. Meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge; using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] ··· [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
Specifically, in the Gaussian kernel function module:
After passing through the encoding module, a representation E ∈ R^{l×d} of the words is obtained, where R denotes the real space, d the dimension, and l the sequence length of the input text. The whole Gaussian kernel mapping process consists of two parts, an average word vector representation and a kernel function:
p(X) = N(X | mean(E), K_EE)    (1)
X = f(E)
where p denotes the prior probability, i.e., a Gaussian probability distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the mapping of the word vector representations by a fully connected network, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)    (2)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector.
Interpolation is used to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}    (3)
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order. When the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution; at this point:
p(U) = N(mean(I), K_II)    (4)
where K_II is a kernel matrix of the same form as K_EE.
Furthermore, the adversarial learning module adopts the following adversarial learning steps:
step 401, constructing the way the random perturbation is generated in adversarial learning, expressed as:
r_adv = argmin_{r, ‖r‖₂ ≤ ε} L(E + r; θ)    (5)
where r_adv denotes the adversarial perturbation finally added to the input, r denotes a random perturbation, ‖·‖₂ denotes the two-norm, ε denotes a hyper-parameter, L denotes the loss function, and θ denotes the parameters to be learned in the model;
step 402, generating the random perturbation using a linear approximation, expressed as:
r_adv = −ε g / ‖g‖    (6)
g = ∇_E L(f(E), y)    (7)
where g denotes the gradient of the loss function L(f(E), y) with respect to the input word vector E, ∇ denotes the gradient operation, f denotes the model operation, and y denotes the sample label;
step 403, adding the random perturbation to the word vector representation E ∈ R^d of the encoding layer, expressed as:
E + r_adv
step 404, using the randomly perturbed word vector representation as the input of the loss function.
Further, the synthesized loss function during model training is:
L = L_s^adv + L_e^adv + α·L_G    (8)
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and α ∈ (0,1) denotes a hyper-parameter controlling the order of magnitude of the loss term; the loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]    (9)
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, ‖ is simply the separator in the KL notation with no further meaning, and p(U|X) denotes the conditional probability, computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)    (10)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose;
q(U|X) is the posterior probability, which satisfies a Gaussian distribution; its calculation is based on a neural network, as follows:
q(U|X) = N(U | μ, σ²)    (11)
where μ and σ² denote the mean and variance output by the neural network, i.e., the parameters of the q distribution;
the loss value of the adversarial learning module is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)    (12)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts;
the loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]    (13)
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
Experimental setup
Dataset. This embodiment is evaluated on a standard benchmark, the 2005 Automatic Content Extraction dataset (ACE 2005). Statistics of this dataset are shown in Table 1; it is the most widely used dataset in event-related tasks and contains 599 documents. All events are annotated with 8 upper types and 33 subtypes, and this embodiment evaluates classification over the 33 combined types. Following previous studies, the 599 documents are split into 529 training documents, 30 validation documents and 40 test documents.
Table 1. ACE 2005 dataset statistics (provided as an image in the original publication).
Metrics. Evaluation follows the standard criteria for event detection, which cover two aspects: identification and classification. An event trigger is correctly identified if its span matches a gold trigger; it is correctly classified if both the trigger and its event type match the gold trigger and category. Micro-averaged precision (Pre), recall (Rec) and F1 score (F1) are reported for all evaluations.
Baselines. The following nine methods are used for comparison. (1) Feature-based methods, covering three representative approaches: Cross-Event utilizes document-level information with complex features, MaxEnt adopts only manually designed features, and Combined-PSL exploits global information with a probabilistic soft logic model. (2) Augmentation-based methods, covering two representative approaches: GMLATT employs a gated cross-lingual attention mechanism to exploit the complementary information of multilingual data, and AD-DMBERT uses an adversarial model to obtain more training data. (3) Neural-based methods, covering four representative approaches: DMCNN uses a CNN to extract features automatically, GCN-ED uses a graph convolutional network to capture syntactic information, JMEE exploits multiple sources of information for more accurate context modeling, and DISTILL introduces a delta-learning method to refine generalized knowledge.
Overall results
Table 2. Overall results on ACE 2005 (provided as an image in the original publication). Apart from the DER results, all results are taken from the original papers.
On all tasks, the DER model outperforms the comparison methods and achieves SOTA performance on all metrics, which demonstrates the superiority of the proposed modules and their effectiveness in alleviating the trigger fragment detection problem. In particular, on the type classification task, the DER model outperforms the augmentation-based SOTA method AD-DMBERT by 0.2% and the neural-based SOTA method DISTILL by 1.3% in terms of F1.
Among the augmentation-based methods, although performance in this branch is relatively good, the generated samples are inaccurate and unbalanced, which leads to overfitting: the model performs well only on samples already present in the training data but lacks generalization, resulting in high precision but low recall. Compared with the augmentation-based approaches, the DER model requires no additional or external data source. Among the representation-based methods, many complex structures are designed for the ED task, which also leads to overfitting; in contrast, the DER model does not rely on complex structures.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (5)

1. A method for event detection based on discriminative word vector representation, the method comprising:
step 1, constructing a discriminative word vector representation model comprising an encoding module, a Gaussian kernel function module and an adversarial learning module, wherein the encoding module generates for each word in a sentence a representation in a high-dimensional vector space, the Gaussian kernel function module increases the difference between the representations of the words that make up a trigger word and the other, outside words, and the adversarial learning module improves the generalized recognition of positive trigger-word samples;
step 2, in the encoding module, embedding each word of a sentence into a contextual word vector representation in the high-dimensional vector space using a pre-trained BERT model, so as to provide an input containing semantic features, while further enriching the information contained in the word representations with external knowledge of the predefined event types;
step 3, in the Gaussian kernel function module, applying a Gaussian kernel transformation to the encoded word vector representations and constraining the distribution of the word vectors to a Gaussian distribution, thereby clustering the word vectors in the high-dimensional space and improving their ability to encode the difference between trigger words and non-trigger words;
step 4, in the adversarial learning module, adding random perturbations to the word vectors during training so that the model pays more attention to the regular semantic information in the training samples, thereby improving the model's generalization over positive trigger-word samples;
step 5, using the trained discriminative word vector representation model to predict, for each event type in turn, whether each word is the start position or the end position of a trigger word of that type, and then outputting all possible trigger words by combining the predicted start and end positions;
step 6, detecting events in the text according to the predicted trigger words.
2. The method of claim 1, wherein, in the encoding module, a BERT-based language representation model is used as the encoder; BERT consists of a stack of 12 identical Transformer blocks, each of which handles word embedding, position embedding and segment embedding; after all blocks have computed the three types of embedding in turn, BERT outputs their sum as the representation; meanwhile, in the encoding module, the self-attention mechanism in BERT is enhanced with the predefined trigger word types as external knowledge, and, using only the upper event types as this external knowledge, all upper event types are concatenated to each sentence in the following form:
[CLS] sentence [SEP] UT_1 [SEP] … [SEP] UT_m [SEP],
where [CLS] denotes the start token in BERT, sentence denotes the input sentence, [SEP] denotes the separator token in BERT, UT is an abbreviation for upper-type and denotes the upper-level type of an event, and m is the number of upper-level event categories in the dataset.
3. The method of claim 2, wherein the Gaussian kernel function module comprises:
after the encoding module, obtaining a representation E ∈ R^{l×d} of the words, where R denotes the real space, d the dimension, and l the sequence length of the input text, the whole Gaussian kernel mapping process consisting of two parts, an average word vector representation and a kernel function:
p(X) = N(X | mean(E), K_EE)    (1)
X = f(E)
where p denotes the prior probability, i.e., a Gaussian probability distribution, N denotes the Gaussian distribution, mean denotes the average word vector representation over the target sequence, f denotes the mapping of the word vector representations by a fully connected network, and K_EE denotes the kernel matrix, defined as:
[K_EE]_ij = k(E_i, E_j) = exp(−γ‖E_i − E_j‖²)    (2)
where k denotes the kernel function, E_i and E_j are the word vectors at positions i and j in the text, exp denotes the natural exponential function, γ denotes a hyper-parameter, and ‖·‖ denotes the norm of a vector;
using interpolation to draw a data sample of a certain scale from the word vectors:
U = {f(I_1), f(I_2), ..., f(I_k)}    (3)
where U denotes the word vector sequence obtained after interpolation and I ∈ R^d denotes word vectors drawn out of sequence order; when the interpolation reaches a certain scale, its probability distribution also follows a Gaussian distribution, at which point:
p(U) = N(mean(I), K_II)    (4)
where K_II is a kernel matrix of the same form as K_EE.
4. The event detection method based on discriminative word vector representation as claimed in claim 3, wherein the adversarial learning module adopts the following adversarial learning steps:
step 401, constructing the way the random perturbation is generated in adversarial learning, expressed as:
r_adv = argmin_{r, ‖r‖₂ ≤ ε} L(E + r; θ)    (5)
where r_adv denotes the adversarial perturbation finally added to the input, r denotes a random perturbation, ‖·‖₂ denotes the two-norm, ε denotes a hyper-parameter, L denotes the loss function, and θ denotes the parameters to be learned in the model;
step 402, generating the random perturbation using a linear approximation, expressed as:
r_adv = −ε g / ‖g‖    (6)
g = ∇_E L(f(E), y)    (7)
where g denotes the gradient of the loss function L(f(E), y) with respect to the input word vector E, ∇ denotes the gradient operation, f denotes the model operation, and y denotes the sample label;
step 403, adding the random perturbation to the word vector representation E ∈ R^d of the encoding layer, expressed as:
E + r_adv
step 404, using the randomly perturbed word vector representation as the input of the loss function.
5. The method of claim 3, wherein the synthesized loss function during model training is:
L = L_s^adv + L_e^adv + α·L_G    (8)
where L_G denotes the loss value computed by the Gaussian kernel function module, L_s^adv and L_e^adv respectively denote the loss values computed by the adversarial learning module at the start position and the end position, and α ∈ (0,1) denotes a hyper-parameter controlling the order of magnitude of the loss term; the loss value of the Gaussian kernel module is computed as follows:
L_G = E[−ln p(U|X)] + KL[q(U|X) ‖ p(U)]    (9)
where E denotes expectation, ln denotes the logarithm with the natural base, KL denotes the relative entropy, ‖ is simply the separator in the KL notation with no further meaning, and p(U|X) denotes the conditional probability, computed as follows:
p(U|X) = N(U | mean(I) + K_IE K_EE^{−1}(X − mean(E)), K_II − K_IE K_EE^{−1} K_IE^T)    (10)
where [K_IE]_ij = k(I_i, E_j), the superscript −1 denotes the matrix inverse, and the superscript T denotes the transpose;
q(U|X) is the posterior probability, which satisfies a Gaussian distribution; its calculation is based on a neural network, as follows:
q(U|X) = N(U | μ, σ²)    (11)
where μ and σ² denote the mean and variance output by the neural network, i.e., the parameters of the q distribution;
the loss value of the adversarial learning module is computed as follows:
L^adv = γ·L_E(P, L) + (1 − γ)·L_E(P_adv, L)    (12)
where L_E denotes the loss value produced by the event extraction process, P denotes the probability with which the model predicts the word, P_adv denotes the probability with which the adversarial learning module predicts the word, L denotes the true label of the word to be predicted, and γ ∈ (0,1) is a hyper-parameter balancing the weight of the two parts;
the loss L_E follows a binary cross-entropy loss function and is computed as:
L_E(P, L) = −(1/(|T||S|)) Σ_{k=1}^{n} Σ_{i=1}^{|S|} [ L_i^k ln P_i^k + (1 − L_i^k) ln(1 − P_i^k) ]    (13)
where L_E denotes the loss value computed by the binary cross-entropy loss function, P denotes the predicted probabilities of the words in the sentence, and L denotes the set of true labels; T is the set of event types, S is the selected sentence, |·| denotes the number of elements, 1 ≤ k ≤ n, and n is the number of event types.
CN202110726463.6A 2021-06-29 2021-06-29 Event detection method based on differential word vector representation Active CN113282714B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110726463.6A CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110726463.6A CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Publications (2)

Publication Number Publication Date
CN113282714A CN113282714A (en) 2021-08-20
CN113282714B true CN113282714B (en) 2022-09-20

Family

ID=77286273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726463.6A Active CN113282714B (en) 2021-06-29 2021-06-29 Event detection method based on differential word vector representation

Country Status (1)

Country Link
CN (1) CN113282714B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468333B (en) * 2021-09-02 2021-11-19 华东交通大学 Event detection method and system fusing hierarchical category information
CN113806490B (en) * 2021-09-27 2023-06-13 中国人民解放军国防科技大学 Text universal trigger generation system and method based on BERT sampling
CN114707517B (en) * 2022-04-01 2024-05-03 中国人民解放军国防科技大学 Target tracking method based on open source data event extraction

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965819B (en) * 2015-07-12 2017-12-26 大连理工大学 A kind of biomedical event trigger word recognition methods based on syntax term vector
CN112148832B (en) * 2019-06-26 2022-11-29 天津大学 Event detection method of dual self-attention network based on label perception
CN111767402B (en) * 2020-07-03 2022-04-05 北京邮电大学 Limited domain event detection method based on counterstudy

Also Published As

Publication number Publication date
CN113282714A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN110209823B (en) Multi-label text classification method and system
CN108984526B (en) Document theme vector extraction method based on deep learning
CN113282714B (en) Event detection method based on differential word vector representation
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN111460820A (en) Network space security domain named entity recognition method and device based on pre-training model BERT
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
Isa et al. Indobert for indonesian fake news detection
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN113268974A (en) Method, device and equipment for marking pronunciations of polyphones and storage medium
Prabhakar et al. Performance analysis of hybrid deep learning models with attention mechanism positioning and focal loss for text classification
CN114722818A (en) Named entity recognition model based on anti-migration learning
Tang et al. Interpretability rules: Jointly bootstrapping a neural relation extractorwith an explanation decoder
Tkachenko et al. Neural Morphological Tagging for Estonian.
CN114357166A (en) Text classification method based on deep learning
Devkota et al. Knowledge of the ancestors: Intelligent ontology-aware annotation of biological literature using semantic similarity
CN112528657A (en) Text intention recognition method and device based on bidirectional LSTM, server and medium
Prajapati et al. Automatic Question Tagging using Machine Learning and Deep learning Algorithms
Lee et al. A two-level recurrent neural network language model based on the continuous Bag-of-Words model for sentence classification
CN113987090B (en) Sentence-in-sentence entity relationship model training method and sentence-in-sentence entity relationship identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant