CN115062146A - Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention - Google Patents

Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention

Info

Publication number
CN115062146A
Authority
CN
China
Prior art keywords
event
extraction
representation
decoder
head attention
Prior art date
Legal status
Granted
Application number
CN202210656832.3A
Other languages
Chinese (zh)
Other versions
CN115062146B (en)
Inventor
Gan Ling (甘玲)
Zhang Zaijun (张在军)
Liu Ju (刘菊)
Hu Liuhui (胡柳慧)
Current Assignee
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202210656832.3A
Publication of CN115062146A
Application granted
Publication of CN115062146B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention, which belongs to the field of natural language processing. A BERT encoder serves as the text encoder, generating text representations conditioned on token context that contain rich textual information; the event type detection decoder classifies events based on a BERT text classification model; the trigger word extraction decoder extracts trigger words for the detected event types by exploiting the conditional dependency between event type detection and trigger word extraction; the event element extraction decoder extracts event elements by combining multi-head attention with a bidirectional LSTM layer; and the loss weight adjustment module combines the multiple loss functions and uses the covariance uncertainty of the multiple objectives to dynamically assign a weight to each task.

Description

Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention
Technical Field
The invention belongs to the field of natural language processing and relates to a Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention.
Background
With the development of Internet technology, a large amount of information is stored in computers in text form, and mining valuable information from it has become a core problem of information extraction. Event extraction is one of the research hotspots in the field of information extraction; its core task is to extract information of specified types from unstructured natural-language text and express it in semi-structured or structured form.
At present, most mainstream research methods are based on neural networks, with features extracted by the networks. Chen et al. proposed a dynamic multi-pooling convolutional neural network, using a dynamic multi-pooling layer to retain information according to trigger words and event elements; Zeng et al. proposed extracting sentence features with a bidirectional LSTM and a CRF, and semantic features with a convolutional neural network, for Chinese event extraction; Chen et al. proposed a distant supervision method that uses a knowledge base to generate large-scale annotated data, applied to the financial field; Liu et al. proposed a joint multi-event extraction framework for overlapping event extraction; and Yang et al. proposed separating event elements by role to solve the role overlap problem.
The prior art has the following problems: (1) for Chinese financial event element extraction with complex application scenes, events contain many unregistered company names and professional vocabulary; because the extracted features are insufficient, recognition recall is low. (2) In a multi-task joint learning model, the underlying network parameters are shared, so model convergence tends to be biased toward the task with the larger loss weight, causing error propagation.
Disclosure of Invention
In view of this, the present invention aims to provide a Chinese overlapping event extraction system based on BiLSTM and multi-head attention, which identifies event elements with multi-head self-attention fused with a bidirectional LSTM; it performs feature extraction better, obtains richer semantic information, and solves the low event element recognition recall caused by insufficient extracted features in a multi-task joint learning model.
In order to achieve the purpose, the invention provides the following technical scheme:
a Chinese overlapping event extraction system based on BilSTM combined with multi-head attention comprises a Bert encoder, an event type detection decoder, a trigger word extraction decoder, an event element extraction decoder and a loss weight adjustment module;
the Bert encoder is used as a text encoder to generate text representation with the marked context as a condition and contains rich text information;
the event type detection decoder classifies events based on a Bert text classification model;
the trigger word extraction decoder extracts the trigger words according to the acquired event types through condition dependence between event type detection and trigger word extraction;
the event element extraction decoder extracts event elements by combining multi-head attention with a bidirectional LSTM layer;
the loss weight adjustment module dynamically allocates a weight for each task using the covariance uncertainty of the multiple targets in conjunction with the multiple loss functions.
Further, based on the BERT text classification model, the event type detection decoder takes the first token position of the last-layer output as the sentence representation and attaches a fully connected layer for classification, specifically comprising the following steps:
S11: first, initialize an embedding matrix $C \in \mathbb{R}^{|E| \times d}$ as the type embedding, where $E$ denotes the set of event types and $d$ is the word-vector dimension ($d = 768$);
S12: measure the correlation between each candidate type $c \in C$ and the token representations through a similarity function $\delta$;
S13: predict the event type by measuring, with the same similarity function $\delta$, the similarity between the adaptive sentence representation $s_c$ and the type embedding $c$.
Further, the trigger word extraction decoder uses a conditional fusion function to model the conditional dependency between event type detection and trigger word extraction, and further refines the representation for trigger word extraction through a self-attention layer.
Further, the event element extraction decoder first uses a conditional fusion function $\phi$ to model the dependency among event types, trigger words, and event elements, and then performs feature extraction; the representation for event element extraction is refined using multi-head attention combined with a bidirectional LSTM layer:

$Z_{ct} = [Z_{ct'}; P]$  (1)
$Y_{ct} = [Y_{ct'}; P]$  (2)
$X_{ct} = [Z_{ct}; Y_{ct}]$  (3)

where $P \in \mathbb{R}^{n \times d_p}$ is a relative position embedding with dimension $d_p$, $Z_{ct}$ is the matrix representation after the bidirectional LSTM layer, $Y_{ct}$ is the matrix representation after the multi-head attention layer, and $X_{ct}$ is the matrix representation after the two network branches are fused and concatenated, as shown in equations (1) to (3); regularization is then used for dimensionality reduction;
finally, an indicator function $I(r, c)$ is used to indicate, according to the predefined event schema, whether role $r$ belongs to event type $c$, as shown in equation (4):

$I(r, c) = 1$ if role $r$ is defined for event type $c$; otherwise $I(r, c) = 0$.  (4)
Event elements are predicted using a pair of taggers, where $x_i$ denotes the $i$-th token in $X_c$; the start and end positions of event elements are computed as shown in equations (5) and (6):

[equations (5) and (6): scoring functions for the element start and end positions; rendered as images in the original]

A position whose start score exceeds the threshold $\xi_4$ is taken as a predicted start position, and a position whose end score exceeds $\xi_5$ as a predicted end position, where $\xi_4, \xi_5 \in [0, 1]$ are scalar thresholds; by enumerating all start positions and searching for the nearest end position in the sentence, the tokens between a start position and its end position form a complete event element.
Further, the loss weight adjustment module is implemented as follows:
manually set the initialization weights, combine the multiple loss functions, and use the covariance uncertainty of the multiple objectives to re-assign a weight to each task, as shown in equation (7):

$l' = \frac{1}{2\sigma^2}\, l + \log\left(1 + \frac{1}{\sigma^2}\right)$  (7)

where $\sigma$ denotes the standard deviation of the Gaussian distribution, $l$ denotes the loss of a single task, and $l'$ denotes that task's loss after the weight update.
The invention has the following beneficial effects: it improves the extraction of character information within sentences, strengthens the extraction of sentence structure, and can propagate information over long distances, thereby extracting features better and obtaining richer semantic information. Considering that large differences in loss magnitude among the tasks of a joint multi-task learning model bias the convergence direction toward a particular task, the invention adopts a method of dynamically setting loss weights: a weight is re-assigned to each task according to its loss proportion so that all task losses are of the same magnitude, which optimizes the convergence direction of the whole model and improves its generalization.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a structural diagram of the Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention according to the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged, or reduced, and they do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, it should be understood that terms indicating an orientation or positional relationship, such as "upper", "lower", "left", "right", "front", and "rear", are based on the orientation or positional relationship shown in the drawings, are used only for convenience and simplification of description, and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation; such terms are therefore illustrative only, are not to be construed as limiting the present invention, and their specific meaning may be understood by those skilled in the art according to the specific situation.
The invention relates to a Chinese overlapping event extraction model based on BiLSTM combined with multi-head attention, which comprises a shared BERT encoder, an event type detection decoder, a trigger word extraction decoder, an event element extraction decoder, and a loss weight adjustment module. The structure is shown in FIG. 1.
1) BERT encoder
To share the text representation of each sentence, the overall model employs BERT as the text encoder. BERT is a bidirectional language representation model based on the Transformer framework; it generates text representations conditioned on token context that contain rich textual information.
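By way of a non-limiting illustration, the shared encoding step can be sketched with the HuggingFace transformers library, assuming the bert-base-chinese checkpoint named in the experiments below; the example sentence and variable names are illustrative only:

```python
# A minimal sketch of the shared BERT encoding step, assuming the HuggingFace
# `transformers` library and the bert-base-chinese checkpoint named in the
# experiments; the sentence and variable names are illustrative only.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

sentence = "某公司宣布收购一家初创企业。"  # example financial-domain sentence
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

token_repr = outputs.last_hidden_state   # (1, seq_len, 768): per-token contextual representation
sentence_repr = token_repr[:, 0, :]      # first-token ([CLS]) position as the sentence representation
```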
2) Event type detection decoder
Based on the BERT text classification model, the first token position of the last-layer output is taken as the sentence representation, and a fully connected layer is attached for classification. The decoder comprises an attention layer, a similarity prediction layer, and an event type classification layer.
First, an embedding matrix $C \in \mathbb{R}^{|E| \times d}$ is initialized as the type embedding, where $E$ denotes the set of event types and $d$ is the word-vector dimension ($d = 768$). The correlation between each candidate type $c \in C$ and the token representations is measured by a similarity function $\delta$. Finally, the event type is predicted by measuring, with the same similarity function $\delta$, the similarity between the adaptive sentence representation $s_c$ and the type embedding $c$.
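A minimal sketch of this detection step follows, reusing the sentence_repr from the encoder sketch above and assuming a learned type-embedding table and a bilinear form for the similarity function δ (the patent does not fix δ's exact form):

```python
import torch
import torch.nn as nn

class EventTypeDetector(nn.Module):
    """Sketch of the type detection decoder: a type-embedding table plus a
    similarity function delta; the bilinear form of delta is an assumption."""
    def __init__(self, num_types: int, d: int = 768):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, d)   # C in R^{|E| x d}
        self.delta = nn.Bilinear(d, d, 1)            # similarity function delta (assumed bilinear)

    def forward(self, sentence_repr: torch.Tensor) -> torch.Tensor:
        # sentence_repr: (batch, d), e.g. the [CLS] representation from BERT
        batch, d = sentence_repr.shape
        types = self.type_emb.weight                 # (|E|, d)
        s = sentence_repr.unsqueeze(1).expand(batch, types.size(0), d)
        c = types.unsqueeze(0).expand_as(s)
        scores = self.delta(s.reshape(-1, d), c.reshape(-1, d)).view(batch, -1)
        return torch.sigmoid(scores)                 # per-type probability (multi-label, for overlapping events)
```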
3) Trigger word extraction decoder
The decoder comprises a conditional fusion layer, a self-attention layer, and a normalization layer. A conditional fusion function is used to model the conditional dependency between event type detection and trigger word extraction, and the representation for trigger word extraction is further refined through a self-attention layer.
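The patent does not give the exact form of the conditional fusion function; one plausible instantiation, sketched below purely as an assumption, is conditional layer normalization in which the detected type embedding modulates the gain and bias of the token representations:

```python
import torch
import torch.nn as nn

class ConditionalFusion(nn.Module):
    """Sketch of a conditional fusion function phi as conditional layer
    normalization; this concrete form is an assumption, not taken from the patent."""
    def __init__(self, d: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        self.gain = nn.Linear(d, d)   # condition -> per-dimension scale
        self.bias = nn.Linear(d, d)   # condition -> per-dimension shift

    def forward(self, tokens: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d); condition: (batch, d), e.g. a detected type embedding
        g = self.gain(condition).unsqueeze(1)   # (batch, 1, d)
        b = self.bias(condition).unsqueeze(1)
        return (1 + g) * self.norm(tokens) + b  # type-conditioned token representation
```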
4) Event element extraction decoder
The decoder comprises a conditional fusion layer, a BiLSTM layer, a multi-head self-attention layer, a normalization layer, and an indicator function layer. An extraction scheme that identifies event elements with self-attention alone can overlook important information; a bidirectional LSTM extracts character information within sentences well but does not consider sentence structure and parallelizes poorly, whereas multi-head attention extracts sentence structure well and can propagate information over long distances, compensating for the weaknesses of the LSTM. The invention therefore combines multi-head attention with a bidirectional LSTM layer to further refine the representation for event element extraction, and uses regularization for dimensionality reduction after the two network branches are fused and concatenated.
In this embodiment, a conditional fusion function $\phi$ is first used to model the dependency among event types, trigger words, and event elements, after which feature extraction is performed. The representation for event element extraction is further refined using multi-head attention combined with a bidirectional LSTM layer:

$Z_{ct} = [Z_{ct'}; P]$  (1)
$Y_{ct} = [Y_{ct'}; P]$  (2)
$X_{ct} = [Z_{ct}; Y_{ct}]$  (3)

where $P \in \mathbb{R}^{n \times d_p}$ is a relative position embedding with dimension $d_p$, $Z_{ct}$ is the matrix representation after the bidirectional LSTM layer, $Y_{ct}$ is the matrix representation after the multi-head attention layer, and $X_{ct}$ is the matrix representation after the two network branches are fused and concatenated, as shown in equations (1) to (3); dimensionality reduction is then performed using regularization.
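A minimal sketch of the fusion in equations (1) to (3) follows; the dimensions, the learned relative-position table, and the final linear reduction are illustrative assumptions rather than the patent's exact configuration:

```python
import torch
import torch.nn as nn

class ElementFeatureFusion(nn.Module):
    """Sketch of equations (1)-(3): a BiLSTM branch and a multi-head attention
    branch, each concatenated with a relative position embedding P, then fused.
    Dimensions and the final projection are illustrative assumptions."""
    def __init__(self, d: int = 768, d_p: int = 64, heads: int = 8, max_len: int = 512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_p)                 # P, one vector per position
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.proj = nn.Linear(2 * (d + d_p), d)                   # reduction step after fusion (assumed linear)
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d) token representations after conditional fusion
        batch, n, _ = x.shape
        p = self.pos_emb(torch.arange(n, device=x.device)).expand(batch, n, -1)
        z, _ = self.bilstm(x)                                     # BiLSTM branch
        y, _ = self.mha(x, x, x)                                  # multi-head attention branch
        z_ct = torch.cat([z, p], dim=-1)                          # eq. (1)
        y_ct = torch.cat([y, p], dim=-1)                          # eq. (2)
        x_ct = torch.cat([z_ct, y_ct], dim=-1)                    # eq. (3)
        return self.norm(self.proj(x_ct))                         # fused, reduced representation
```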
Finally, an indicator function $I(r, c)$ is used to indicate, according to the predefined event schema, whether role $r$ belongs to event type $c$, as shown in equation (4):

$I(r, c) = 1$ if role $r$ is defined for event type $c$; otherwise $I(r, c) = 0$.  (4)
Event elements are likewise predicted using a pair of taggers, where $x_i$ denotes the $i$-th token in $X_c$; the start and end positions of event elements are computed as shown in equations (5) and (6):

[equations (5) and (6): scoring functions for the element start and end positions; rendered as images in the original]

A position whose start score exceeds the threshold $\xi_4$ is taken as a predicted start position, and a position whose end score exceeds $\xi_5$ as a predicted end position, where $\xi_4, \xi_5 \in [0, 1]$ are scalar thresholds. By enumerating all start positions and searching for the nearest end position in the sentence, the tokens between a start position and its end position form a complete event element.
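As a hedged sketch of this thresholded span decoding (the scoring functions of equations (5) and (6) survive only as images, so per-token scores are simply assumed as inputs here):

```python
import torch

def decode_spans(start_scores: torch.Tensor,
                 end_scores: torch.Tensor,
                 xi4: float = 0.5, xi5: float = 0.5):
    """Enumerate start positions whose score exceeds xi4 and pair each with the
    nearest end position whose score exceeds xi5; the threshold values and the
    upstream scoring functions are assumptions."""
    spans = []
    starts = (start_scores > xi4).nonzero(as_tuple=True)[0].tolist()
    ends = (end_scores > xi5).nonzero(as_tuple=True)[0].tolist()
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, min(candidates)))   # nearest end position in the sentence
    return spans

# Toy usage with hypothetical per-token scores for a 6-token sentence:
start = torch.tensor([0.9, 0.1, 0.2, 0.7, 0.1, 0.0])
end = torch.tensor([0.1, 0.8, 0.1, 0.1, 0.9, 0.0])
print(decode_spans(start, end))   # [(0, 1), (3, 4)]
```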
5) Loss weight adjustment module
The loss magnitudes of the tasks differ greatly; when the model tends to fit one task, that task dominates the convergence direction, the overall training effect often deteriorates, and cascading errors arise.
The specific operation of the loss weight adjustment module of this embodiment is to manually set the initialization weights, combine the multiple loss functions, and use the covariance uncertainty of the multiple objectives to re-assign a weight to each task, as shown in equation (7):

$l' = \frac{1}{2\sigma^2}\, l + \log\left(1 + \frac{1}{\sigma^2}\right)$  (7)

where $\sigma$ denotes the standard deviation of the Gaussian distribution, $l$ denotes the loss of a single task, and $l'$ denotes that task's loss after the weight update.
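Equation (7) can be sketched as a trainable module with one learnable log-variance per task; the log σ² parameterization is a common numerical-stability choice and is an assumption, not taken from the patent:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Sketch of equation (7): l' = l / (2 sigma^2) + log(1 + 1 / sigma^2),
    with one learnable log-variance per task; the parameterization is an assumption."""
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_tasks))  # log sigma_k^2 per task

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        # task_losses: (num_tasks,), e.g. [type loss, trigger loss, element loss]
        inv_var = torch.exp(-self.log_var)                   # 1 / sigma_k^2
        weighted = 0.5 * inv_var * task_losses + torch.log(1 + inv_var)
        return weighted.sum()                                # total multi-task loss

# Usage: total = UncertaintyWeightedLoss(3)(torch.stack([l_type, l_trig, l_elem]))
```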
In this embodiment, experiments are conducted on the financial event extraction dataset FewFC, which is divided into a training set, a validation set, and a test set at a ratio of 8:1:1. The data division and data volumes are shown in Table 1.
TABLE 1
[Table 1: dataset split statistics; rendered as an image in the original]
In this embodiment, precision P, recall R, and the composite score F1 are used as evaluation indexes, computed as shown in equations (18) to (20):

$P = \frac{TP}{TP + FP}$  (18)
$R = \frac{TP}{TP + FN}$  (19)
$F1 = \frac{2 \times P \times R}{P + R}$  (20)

where TP denotes the number of positive instances predicted as positive, FN the number of positive instances predicted as negative, and FP the number of negative instances predicted as positive.
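For concreteness, a toy computation of equations (18) to (20) on hypothetical counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Equations (18)-(20) on raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives
print(precision_recall_f1(80, 20, 10))  # (0.8, 0.888..., 0.842...)
```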
The experimental environment of this embodiment is based on the PyTorch framework; the model is trained on an NVIDIA Tesla P100 GPU, using the Chinese bert-base-chinese model (12 layers, 768 hidden units, and 12 attention heads) as the text encoder and an Adam optimizer. The hyper-parameters of the various methods are shown in Table 2.
TABLE 2
[Table 2: hyper-parameter settings; rendered as an image in the original]
For the BERT parameters, the initial learning rate is tuned within [1e-5, 5e-5], with a learning-rate warm-up proportion of 10%. The number of attention heads of the model is tuned within {2, 4, 8, 16}, the number of bidirectional LSTM layers within {1, 2, 3}, and the initial weight values are [1, 1, 0.2].
Finally, the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions without departing from their spirit and scope, and all such changes should be covered by the claims of the present invention.

Claims (5)

1. A Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention, characterized in that it comprises a BERT encoder, an event type detection decoder, a trigger word extraction decoder, an event element extraction decoder, and a loss weight adjustment module;
the BERT encoder serves as the text encoder, generating text representations conditioned on token context that contain rich textual information;
the event type detection decoder classifies events based on a BERT text classification model;
the trigger word extraction decoder extracts trigger words for the detected event types by exploiting the conditional dependency between event type detection and trigger word extraction;
the event element extraction decoder extracts event elements by combining multi-head attention with a bidirectional LSTM layer;
the loss weight adjustment module combines the multiple loss functions and uses the covariance uncertainty of the multiple objectives to dynamically assign a weight to each task.
2. The Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention as claimed in claim 1, characterized in that: based on the BERT text classification model, the event type detection decoder takes the first token position of the last-layer output as the sentence representation and attaches a fully connected layer for classification, specifically comprising the following steps:
S11: first, initialize an embedding matrix $C \in \mathbb{R}^{|E| \times d}$ as the type embedding, where $E$ denotes the set of event types and $d$ is the word-vector dimension ($d = 768$);
S12: measure the correlation between each candidate type $c \in C$ and the token representations through a similarity function $\delta$;
S13: predict the event type by measuring, with the same similarity function $\delta$, the similarity between the adaptive sentence representation $s_c$ and the type embedding $c$.
3. The Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention as claimed in claim 1, characterized in that: the trigger word extraction decoder uses a conditional fusion function to model the conditional dependency between event type detection and trigger word extraction, and further refines the representation for trigger word extraction through a self-attention layer.
4. The Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention as claimed in claim 1, characterized in that: the event element extraction decoder first uses a conditional fusion function $\phi$ to model the dependency among event types, trigger words, and event elements, and then performs feature extraction; the representation for event element extraction is refined using multi-head attention combined with a bidirectional LSTM layer:

$Z_{ct} = [Z_{ct'}; P]$  (1)
$Y_{ct} = [Y_{ct'}; P]$  (2)
$X_{ct} = [Z_{ct}; Y_{ct}]$  (3)

where $P \in \mathbb{R}^{n \times d_p}$ is a relative position embedding with dimension $d_p$, $Z_{ct}$ is the matrix representation after the bidirectional LSTM layer, $Y_{ct}$ is the matrix representation after the multi-head attention layer, and $X_{ct}$ is the matrix representation after the two network branches are fused and concatenated, as shown in equations (1) to (3); regularization is then used for dimensionality reduction;

finally, an indicator function $I(r, c)$ is used to indicate, according to the predefined event schema, whether role $r$ belongs to event type $c$, as shown in equation (4):

$I(r, c) = 1$ if role $r$ is defined for event type $c$; otherwise $I(r, c) = 0$.  (4)

Event elements are predicted using a pair of taggers, where $x_i$ denotes the $i$-th token in $X_c$; the start and end positions of event elements are computed as shown in equations (5) and (6):

[equations (5) and (6): scoring functions for the element start and end positions; rendered as images in the original]

A position whose start score exceeds the threshold $\xi_4$ is taken as a predicted start position, and a position whose end score exceeds $\xi_5$ as a predicted end position, where $\xi_4, \xi_5 \in [0, 1]$ are scalar thresholds; by enumerating all start positions and searching for the nearest end position in the sentence, the tokens between a start position and its end position form a complete event element.
5. The Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention as claimed in claim 1, characterized in that the loss weight adjustment module is implemented as follows:
manually set the initialization weights, combine the multiple loss functions, and use the covariance uncertainty of the multiple objectives to re-assign a weight to each task, as shown in equation (7):

$l' = \frac{1}{2\sigma^2}\, l + \log\left(1 + \frac{1}{\sigma^2}\right)$  (7)

where $\sigma$ denotes the standard deviation of the Gaussian distribution, $l$ denotes the loss of a single task, and $l'$ denotes that task's loss after the weight update.
CN202210656832.3A 2022-06-07 2022-06-07 Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention Active CN115062146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210656832.3A CN115062146B (en) 2022-06-07 2022-06-07 Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210656832.3A CN115062146B (en) 2022-06-07 2022-06-07 Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention

Publications (2)

Publication Number Publication Date
CN115062146A true CN115062146A (en) 2022-09-16
CN115062146B CN115062146B (en) 2024-06-28

Family

ID=83201301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210656832.3A Active CN115062146B (en) 2022-06-07 2022-06-07 Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention

Country Status (1)

Country Link
CN (1) CN115062146B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629237A (en) * 2023-07-25 2023-08-22 江西财经大学 Event representation learning method and system based on gradually integrated multilayer attention
CN116861901A (en) * 2023-07-04 2023-10-10 广东外语外贸大学 Chinese event detection method and system based on multitask learning and electronic equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143576A (en) * 2019-12-18 2020-05-12 中科院计算技术研究所大数据研究院 Event-oriented dynamic knowledge graph construction method and device
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
CN111461004A (en) * 2020-03-31 2020-07-28 北京邮电大学 Event detection method and device based on graph attention neural network and electronic equipment
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN111858898A (en) * 2020-07-30 2020-10-30 中国科学院自动化研究所 Text processing method and device based on artificial intelligence and electronic equipment
KR20210036318A (en) * 2020-03-20 2021-04-02 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for extracting event argument, electronic device
CN112613305A (en) * 2020-12-27 2021-04-06 北京工业大学 Chinese event extraction method based on cyclic neural network
CN112632230A (en) * 2020-12-30 2021-04-09 中国科学院空天信息创新研究院 Event joint extraction method and device based on multi-level graph network
CN112765952A (en) * 2020-12-28 2021-05-07 大连理工大学 Conditional probability combined event extraction method under graph convolution attention mechanism
CN113761936A (en) * 2021-08-19 2021-12-07 哈尔滨工业大学(威海) Multi-task chapter-level event extraction method based on multi-head self-attention mechanism
US20220100958A1 (en) * 2020-09-30 2022-03-31 Astrazeneca Ab Automated Detection of Safety Signals for Pharmacovigilance

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143576A (en) * 2019-12-18 2020-05-12 中科院计算技术研究所大数据研究院 Event-oriented dynamic knowledge graph construction method and device
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information
KR20210036318A (en) * 2020-03-20 2021-04-02 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for extracting event argument, electronic device
US20210200947A1 (en) * 2020-03-20 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Event argument extraction method and apparatus and electronic device
CN111461004A (en) * 2020-03-31 2020-07-28 北京邮电大学 Event detection method and device based on graph attention neural network and electronic equipment
CN111581396A (en) * 2020-05-06 2020-08-25 西安交通大学 Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN111858898A (en) * 2020-07-30 2020-10-30 中国科学院自动化研究所 Text processing method and device based on artificial intelligence and electronic equipment
US20220100958A1 (en) * 2020-09-30 2022-03-31 Astrazeneca Ab Automated Detection of Safety Signals for Pharmacovigilance
CN112613305A (en) * 2020-12-27 2021-04-06 北京工业大学 Chinese event extraction method based on cyclic neural network
CN112765952A (en) * 2020-12-28 2021-05-07 大连理工大学 Conditional probability combined event extraction method under graph convolution attention mechanism
CN112632230A (en) * 2020-12-30 2021-04-09 中国科学院空天信息创新研究院 Event joint extraction method and device based on multi-level graph network
CN113761936A (en) * 2021-08-19 2021-12-07 哈尔滨工业大学(威海) Multi-task chapter-level event extraction method based on multi-head self-attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LING GAN et al.: "Chinese overlapping event extraction method based on BERT-Text CNN fusion", 2022 4th International Conference, 16 October 2022 (2022-10-16), pages 1-10 *
LI Dongmei; ZHANG Yang; LI Dongyuan; LIN Danqiong: "A survey of entity relation extraction methods", Journal of Computer Research and Development, vol. 57, no. 07, 7 July 2020 (2020-07-07), pages 1424-1448 *
WEI You; LIU Maofu; HU Huijun: "Biomedical event extraction based on deep contextualized word representations and self-attention", Computer Engineering and Science, vol. 42, no. 09, 15 September 2020 (2020-09-15), pages 1670-1679 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861901A (en) * 2023-07-04 2023-10-10 广东外语外贸大学 Chinese event detection method and system based on multitask learning and electronic equipment
CN116861901B (en) * 2023-07-04 2024-04-09 广东外语外贸大学 Chinese event detection method and system based on multitask learning and electronic equipment
CN116629237A (en) * 2023-07-25 2023-08-22 江西财经大学 Event representation learning method and system based on gradually integrated multilayer attention
CN116629237B (en) * 2023-07-25 2023-10-10 江西财经大学 Event representation learning method and system based on gradually integrated multilayer attention

Also Published As

Publication number Publication date
CN115062146B (en) 2024-06-28

Similar Documents

Publication Publication Date Title
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
Zhou et al. Predicting discourse connectives for implicit discourse relation recognition
CN110704621B (en) Text processing method and device, storage medium and electronic equipment
CN108319686A (en) Antagonism cross-media retrieval method based on limited text space
CN115062146A (en) Chinese overlapping event extraction system based on BiLSTM combined with multi-head attention
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN106407333A (en) Artificial intelligence-based spoken language query identification method and apparatus
CN111563158B (en) Text ranking method, ranking apparatus, server and computer-readable storage medium
CN113157859B (en) Event detection method based on upper concept information
CN113343683A (en) Chinese new word discovery method and device integrating self-encoder and countertraining
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN115034221B (en) Overlapping relation extraction system based on BiLSTM combined with global pointer
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
King et al. High-precision extraction of emerging concepts from scientific literature
CN114997288A (en) Design resource association method
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN117669764A (en) Data labeling method, medium and equipment based on language model and active learning
CN117312562A (en) Training method, device, equipment and storage medium of content auditing model
CN110889284A (en) Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network
CN115098687A (en) Alarm checking method and device for scheduling operation of electric power SDH optical transmission system
CN115081445A (en) Short text entity disambiguation method based on multitask learning
CN115906838A (en) Text extraction method and device, electronic equipment and storage medium
CN114239539A (en) English composition off-topic detection method and device
Brundha et al. Name entity recognition for Air Traffic Control transcripts using deep learning based approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant