CN111507107A - Sequence-to-sequence-based extraction method for alert condition record events - Google Patents
Sequence-to-sequence-based extraction method for alert condition record events
- Publication number
- CN111507107A (Application CN202010292535.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- event
- condition record
- data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services; Handling legal documents
Abstract
The invention discloses a sequence-to-sequence method for extracting events from alert records, comprising the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
Description
Technical Field
The invention relates to the fields of natural language processing and deep learning, and in particular to a sequence-to-sequence method for extracting events from alert records.
Background Art
Current methods for extracting events from alert records fall into two categories. The first is the pipelined extraction method, which splits alert-record event extraction into two subtasks: trigger-word recognition and classification, and argument recognition and classification, after which a complete alert-record event is assembled by post-processing. Because the pipeline separates the subtasks, they cannot exploit each other's information, and errors in one subtask cascade into the next, causing error propagation. The second is the joint extraction method, which identifies trigger words, event types, arguments, and argument roles simultaneously to construct alert-record events. Although parameter sharing makes the dependence between subtasks tighter, the problem of cascading error propagation remains. Moreover, careful analysis of alert-record data shows that events in alert records are triggered in two ways: by a trigger word, or by an event statement as a whole. Because current methods first recognize a trigger word and then classify the event, they cannot extract events that lack trigger words, and therefore cannot extract all events from alert records.
The invention provides a sequence-to-sequence method for extracting alert-record events, aimed at three problems of existing methods: error propagation, inability to fully exploit the interaction between subtasks, and events in alert-record data that have no trigger words. The method addresses both the error propagation and subtask-interaction problems of current alert-record event extraction and the absence of trigger words in some events, improving the precision and recall of alert-record event extraction.
Disclosure of Invention
To realize the purpose of the invention, the following technical scheme is adopted:
A sequence-to-sequence method for extracting alert-record events comprises the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
In the sequence-to-sequence extraction method for alert-record events, Step 1 comprises: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit. First, split the alert-record data unit into individual characters; then recognize words against a preset vocabulary and store each recognized word until the iteration count is exhausted, producing preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
In the sequence-to-sequence extraction method for alert-record events, Step 2 comprises:

2.1 Vectorize the preprocessed data W to generate word vectors: W, and the position sequence P = {p1, p2, ..., pk} of the words in W, are converted into word-embedding and position-embedding vectors, respectively.

2.2 The embedding vectors are transformed by BERT. For each input vector x, BERT first generates three vectors Q, K, and V:

Q = x·WQ + bq, K = x·WK + bk, V = x·WV + bv.

Next, the weight between the input vectors is computed from the Q and K vectors:

score = Q·K^T.

To stabilize the gradient, the score is divided by √dk and normalized with softmax; multiplying by V yields the self-attention output for each input vector:

self_attention = softmax(Q·K^T / √dk)·V.

The self-attention outputs corresponding to the input vectors are then summed to obtain the output Z:

Z = Σ self_attention,

and finally Z is passed through a two-layer fully connected network to obtain the layer output O:

O = max(0, Z·W1 + b1)·W2 + b2,

where max takes the larger value element-wise. To better capture the context information of a sentence, self-attention is applied to the input vectors multiple times, i.e., multi-head self-attention:

Zi = self_attention_i, i ∈ {1, ..., n},

where n is the number of heads. The Zi are concatenated into a feature matrix and passed through a fully connected layer to obtain the output Z:

Multi-Head = Concat(Z0, ..., Zi),
Z = Multi-Head·W + b,

where WQ, WK, WV, W, W1, W2 are weights; bq, bk, bv, b, b1, b2 are biases; Q, K, and V are the query, key, and value vectors; √dk is the square root of the key-vector dimension; and Concat is the concatenation function. BERT stacks 12 layers of the above network, each with 12 attention heads, and finally generates 768-dimensional word vectors VO = {v1, v2, ..., vk}.
In the sequence-to-sequence extraction method for alert-record events, Step 3 comprises: input the word vectors VO into a bidirectional LSTM network. At time t the input is xt. The forget gate ft decides which information to discard or keep:

ft = σ(Wf·[h(t-1), xt] + bf).

The input gate it updates the neuron state:

it = σ(Wi·[h(t-1), xt] + bi),
C̃t = tanh(WC·[h(t-1), xt] + bC).

The current neuron state Ct is expressed as:

Ct = ft ∗ C(t-1) + it ∗ C̃t,

and the output gate and hidden state are:

ot = σ(Wo·[h(t-1), xt] + bo),
ht = ot ∗ tanh(Ct),

where Wf, Wi, WC, Wo are weights; bf, bi, bC, bo are biases; h(t-1) is the output of the previous time step; C(t-1) is the neuron state at the previous time step; tanh is the activation function; and σ is the sigmoid function. The backward pass of the LSTM is the same as the forward pass, and the network finally generates a fixed-dimension vector S.
In the sequence-to-sequence extraction method for alert-record events, Step 4 comprises:

4.1 Initialize the decoder LSTM network with the vector S generated in Step 3, i.e., set h0^D = S. Let the target alert-record event vector sequence be T = {y0, y1, y2, ..., ym}, where y0 and ym are the embedding vectors of the decoding start identifier SOS and end identifier EOS; ht^D and ot^D (t ≥ 1) are the hidden state and decoded output vector of the decoding module at step t; ot (t ≥ 1) is the event-element output of the decoding module at step t; et is the encoder-decoder attention of the decoding module at step t; yt ∈ T (0 < t < m) is the event vector at step t; and [et; y(t-1)] is the decoder input at step t:

ht^D = LSTM([et; y(t-1)], h(t-1)^D),
ot^D = wv·ht^D + bv,
ot = softmax(ot^D),
ut^i = wu·hi, qt^i = wq·h(t-1)^D + bq,
at^i = va·tanh(qt^i + ut^i), at = softmax(at),
yt = embedding(ot), 0 < t < m,

where wv, wu, wq, va are weights, bv, bq are biases, and hi is the i-th encoder hidden state.
4.2 Decoding begins from the input SOS. The decoder first generates an event type, then the special separator ';', an event argument, the special separator ':', and an argument role, and so on, outputting the event until decoding ends at the input EOS. Concretely:

Step t = 1: input [e1; y0], generate the event type, output o1;
Step t = 2: input [e2; y1], generate the special character ';', output o2;
Step t = 3: input [e3; y2], generate an event argument, output o3;
Step t = 4: input [e4; y3], generate the special character ':', output o4;
Step t = 5: input [e5; y4], generate an argument role, output o5;
and so on, until the input [e(m+1); ym] ends decoding, outputting {o1, o2, ..., o(m-1)}.

4.3 Construct the alert-record event from the decoded output in the form: event type; event argument: argument role; event argument: argument role | event type; event argument: argument role; event argument: argument role, where ';' is the special separator between the event type and an event argument and between event arguments, ':' is the separator between an event argument and its argument role, and '|' is the separator between events.
and 5, repeatedly executing the steps 1-4, and extracting the events of all the warning condition record data units in the warning condition record file, namely finishing the extraction of the events of all the warning condition record files.
Drawings
FIG. 1 is a flow chart of the sequence-to-sequence alert-record event extraction method;
FIG. 2 is a system block diagram of the sequence-to-sequence alert-record event extraction method.
Detailed Description
The invention is described in detail below with reference to FIGS. 1-2.
The invention provides a sequence-to-sequence method for extracting alert-record events. The concrete implementation steps are as follows:
Step 1: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit. X is input to the word-segmentation module, which applies the BPE (byte-pair encoding) algorithm: the alert-record data unit is first split into individual characters; then, in each iteration, the module counts how often each single character or adjacent-character combination in the data unit occurs in a preset vocabulary and stores the most frequent one as a word, until the iteration count is exhausted, generating preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
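The character-splitting and vocabulary-matching loop of Step 1 can be sketched as follows. This is a simplified greedy illustration under assumptions — the toy vocabulary, the merge limit, and the tie-breaking are hypothetical, not the patent's implementation:

```python
from collections import Counter

def segment(text, vocab, max_merges=10):
    """Greedy BPE-style segmentation as in Step 1: start from single
    characters and repeatedly merge the most frequent adjacent-character
    combination that occurs in a preset vocabulary."""
    tokens = list(text)  # split the data unit into individual characters
    for _ in range(max_merges):  # loop until the iteration count is exhausted
        # count adjacent-character combinations that appear in the vocabulary
        pairs = Counter(
            tokens[i] + tokens[i + 1]
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in vocab
        )
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent in-vocab combination
        merged, i = [], 0
        while i < len(tokens):  # store the combination as a single word
            if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] == best:
                merged.append(best)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens  # W = {w1, ..., wk}, k <= n

vocab = {"ab", "abc", "cd"}  # hypothetical vocabulary for illustration
print(segment("abcdab", vocab))  # ['abc', 'd', 'ab']
```

Note that, as described, the loop merges whichever in-vocabulary combination is most frequent in the current token sequence, so the result depends on the vocabulary and the merge limit.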
Step 2: vectorize and encode the preprocessed data. W, and the position sequence of the words in W, are converted into word-embedding and position-embedding vectors, which are then transformed by BERT. For each input vector x, BERT first generates three vectors Q, K, and V:

Q = x·WQ + bq, K = x·WK + bk, V = x·WV + bv.

Next, the weight between the input vectors is computed from the Q and K vectors:

score = Q·K^T.

To stabilize the gradient, the score is divided by √dk and normalized with softmax; multiplying by V yields the self-attention output for each input vector:

self_attention = softmax(Q·K^T / √dk)·V.

The self-attention outputs corresponding to the input vectors are then summed to obtain the output Z:

Z = Σ self_attention,

and finally Z is passed through a two-layer fully connected network to obtain the final output vector O:

O = max(0, Z·W1 + b1)·W2 + b2,

where max takes the larger value element-wise. To better capture the context information of a sentence, self-attention is applied to the input vectors multiple times, i.e., multi-head self-attention:

Zi = self_attention_i, i ∈ {1, ..., n},

where n is the number of heads. The Zi are concatenated into a feature matrix and passed through a fully connected layer to obtain the output Z:

Multi-Head = Concat(Z0, ..., Zi),
Z = Multi-Head·W + b,

where WQ, WK, WV, W, W1, W2 are weights; bq, bk, bv, b, b1, b2 are biases; Q, K, and V are the query, key, and value vectors; √dk is the square root of the key-vector dimension; and Concat is the concatenation function. BERT stacks 12 layers of the above network, each with 12 attention heads, and finally generates 768-dimensional word vectors VO = {v1, v2, ..., vk}.
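The scaled dot-product and multi-head formulas above can be illustrated with a small NumPy sketch. The dimensions, head count, and random weights are illustrative assumptions; an actual BERT uses learned parameters, 12 layers, and 12 heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(dk)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # query, key, value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled attention scores
    return softmax(scores) @ V

def multi_head(X, heads):
    """Multi-head variant: Multi-Head = Concat(Z_0, ..., Z_i)."""
    Zs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(Zs, axis=-1)       # concatenated feature matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings
# two heads, each projecting 8 -> 4 dimensions (illustrative sizes)
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Z = multi_head(X, heads)     # shape (4, 8): 2 heads x 4 dimensions each
```

The per-row softmax guarantees that each token's attention weights over the other tokens sum to one, which is what makes the output a weighted average of the value vectors.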
Step 3: encode the word vectors V generated in Step 2 into a fixed-dimension vector, i.e., input the word vectors V into a bidirectional LSTM network, which encodes them into a fixed-dimension vector S so as to capture more comprehensive contextual semantic features. Concretely:

The word vectors V are input into the bidirectional LSTM network; at time t the input is xt (the unidirectional LSTM is described as an example). The forget gate ft decides which information to discard or keep:

ft = σ(Wf·[h(t-1), xt] + bf).

The input gate it updates the neuron state:

it = σ(Wi·[h(t-1), xt] + bi),
C̃t = tanh(WC·[h(t-1), xt] + bC).

The current neuron state Ct is expressed as:

Ct = ft ∗ C(t-1) + it ∗ C̃t,

and the output gate and hidden state are:

ot = σ(Wo·[h(t-1), xt] + bo),
ht = ot ∗ tanh(Ct),

where Wf, Wi, WC, Wo are weights; bf, bi, bC, bo are biases; h(t-1) is the output of the previous time step; C(t-1) is the neuron state at the previous time step; tanh is the activation function; and σ is the sigmoid function. The backward pass of the LSTM is the same as the forward pass, and the network finally generates the fixed-dimension vector S.
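The gate equations above can be sketched as a single forward LSTM step. This is a minimal illustration with random weights; a bidirectional network would run a second pass in reverse and combine the two hidden sequences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One forward LSTM step following the gate equations:
    forget gate f_t, input gate i_t, candidate state, neuron state C_t,
    output gate o_t, hidden state h_t."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: discard or keep
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: update state
    C_tilde = np.tanh(W["C"] @ z + b["C"])  # candidate neuron state
    C = f * C_prev + i * C_tilde            # current neuron state C_t
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    h = o * np.tanh(C)                      # hidden state h_t
    return h, C

rng = np.random.default_rng(1)
d_in, d_h = 6, 4  # illustrative input and hidden sizes
W = {k: rng.normal(size=(d_h, d_h + d_in)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # run over 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)   # final h plays the role of S
```

Because h = o ∗ tanh(C) with o in (0, 1), every component of the hidden state stays strictly inside (-1, 1), which is one reason the fixed-dimension encoding remains well-behaved over long inputs.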
Step 4: decode the fixed-dimension vector S, generating the event type, event arguments, and their argument roles to construct the complete alert-record event. That is, initialize the unidirectional decoder LSTM network with the vector S generated in Step 3, i.e., set h0^D = S. Let the target alert-record event vector sequence be T = {y0, y1, y2, ..., ym}, where y0 and ym are the embedding vectors of the decoding start identifier SOS and end identifier EOS; ht^D and ot^D (t ≥ 1) are the hidden state and decoded output vector of the decoding module at step t; ot (t ≥ 1) is the event-element output of the decoding module at step t; et is the encoder-decoder attention of the decoding module at step t; yt ∈ T (0 < t < m) is the event vector at step t; and [et; y(t-1)] is the decoder input at step t:

ht^D = LSTM([et; y(t-1)], h(t-1)^D),
ot^D = wv·ht^D + bv,
ot = softmax(ot^D),
ut^i = wu·hi, qt^i = wq·h(t-1)^D + bq,
at^i = va·tanh(qt^i + ut^i), at = softmax(at),
yt = embedding(ot), 0 < t < m,

where wv, wu, wq, va are weights, bv, bq are biases, and hi is the i-th encoder hidden state.
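The decoder recurrence and additive attention above can be sketched as follows. For brevity a plain tanh recurrence stands in for the LSTM cell, and the context vector e_t is taken as the attention-weighted sum of encoder states — both are assumptions for illustration, not the patent's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(h_enc, h_dec_prev, p):
    """Additive attention: a_t^i = v_a tanh(q_t^i + u_t^i), a_t = softmax(a_t)."""
    u = h_enc @ p["w_u"].T                  # u_t^i = w_u h_i, per encoder state
    q = p["w_q"] @ h_dec_prev + p["b_q"]    # q_t^i = w_q h_{t-1}^D + b_q
    a = softmax(np.tanh(u + q) @ p["v_a"])  # weights over encoder positions
    return a @ h_enc                        # context e_t (weighted sum, assumed)

def decode_step(h_enc, y_prev, h_dec_prev, p):
    e_t = attention(h_enc, h_dec_prev, p)
    x = np.concatenate([e_t, y_prev])       # decoder input [e_t; y_{t-1}]
    # plain tanh recurrence stands in for the LSTM cell (for brevity)
    h_t = np.tanh(p["W_x"] @ x + p["W_r"] @ h_dec_prev)
    o_t = softmax(p["w_v"] @ h_t + p["b_v"])  # o_t = softmax(w_v h_t^D + b_v)
    return h_t, o_t

rng = np.random.default_rng(2)
d_h, d_y, vocab = 4, 3, 7                   # illustrative sizes
h_enc = rng.normal(size=(5, d_h))           # 5 encoder hidden states
p = {
    "w_u": rng.normal(size=(d_h, d_h)), "w_q": rng.normal(size=(d_h, d_h)),
    "b_q": np.zeros(d_h), "v_a": rng.normal(size=d_h),
    "W_x": rng.normal(size=(d_h, d_h + d_y)), "W_r": rng.normal(size=(d_h, d_h)),
    "w_v": rng.normal(size=(vocab, d_h)), "b_v": np.zeros(vocab),
}
h = h_enc[-1]                               # h_0^D initialized from encoder S
y = np.zeros(d_y)                           # embedding of SOS
h, o = decode_step(h_enc, y, h, p)          # o is a distribution over outputs
```

At each step the argmax of o would select the next event element (type, separator, argument, or role), whose embedding feeds the following step.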
Decoding begins from the input SOS. The decoder first generates an event type, then the special separator ';', an event argument, the special separator ':', and an argument role, and so on, outputting the event until decoding ends at the input EOS. Concretely:

Step t = 1: input [e1; y0], generate the event type, output o1;
Step t = 2: input [e2; y1], generate the special character ';', output o2;
Step t = 3: input [e3; y2], generate an event argument, output o3;
Step t = 4: input [e4; y3], generate the special character ':', output o4;
Step t = 5: input [e5; y4], generate an argument role, output o5;
and so on, until the input [e(m+1); ym] ends decoding, outputting the event {o1, o2, ..., o(m-1)}.

The alert-record event is then constructed from the decoded output in the form: event type; event argument: argument role; event argument: argument role | event type; event argument: argument role; event argument: argument role, where ';' is the special separator between the event type and an event argument and between event arguments, ':' is the separator between an event argument and its argument role, and '|' is the separator between events.
Steps 1-4 are repeated to extract the events of all alert-record data units in the alert-record file, completing event extraction for the entire file.
The following example illustrates the result of the above method:

For example, the system receives the alert-record text: "Between 20:12 on May 10, 2019 and 22:44 on May 10, 2019, I transferred a total of 32,924.4 RMB in 25 transactions through my account X (debiting my X merchant bank card) to 25 QR codes provided in turn by the other party's account X. I was at David Daoko X University, Cambodia, in the Tianhe District of Guangzhou, where I performed the transfers."

Output: money transfer; 32,924.4 RMB: amount; David Daoko X University, Cambodia, Tianhe District, Guangzhou: location; 20:12 on May 10, 2019 to 22:44 on May 10, 2019: time.
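The linearized output format described above ('type; argument: role; ... | type; ...') can be parsed back into structured events. A minimal sketch — the dictionary field names are assumptions for illustration:

```python
def parse_events(decoded):
    """Parse the linearized decoder output into structured events.
    Format: 'type; arg: role; arg: role | type; arg: role; ...'"""
    events = []
    for chunk in decoded.split("|"):                   # '|' separates events
        parts = [p.strip() for p in chunk.split(";")]  # ';' separates fields
        event = {"type": parts[0], "arguments": []}
        for field in parts[1:]:
            if not field:
                continue
            arg, _, role = field.partition(":")  # ':' separates arg from role
            event["arguments"].append((arg.strip(), role.strip()))
        events.append(event)
    return events

out = parse_events("money transfer; 32924.4 RMB: amount; Guangzhou: location")
```

Running this on the example output above yields one event of type "money transfer" with (argument, role) pairs such as ("32924.4 RMB", "amount").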
The invention solves both the error-propagation problem of current alert-record event extraction and its inability to fully exploit the interaction between subtasks, and also handles events in alert-record data that have no trigger words. It achieves efficient extraction of alert-record events, improves the precision and recall of alert-record event extraction, and provides data support and convenience for crime prevention by public security departments.
Claims (2)
1. A sequence-to-sequence method for extracting alert-record events, characterized by comprising the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
2. The sequence-to-sequence extraction method for alert-record events according to claim 1, wherein Step 1 comprises: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit; first, split the alert-record data unit into individual characters; then recognize words against a preset vocabulary and store each recognized word until the iteration count is exhausted, generating preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010292535.6A CN111507107A (en) | 2020-04-15 | 2020-04-15 | Sequence-to-sequence-based extraction method for alert condition record events |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111507107A true CN111507107A (en) | 2020-08-07 |
Family
ID=71864682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010292535.6A Withdrawn CN111507107A (en) | 2020-04-15 | 2020-04-15 | Sequence-to-sequence-based extraction method for alert condition record events |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111507107A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347783A (en) * | 2020-11-11 | 2021-02-09 | 湖南数定智能科技有限公司 | Method for identifying types of alert condition record data events without trigger words |
CN112347783B (en) * | 2020-11-11 | 2023-10-31 | 湖南数定智能科技有限公司 | Alarm condition and stroke data event type identification method without trigger words |
CN112765980A (en) * | 2021-02-01 | 2021-05-07 | 广州市刑事科学技术研究所 | Event argument role extraction method and device for alert condition record |
CN114936563A (en) * | 2022-04-27 | 2022-08-23 | 苏州大学 | Event extraction method and device and storage medium |
CN114610866A (en) * | 2022-05-12 | 2022-06-10 | 湖南警察学院 | Sequence-to-sequence combined event extraction method and system based on global event type |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111507107A (en) | Sequence-to-sequence-based extraction method for alert condition record events | |
CN110209836B (en) | Remote supervision relation extraction method and device | |
CN110119765B (en) | Keyword extraction method based on Seq2Seq framework | |
CN109471895B (en) | Electronic medical record phenotype extraction and phenotype name normalization method and system | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN110928997A (en) | Intention recognition method and device, electronic equipment and readable storage medium | |
CN113407660B (en) | Unstructured text event extraction method | |
CN111400494B (en) | Emotion analysis method based on GCN-Attention | |
CN112926303A (en) | Malicious URL detection method based on BERT-BiGRU | |
CN112507995B (en) | Cross-model face feature vector conversion system and method | |
CN112612871B (en) | Multi-event detection method based on sequence generation model | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN114254655B (en) | Network security tracing semantic identification method based on prompt self-supervision learning | |
CN111814489A (en) | Spoken language semantic understanding method and system | |
CN114443827A (en) | Local information perception dialogue method and system based on pre-training language model | |
CN113806554A (en) | Knowledge graph construction method for massive conference texts | |
CN115438215A (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN115687609A (en) | Zero sample relation extraction method based on Prompt multi-template fusion | |
CN115512195A (en) | Image description method based on multi-interaction information fusion | |
CN114610866A (en) | Sequence-to-sequence combined event extraction method and system based on global event type | |
CN112069825B (en) | Entity relation joint extraction method for alert condition record data | |
CN113704473A (en) | Media false news detection method and system based on long text feature extraction optimization | |
CN113051904A (en) | Link prediction method for small-scale knowledge graph | |
CN112347783A (en) | Method for identifying types of alert condition record data events without trigger words | |
CN113505937B (en) | Multi-view encoder-based legal decision prediction system and method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20200807