CN111507107A - Sequence-to-sequence-based extraction method for alert condition record events - Google Patents
Sequence-to-sequence-based extraction method for alert condition record events
- Publication number
- CN111507107A (Application CN202010292535.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- event
- condition record
- data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services; Handling legal documents
Abstract
The invention discloses a sequence-to-sequence method for extracting events from alert records, comprising the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
Description
Technical Field
The invention relates to the fields of natural language processing and deep learning, and in particular to a sequence-to-sequence method for extracting events from alert records.
Background Art
Current methods for extracting events from alert records fall into two categories. The first is the pipelined extraction method, which splits alert-record event extraction into two subtasks: trigger-word recognition and classification, and argument recognition and classification, after which a complete alert-record event is assembled by post-processing. Because the pipeline separates the subtasks, they cannot exploit each other's information, and errors in one subtask cascade into the next, causing error propagation. The second is the joint extraction method, which identifies trigger words, event types, arguments, and argument roles simultaneously to construct alert-record events. Although parameter sharing makes the dependence between subtasks tighter, the problem of cascading error propagation remains. Moreover, careful analysis of alert-record data shows that events in alert records are triggered in two ways: by a trigger word, or by an event statement as a whole. Because current methods first recognize a trigger word and then classify the event, they cannot extract events that lack trigger words, and therefore cannot extract all events from alert records.
The invention provides a sequence-to-sequence method for extracting alert-record events, aimed at three problems of existing methods: error propagation, inability to fully exploit the interaction between subtasks, and events in alert-record data that have no trigger words. The method addresses both the error propagation and subtask-interaction problems of current alert-record event extraction and the absence of trigger words in some events, improving the precision and recall of alert-record event extraction.
Disclosure of Invention
To realize the purpose of the invention, the following technical scheme is adopted:
A sequence-to-sequence method for extracting alert-record events comprises the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
In the sequence-to-sequence extraction method for alert-record events, Step 1 comprises: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit. First, split the alert-record data unit into individual characters; then recognize words against a preset vocabulary and store each recognized word until the iteration count is exhausted, producing preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
In the sequence-to-sequence extraction method for alert-record events, Step 2 comprises:

2.1 Vectorize the preprocessed data W to generate word vectors: W, and the position sequence P = {p1, p2, ..., pk} of the words in W, are converted into word-embedding and position-embedding vectors, respectively.

2.2 The embedding vectors are transformed by BERT. For each input vector x, BERT first generates three vectors Q, K, and V:

Q = x·WQ + bq, K = x·WK + bk, V = x·WV + bv.

Next, the weight between the input vectors is computed from the Q and K vectors:

score = Q·K^T.

To stabilize the gradient, the score is divided by √dk and normalized with softmax; multiplying by V yields the self-attention output for each input vector:

self_attention = softmax(Q·K^T / √dk)·V.

The self-attention outputs corresponding to the input vectors are then summed to obtain the output Z:

Z = Σ self_attention,

and finally Z is passed through a two-layer fully connected network to obtain the layer output O:

O = max(0, Z·W1 + b1)·W2 + b2,

where max takes the larger value element-wise. To better capture the context information of a sentence, self-attention is applied to the input vectors multiple times, i.e., multi-head self-attention:

Zi = self_attention_i, i ∈ {1, ..., n},

where n is the number of heads. The Zi are concatenated into a feature matrix and passed through a fully connected layer to obtain the output Z:

Multi-Head = Concat(Z0, ..., Zi),
Z = Multi-Head·W + b,

where WQ, WK, WV, W, W1, W2 are weights; bq, bk, bv, b, b1, b2 are biases; Q, K, and V are the query, key, and value vectors; √dk is the square root of the key-vector dimension; and Concat is the concatenation function. BERT stacks 12 layers of the above network, each with 12 attention heads, and finally generates 768-dimensional word vectors VO = {v1, v2, ..., vk}.
In the sequence-to-sequence extraction method for alert-record events, Step 3 comprises: input the word vectors VO into a bidirectional LSTM network. At time t the input is xt. The forget gate ft decides which information to discard or keep:

ft = σ(Wf·[h(t-1), xt] + bf).

The input gate it updates the neuron state:

it = σ(Wi·[h(t-1), xt] + bi),
C̃t = tanh(WC·[h(t-1), xt] + bC).

The current neuron state Ct is expressed as:

Ct = ft ∗ C(t-1) + it ∗ C̃t,

and the output gate and hidden state are:

ot = σ(Wo·[h(t-1), xt] + bo),
ht = ot ∗ tanh(Ct),

where Wf, Wi, WC, Wo are weights; bf, bi, bC, bo are biases; h(t-1) is the output of the previous time step; C(t-1) is the neuron state at the previous time step; tanh is the activation function; and σ is the sigmoid function. The backward pass of the LSTM is the same as the forward pass, and the network finally generates a fixed-dimension vector S.
In the sequence-to-sequence extraction method for alert-record events, Step 4 comprises:

4.1 Initialize the decoder LSTM network with the vector S generated in Step 3, i.e., set h0^D = S. Let the target alert-record event vector sequence be T = {y0, y1, y2, ..., ym}, where y0 and ym are the embedding vectors of the decoding start identifier SOS and end identifier EOS; ht^D and ot^D (t ≥ 1) are the hidden state and decoded output vector of the decoding module at step t; ot (t ≥ 1) is the event-element output of the decoding module at step t; et is the encoder-decoder attention of the decoding module at step t; yt ∈ T (0 < t < m) is the event vector at step t; and [et; y(t-1)] is the decoder input at step t:

ht^D = LSTM([et; y(t-1)], h(t-1)^D),
ot^D = wv·ht^D + bv,
ot = softmax(ot^D),
ut^i = wu·hi, qt^i = wq·h(t-1)^D + bq,
at^i = va·tanh(qt^i + ut^i), at = softmax(at),
yt = embedding(ot), 0 < t < m,

where wv, wu, wq, va are weights, bv, bq are biases, and hi is the i-th encoder hidden state.
4.2 Decoding begins from the input SOS. The decoder first generates an event type, then the special separator ';', an event argument, the special separator ':', and an argument role, and so on, outputting the event until decoding ends at the input EOS. Concretely:

Step t = 1: input [e1; y0], generate the event type, output o1;
Step t = 2: input [e2; y1], generate the special character ';', output o2;
Step t = 3: input [e3; y2], generate an event argument, output o3;
Step t = 4: input [e4; y3], generate the special character ':', output o4;
Step t = 5: input [e5; y4], generate an argument role, output o5;
and so on, until the input [e(m+1); ym] ends decoding, outputting {o1, o2, ..., o(m-1)}.

4.3 Construct the alert-record event from the decoded output in the form: event type; event argument: argument role; event argument: argument role | event type; event argument: argument role; event argument: argument role, where ';' is the special separator between the event type and an event argument and between event arguments, ':' is the separator between an event argument and its argument role, and '|' is the separator between events.
and 5, repeatedly executing the steps 1-4, and extracting the events of all the warning condition record data units in the warning condition record file, namely finishing the extraction of the events of all the warning condition record files.
Drawings
FIG. 1 is a flow chart of the sequence-to-sequence alert-record event extraction method;
FIG. 2 is a system block diagram of the sequence-to-sequence alert-record event extraction method.
Detailed Description
The invention is described in detail below with reference to FIGS. 1-2.
The invention provides a sequence-to-sequence method for extracting alert-record events. The concrete implementation steps are as follows:
Step 1: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit. X is input to the word-segmentation module, which applies the BPE (byte-pair encoding) algorithm: the alert-record data unit is first split into individual characters; then, in each iteration, the module counts how often each single character or adjacent-character combination in the data unit occurs in a preset vocabulary and stores the most frequent one as a word, until the iteration count is exhausted, generating preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
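The character-splitting and vocabulary-matching loop of Step 1 can be sketched as follows. This is a simplified greedy illustration under assumptions — the toy vocabulary, the merge limit, and the tie-breaking are hypothetical, not the patent's implementation:

```python
from collections import Counter

def segment(text, vocab, max_merges=10):
    """Greedy BPE-style segmentation as in Step 1: start from single
    characters and repeatedly merge the most frequent adjacent-character
    combination that occurs in a preset vocabulary."""
    tokens = list(text)  # split the data unit into individual characters
    for _ in range(max_merges):  # loop until the iteration count is exhausted
        # count adjacent-character combinations that appear in the vocabulary
        pairs = Counter(
            tokens[i] + tokens[i + 1]
            for i in range(len(tokens) - 1)
            if tokens[i] + tokens[i + 1] in vocab
        )
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent in-vocab combination
        merged, i = [], 0
        while i < len(tokens):  # store the combination as a single word
            if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] == best:
                merged.append(best)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens  # W = {w1, ..., wk}, k <= n

vocab = {"ab", "abc", "cd"}  # hypothetical vocabulary for illustration
print(segment("abcdab", vocab))  # ['abc', 'd', 'ab']
```

Note that, as described, the loop merges whichever in-vocabulary combination is most frequent in the current token sequence, so the result depends on the vocabulary and the merge limit.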
Step 2: vectorize and encode the preprocessed data. W, and the position sequence of the words in W, are converted into word-embedding and position-embedding vectors, which are then transformed by BERT. For each input vector x, BERT first generates three vectors Q, K, and V:

Q = x·WQ + bq, K = x·WK + bk, V = x·WV + bv.

Next, the weight between the input vectors is computed from the Q and K vectors:

score = Q·K^T.

To stabilize the gradient, the score is divided by √dk and normalized with softmax; multiplying by V yields the self-attention output for each input vector:

self_attention = softmax(Q·K^T / √dk)·V.

The self-attention outputs corresponding to the input vectors are then summed to obtain the output Z:

Z = Σ self_attention,

and finally Z is passed through a two-layer fully connected network to obtain the final output vector O:

O = max(0, Z·W1 + b1)·W2 + b2,

where max takes the larger value element-wise. To better capture the context information of a sentence, self-attention is applied to the input vectors multiple times, i.e., multi-head self-attention:

Zi = self_attention_i, i ∈ {1, ..., n},

where n is the number of heads. The Zi are concatenated into a feature matrix and passed through a fully connected layer to obtain the output Z:

Multi-Head = Concat(Z0, ..., Zi),
Z = Multi-Head·W + b,

where WQ, WK, WV, W, W1, W2 are weights; bq, bk, bv, b, b1, b2 are biases; Q, K, and V are the query, key, and value vectors; √dk is the square root of the key-vector dimension; and Concat is the concatenation function. BERT stacks 12 layers of the above network, each with 12 attention heads, and finally generates 768-dimensional word vectors VO = {v1, v2, ..., vk}.
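The scaled dot-product and multi-head formulas above can be illustrated with a small NumPy sketch. The dimensions, head count, and random weights are illustrative assumptions; an actual BERT uses learned parameters, 12 layers, and 12 heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(dk)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # query, key, value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled attention scores
    return softmax(scores) @ V

def multi_head(X, heads):
    """Multi-head variant: Multi-Head = Concat(Z_0, ..., Z_i)."""
    Zs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(Zs, axis=-1)       # concatenated feature matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings
# two heads, each projecting 8 -> 4 dimensions (illustrative sizes)
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Z = multi_head(X, heads)     # shape (4, 8): 2 heads x 4 dimensions each
```

The per-row softmax guarantees that each token's attention weights over the other tokens sum to one, which is what makes the output a weighted average of the value vectors.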
Step 3: encode the word vectors V generated in Step 2 into a fixed-dimension vector, i.e., input the word vectors V into a bidirectional LSTM network, which encodes them into a fixed-dimension vector S so as to capture more comprehensive contextual semantic features. Concretely:

The word vectors V are input into the bidirectional LSTM network; at time t the input is xt (the unidirectional LSTM is described as an example). The forget gate ft decides which information to discard or keep:

ft = σ(Wf·[h(t-1), xt] + bf).

The input gate it updates the neuron state:

it = σ(Wi·[h(t-1), xt] + bi),
C̃t = tanh(WC·[h(t-1), xt] + bC).

The current neuron state Ct is expressed as:

Ct = ft ∗ C(t-1) + it ∗ C̃t,

and the output gate and hidden state are:

ot = σ(Wo·[h(t-1), xt] + bo),
ht = ot ∗ tanh(Ct),

where Wf, Wi, WC, Wo are weights; bf, bi, bC, bo are biases; h(t-1) is the output of the previous time step; C(t-1) is the neuron state at the previous time step; tanh is the activation function; and σ is the sigmoid function. The backward pass of the LSTM is the same as the forward pass, and the network finally generates the fixed-dimension vector S.
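The gate equations above can be sketched as a single forward LSTM step. This is a minimal illustration with random weights; a bidirectional network would run a second pass in reverse and combine the two hidden sequences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One forward LSTM step following the gate equations:
    forget gate f_t, input gate i_t, candidate state, neuron state C_t,
    output gate o_t, hidden state h_t."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: discard or keep
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: update state
    C_tilde = np.tanh(W["C"] @ z + b["C"])  # candidate neuron state
    C = f * C_prev + i * C_tilde            # current neuron state C_t
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    h = o * np.tanh(C)                      # hidden state h_t
    return h, C

rng = np.random.default_rng(1)
d_in, d_h = 6, 4  # illustrative input and hidden sizes
W = {k: rng.normal(size=(d_h, d_h + d_in)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # run over 5 time steps
    h, C = lstm_step(x_t, h, C, W, b)   # final h plays the role of S
```

Because h = o ∗ tanh(C) with o in (0, 1), every component of the hidden state stays strictly inside (-1, 1), which is one reason the fixed-dimension encoding remains well-behaved over long inputs.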
Step 4: decode the fixed-dimension vector S, generating the event type, event arguments, and their argument roles to construct the complete alert-record event. That is, initialize the unidirectional decoder LSTM network with the vector S generated in Step 3, i.e., set h0^D = S. Let the target alert-record event vector sequence be T = {y0, y1, y2, ..., ym}, where y0 and ym are the embedding vectors of the decoding start identifier SOS and end identifier EOS; ht^D and ot^D (t ≥ 1) are the hidden state and decoded output vector of the decoding module at step t; ot (t ≥ 1) is the event-element output of the decoding module at step t; et is the encoder-decoder attention of the decoding module at step t; yt ∈ T (0 < t < m) is the event vector at step t; and [et; y(t-1)] is the decoder input at step t:

ht^D = LSTM([et; y(t-1)], h(t-1)^D),
ot^D = wv·ht^D + bv,
ot = softmax(ot^D),
ut^i = wu·hi, qt^i = wq·h(t-1)^D + bq,
at^i = va·tanh(qt^i + ut^i), at = softmax(at),
yt = embedding(ot), 0 < t < m,

where wv, wu, wq, va are weights, bv, bq are biases, and hi is the i-th encoder hidden state.
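The decoder recurrence and additive attention above can be sketched as follows. For brevity a plain tanh recurrence stands in for the LSTM cell, and the context vector e_t is taken as the attention-weighted sum of encoder states — both are assumptions for illustration, not the patent's exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(h_enc, h_dec_prev, p):
    """Additive attention: a_t^i = v_a tanh(q_t^i + u_t^i), a_t = softmax(a_t)."""
    u = h_enc @ p["w_u"].T                  # u_t^i = w_u h_i, per encoder state
    q = p["w_q"] @ h_dec_prev + p["b_q"]    # q_t^i = w_q h_{t-1}^D + b_q
    a = softmax(np.tanh(u + q) @ p["v_a"])  # weights over encoder positions
    return a @ h_enc                        # context e_t (weighted sum, assumed)

def decode_step(h_enc, y_prev, h_dec_prev, p):
    e_t = attention(h_enc, h_dec_prev, p)
    x = np.concatenate([e_t, y_prev])       # decoder input [e_t; y_{t-1}]
    # plain tanh recurrence stands in for the LSTM cell (for brevity)
    h_t = np.tanh(p["W_x"] @ x + p["W_r"] @ h_dec_prev)
    o_t = softmax(p["w_v"] @ h_t + p["b_v"])  # o_t = softmax(w_v h_t^D + b_v)
    return h_t, o_t

rng = np.random.default_rng(2)
d_h, d_y, vocab = 4, 3, 7                   # illustrative sizes
h_enc = rng.normal(size=(5, d_h))           # 5 encoder hidden states
p = {
    "w_u": rng.normal(size=(d_h, d_h)), "w_q": rng.normal(size=(d_h, d_h)),
    "b_q": np.zeros(d_h), "v_a": rng.normal(size=d_h),
    "W_x": rng.normal(size=(d_h, d_h + d_y)), "W_r": rng.normal(size=(d_h, d_h)),
    "w_v": rng.normal(size=(vocab, d_h)), "b_v": np.zeros(vocab),
}
h = h_enc[-1]                               # h_0^D initialized from encoder S
y = np.zeros(d_y)                           # embedding of SOS
h, o = decode_step(h_enc, y, h, p)          # o is a distribution over outputs
```

At each step the argmax of o would select the next event element (type, separator, argument, or role), whose embedding feeds the following step.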
Decoding begins from the input SOS. The decoder first generates an event type, then the special separator ';', an event argument, the special separator ':', and an argument role, and so on, outputting the event until decoding ends at the input EOS. Concretely:

Step t = 1: input [e1; y0], generate the event type, output o1;
Step t = 2: input [e2; y1], generate the special character ';', output o2;
Step t = 3: input [e3; y2], generate an event argument, output o3;
Step t = 4: input [e4; y3], generate the special character ':', output o4;
Step t = 5: input [e5; y4], generate an argument role, output o5;
and so on, until the input [e(m+1); ym] ends decoding, outputting the event {o1, o2, ..., o(m-1)}.

The alert-record event is then constructed from the decoded output in the form: event type; event argument: argument role; event argument: argument role | event type; event argument: argument role; event argument: argument role, where ';' is the special separator between the event type and an event argument and between event arguments, ':' is the separator between an event argument and its argument role, and '|' is the separator between events.
Steps 1-4 are repeated to extract the events of all alert-record data units in the alert-record file, completing event extraction for the entire file.
The following example illustrates the result of the above method:

For example, the system receives the alert-record text: "Between 20:12 on May 10, 2019 and 22:44 on May 10, 2019, I transferred a total of 32,924.4 RMB in 25 transactions through my account X (debiting my X merchant bank card) to 25 QR codes provided in turn by the other party's account X. I was at David Daoko X University, Cambodia, in the Tianhe District of Guangzhou, where I performed the transfers."

Output: money transfer; 32,924.4 RMB: amount; David Daoko X University, Cambodia, Tianhe District, Guangzhou: location; 20:12 on May 10, 2019 to 22:44 on May 10, 2019: time.
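The linearized output format described above ('type; argument: role; ... | type; ...') can be parsed back into structured events. A minimal sketch — the dictionary field names are assumptions for illustration:

```python
def parse_events(decoded):
    """Parse the linearized decoder output into structured events.
    Format: 'type; arg: role; arg: role | type; arg: role; ...'"""
    events = []
    for chunk in decoded.split("|"):                   # '|' separates events
        parts = [p.strip() for p in chunk.split(";")]  # ';' separates fields
        event = {"type": parts[0], "arguments": []}
        for field in parts[1:]:
            if not field:
                continue
            arg, _, role = field.partition(":")  # ':' separates arg from role
            event["arguments"].append((arg.strip(), role.strip()))
        events.append(event)
    return events

out = parse_events("money transfer; 32924.4 RMB: amount; Guangzhou: location")
```

Running this on the example output above yields one event of type "money transfer" with (argument, role) pairs such as ("32924.4 RMB", "amount").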
The invention solves both the error-propagation problem of current alert-record event extraction and its inability to fully exploit the interaction between subtasks, and also handles events in alert-record data that have no trigger words. It achieves efficient extraction of alert-record events, improves the precision and recall of alert-record event extraction, and provides data support and convenience for crime prevention by public security departments.
Claims (2)
1. A sequence-to-sequence method for extracting alert-record events, characterized by comprising the following steps: Step 1, preprocess the alert record data, segmenting it sentence by sentence to form a word set for each sentence; Step 2, vectorize and encode the preprocessed sentences into word vectors; Step 3, encode the word vectors generated in Step 2 into a fixed-dimension vector; Step 4, decode the fixed-dimension vector to construct the complete alert-record event.
2. The sequence-to-sequence extraction method for alert-record events according to claim 1, wherein Step 1 comprises: let X = {x1, x2, ..., xn} denote an alert-record data unit, where xn is a character in the data and n is the length of the data unit; first, split the alert-record data unit into individual characters; then recognize words against a preset vocabulary and store each recognized word until the iteration count is exhausted, generating preprocessed data W = {w1, w2, ..., wk}, where wk denotes a word in the preprocessed data and k (k <= n) is the length of the preprocessed data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010292535.6A CN111507107A (en) | 2020-04-15 | 2020-04-15 | Sequence-to-sequence-based extraction method for alert condition record events |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111507107A true CN111507107A (en) | 2020-08-07 |
Family
ID=71864682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010292535.6A Withdrawn CN111507107A (en) | 2020-04-15 | 2020-04-15 | Sequence-to-sequence-based extraction method for alert condition record events |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111507107A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112347783A (en) * | 2020-11-11 | 2021-02-09 | 湖南数定智能科技有限公司 | Method for identifying types of alert condition record data events without trigger words |
CN112347783B (en) * | 2020-11-11 | 2023-10-31 | 湖南数定智能科技有限公司 | Alarm condition and stroke data event type identification method without trigger words |
CN112765980A (en) * | 2021-02-01 | 2021-05-07 | 广州市刑事科学技术研究所 | Event argument role extraction method and device for alert condition record |
CN114936563A (en) * | 2022-04-27 | 2022-08-23 | 苏州大学 | Event extraction method and device and storage medium |
CN114610866A (en) * | 2022-05-12 | 2022-06-10 | 湖南警察学院 | Sequence-to-sequence combined event extraction method and system based on global event type |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111507107A (en) | Sequence-to-sequence-based extraction method for alert condition record events | |
CN110209836B (en) | Remote supervision relation extraction method and device | |
CN110119765B (en) | Keyword extraction method based on Seq2Seq framework | |
CN109471895B (en) | Electronic medical record phenotype extraction and phenotype name normalization method and system | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN110928997A (en) | Intention recognition method and device, electronic equipment and readable storage medium | |
CN113407660B (en) | Unstructured text event extraction method | |
CN111400494B (en) | Emotion analysis method based on GCN-Attention | |
CN112926303A (en) | Malicious URL detection method based on BERT-BiGRU | |
CN112507995B (en) | Cross-model face feature vector conversion system and method | |
CN112612871B (en) | Multi-event detection method based on sequence generation model | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN114254655B (en) | Network security tracing semantic identification method based on prompt self-supervision learning | |
CN111814489A (en) | Spoken language semantic understanding method and system | |
CN114443827A (en) | Local information perception dialogue method and system based on pre-training language model | |
CN113806554A (en) | Knowledge graph construction method for massive conference texts | |
CN115438215A (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN115687609A (en) | Zero sample relation extraction method based on Prompt multi-template fusion | |
CN115512195A (en) | Image description method based on multi-interaction information fusion | |
CN114610866A (en) | Sequence-to-sequence combined event extraction method and system based on global event type | |
CN112069825B (en) | Entity relation joint extraction method for alert condition record data | |
CN113704473A (en) | Media false news detection method and system based on long text feature extraction optimization | |
CN113051904A (en) | Link prediction method for small-scale knowledge graph | |
CN112347783A (en) | Method for identifying types of alert condition record data events without trigger words | |
CN113505937B (en) | Multi-view encoder-based legal decision prediction system and method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20200807