CN109635280A - Event extraction method based on annotation - Google Patents
Event extraction method based on annotation
- Publication number
- CN109635280A, CN201811400437.9A
- Authority
- CN
- China
- Prior art keywords
- event
- annotation
- sentence
- entity
- corpus
- Prior art date
- 2018-11-22
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The invention belongs to the field of information extraction technology and discloses an event extraction method based on annotation, which performs event extraction by combining an annotation scheme with a neural network. Data processing is first carried out to obtain annotation labels for the event entities; a neural network is then trained on the annotated data to obtain structured event extraction results. For an input sentence, the entity labels of an event are obtained directly through a neural-network-based joint entity recognition and event extraction model. Because the method obtains the entity labels of an event directly, it causes no error propagation and produces no redundant information, effectively reducing the error rate. Furthermore, the method builds its loss function by raising the weights of minority tag classes, which alleviates the tendency, caused by class imbalance, of the model to predict the majority class.
Description
Technical field
The invention belongs to the field of information extraction technology, and more particularly relates to an event extraction method based on annotation.
Background art
Information extraction technology extracts factual information of specified types, such as entities, relations, and events, from loose, unstructured plain text and outputs structured information. For example, in WeChat Reading, a reader's interest preferences can be obtained by extraction from the books the reader has read, and books relevant to those interests can be pushed. In the news field, event extraction lets a reader grasp, in the shortest time, the content a long news article expresses. Event extraction can be applied not only in the Internet field but also in other fields; in the medical field, for example, the disease event of a patient can be quickly determined from the diagnostic notes and the patient's description of symptoms, giving the patient a clearer understanding of the condition.
In the field of information extraction, event extraction (Event Extraction) is one of the most challenging tasks in information extraction research; it mainly studies how to extract structured event information from unstructured text. For a news event, for example, the time, place, and participants of the event are extracted to form a piece of structured text information.
Many event extraction methods currently exist. Some are based on traditional machine learning, such as the hidden Markov model (Hidden Markov Model, HMM) and the conditional random field (Conditional Random Field, CRF); others are deep learning methods based on neural networks, chiefly represented by the convolutional neural network (Convolutional Neural Network, CNN), the recurrent neural network (Recurrent Neural Network, RNN), and the long short-term memory network (Long Short-Term Memory, LSTM). Traditional machine learning methods use natural language processing tools to perform sentence segmentation, word segmentation, entity recognition, and syntactic and dependency analysis on the text, extract lexical and semantic features from the context of each candidate word, and construct feature vectors as the input of a classifier; the classifier predicts the trigger word of an event, and the event type is judged from the type of the trigger word. Neural-network-based deep learning methods first preprocess the raw training corpus; the event sentence, represented as a word-vector sequence, is fed into a neural network that is trained to obtain the semantic features of each candidate trigger word; the same word-vector sequence is fed into a convolutional neural network that is trained to obtain each candidate trigger word's global features over the sentence; finally, based on the semantic features of a candidate trigger word and its sentence-level global features, Softmax normalizes the prediction scores of each candidate trigger word into a classification result, and the event type is judged from the trigger word type.
Existing event extraction methods mainly suffer from two problems. First, event extraction is divided into two stages, entity extraction and event discrimination, so errors of named entity recognition affect event discrimination and accumulate. Second, for a specific field, a large number of manual features usually must be constructed; the feature selection process is very costly, and as model complexity grows, maintainability becomes worse and worse.
Summary of the invention
Aiming at the above defects or improvement needs of the prior art, the present invention provides an event extraction method based on annotation that extracts events and entities jointly; its object is to improve the event extraction method so as to reduce cumulative errors, and to replace manual feature selection with a neural network, reducing the labor cost of the feature selection process.
To achieve the above object, according to one aspect of the present invention, an event extraction method based on annotation is provided, comprising the following steps:
(1) Constructing a corpus: the content to be extracted is taken as corpus material, the set of all material forms the corpus for event extraction, and the corpus is segmented into sentences;
The sentences of the corpus are classified, and the sentences containing entities and events are filtered out;
Here an entity refers to a thing that exists objectively and can be distinguished from other things, and an event refers to a change of things or states occurring in a specific time period and composed of one or more actions in which one or more roles participate;
(2) Constructing the training set and test set: the corpus text is randomly divided into two data sets, a training set and a test set; in a preferred embodiment, the ratio of training-set text to test-set text is 4:1;
(3) Annotating the corpus: the entities in a sentence are annotated in the form {boundary position}-{event}-{entity}: the entity position information is one of {B (entity begin), I (entity inside), E (entity end), S (single-character entity)}, the event type is encoded according to the relation types predefined for the corpus, and the entity type information identifies the entity class; all other parts of the sentence are labeled "O";
(4) Constructing the neural network model: a Bi-LSTM (Bi-directional Long Short-Term Memory) is used as the neural network model; the Bi-LSTM is composed of two LSTM networks that are identical in structure but do not share weights;
The annotated corpus is input in the forward and reverse directions into the two LSTM networks respectively, producing forward and reverse feature vectors; the two feature vectors are concatenated into a contextual feature vector, the predicted tag probabilities are computed from the contextual feature vector, and the loss function is established from the tag probabilities;
(5) Training the neural network model: the network parameters are initialized, and the training data are input to optimize the parameters of the neural network model; the neural network model is evaluated on the test set, and when the extraction accuracy on the test set lies within a preset threshold interval, the neural network model is judged to have converged;
(6) Performing event prediction on the text to be extracted: the text to be extracted is input into the trained neural network model, and the tag of each character of the sequence is predicted; the text is spliced according to the tag predictions to obtain the structured event extraction result.
Preferably, in the above annotation-based event extraction method, in step (4), the method for obtaining the contextual feature vector of a sequence S comprises:
(4.1) For an input sequence $S = (x_1, x_2, \dots, x_n)$ of length $n$, the word vectors are input one by one into the forward LSTM network to obtain the forward feature vectors $\overrightarrow{h_t}$;
(4.2) For the input sequence $S = (x_1, x_2, \dots, x_n)$, the word vectors are input one by one into the reverse LSTM network to obtain the reverse feature vectors $\overleftarrow{h_t}$;
(4.3) The forward and reverse feature vectors are concatenated to obtain the contextual feature vector of the sequence S, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$;
(4.4) Based on the above contextual feature vector, the predicted tag probability is computed through the normalization layer (Softmax) of the neural network:

$$y_t = W_y h_t + b_y, \qquad p_t^i = \frac{\exp(y_t^i)}{\sum_{k=1}^{N_t} \exp(y_t^k)}$$

where $W_y$ is the parameter matrix of the Softmax layer, $N_t$ is the number of all tags, $b_y$ is the bias of the linear layer, $i$ is the index of the tag, and $t$ is the index of the word in the sentence;
(4.5) The loss function is defined as:

$$L = \max \sum_{j=1}^{|D|} \sum_{t=1}^{L_j} L_{jt}, \qquad L_{jt} = \big( I(O) + \alpha \cdot (1 - I(O)) \big) \cdot \log p\big( y_t^{(j)} \mid x_j, \Theta \big)$$

where $|D|$ is the size of the training set; $L_j$ is the length of the $j$-th sentence and $L_{jt}$ is the cross-entropy term of the $t$-th word in the $j$-th sentence; $t$ is the index of the word within the sentence and $j$ is the index of the sentence within the training set; $y_t^{(j)}$ is the annotated tag of the $t$-th word in the $j$-th sentence and $p(y_t^{(j)} \mid x_j, \Theta)$ is the probability the model predicts for that tag; $I(O)$ is a switching function that equals 1 when the annotated tag is 'O' and 0 otherwise; $\alpha$ is the weight of the annotated tag classes: the larger the weight, the greater its influence on classification and the greater its influence on the model; 'O' is the label that step (3) assigns to the other parts of a sentence.
Unlike existing classification and sequence labeling approaches that use plain cross entropy as the loss function, this step assigns different weights to different event classes; by regulating the training weight of each event class, the influence of event class imbalance on the prediction result is effectively reduced.
Preferably, in the above annotation-based event extraction method, in step (1), to guarantee the generalization performance of the extraction model, some sentences containing no entity or event are also added to the corpus text, yielding a sentence-level corpus text; this increases the noise seen by the neural network model and improves the generalization ability of the neural network.
Preferably, in the above annotation-based event extraction method, in step (1), the proportion of sentences containing no event in the corpus text is at most 10%.
Preferably, in the above annotation-based event extraction method, the prediction results are stored in a data table, yielding structured text information.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The annotation-based event extraction method provided by the invention performs event extraction by combining annotation with a neural network: data processing first produces the annotation labels of the event entities, a neural network is then trained on the annotated data, and structured event extraction results are obtained; for an input sentence, the entity labels of an event are obtained directly through the neural-network-based joint entity recognition and event extraction model;
The prior art extracts with a pipeline method: named entity recognition is first performed on the input sentence, the recognized entities and events are combined pairwise, relation classification is then carried out, and finally the sentences in which an event-entity relation exists are taken as input. This pipeline mode lets the errors of the entity recognition module affect the relation classification performance; it also ignores the relationship that exists between the two subtasks, and because the recognized events and entities are matched pairwise before classification, the unrelated entity pairs introduce redundant information and raise the error rate;
In comparison, because the method provided by the invention can obtain the entity labels of an event directly, it causes no error propagation and no redundancy, effectively reducing the error rate;
(2) The annotation-based event extraction method provided by the invention obtains its loss function by raising the weights of minority classes, which alleviates the tendency, caused by class imbalance, of the model to predict the majority class. Class imbalance is a common problem in natural language processing; the prior art enlarges or shrinks the data set, which suffers from information loss. In comparison, this method of the invention alleviates the class imbalance problem better;
(3) The annotation-based event extraction method provided by the invention uses a neural network structure that is simple to apply.
Brief description of the drawings
Fig. 1 is a schematic diagram of the corpus construction process in one embodiment of the annotation-based event extraction method provided by the invention;
Fig. 2 is a schematic diagram of the annotated corpus in one embodiment of the annotation-based event extraction method provided by the invention;
Fig. 3 is a schematic diagram of the structure of the Bi-LSTM in one embodiment of the annotation-based event extraction method provided by the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
The annotation-based event extraction method provided by the invention comprises: obtaining a corpus, annotating the obtained corpus, and preprocessing it before input to the neural network model; setting the neural network parameters, obtaining effective features with the neural network, and continuously optimizing the parameters during training so that the generalization ability of the model keeps strengthening; and obtaining the prediction probabilities with the normalization function Softmax and storing the prediction results in a structured data table. The method is further described below with reference to Figs. 1 to 3 and an embodiment.
The annotation-based event extraction method provided by the embodiment is applied to news event extraction and comprises the following steps:
(1) Corpus construction: corpus material is obtained from news websites, and the acquired news text is segmented into sentences at the newline symbol "\n" and the period "。"; information useless to event extraction, such as the author and the source of a report, is removed. The sentences in the news corpus are then classified to filter out those containing entities and events. In a preferred embodiment, 10% of sentences containing no event are added to increase the noise seen by the neural network model and improve its generalization ability; a minimal sketch of the segmentation step is given below.
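The following Python sketch illustrates the segmentation step; the byline and source filter is an assumption, since the embodiment does not specify how useless information is detected.

```python
import re

def segment_corpus(raw_text):
    """Split raw news text into candidate sentences at newlines and the
    Chinese full stop, as described in step (1)."""
    sentences = []
    for piece in re.split(r"[\n。]", raw_text):
        piece = piece.strip()
        if not piece:
            continue
        # Hypothetical rule for dropping bylines and source lines; the
        # patent only says such "useless information" is removed.
        if piece.startswith(("记者", "来源")):
            continue
        sentences.append(piece)
    return sentences
```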
Because the sentences of a news article are closely connected, news corpora commonly contain references whose referent is not explicit, such as the referring nouns "the company", "said company", and "it"; in a preferred embodiment these referring nouns are converted to the actual entities. For example, in "On December 20, 2017, ** Technology completed a Round A financing of RMB 100 million; this is the company's first financing", "the company" is converted to "** Technology". In the embodiment, actual company names are replaced with one or more "*" characters.
(2) Constructing the training set and test set: the screened sentences are randomly divided at a ratio of 4:1 into two data sets, used as the training set and the test set respectively, for example as sketched below.
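A small sketch of the random 4:1 split, assuming the labeled sentences are held in a Python list; the fixed seed is illustrative.

```python
import random

def split_corpus(sentences, train_ratio=0.8, seed=42):
    """Randomly divide labeled sentences into a training set and a test
    set at the 4:1 ratio used in the embodiment."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```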
(3) Corpus annotation:
(3.1) Determine the event types, such as "financing", "investment", "marriage";
(3.2) Determine the entity classes, such as "time", "person name", "company name", "organization name", "amount", "round";
(3.3) Annotate each sentence by the tagging method corresponding to each word; apart from noise text, a sentence must contain an event and at least one entity.
Referring to Fig. 2: the sentence "News on July 23: **** Education announced it has obtained RMB 130 million in Round A financing." contains four kinds of entities: the time is "July 23", the company name is "**** Education", the amount is "RMB 130 million", and the round is "Round A"; the event it contains is "financing".
When an entity is annotated, its position, event name, and entity type are marked. In the embodiment, "**** Education" is labeled "B-RZ-GS I-RZ-GS I-RZ-GS E-RZ-GS", "July 23" is labeled "B-RZ-SJ I-RZ-SJ I-RZ-SJ E-RZ-SJ", "RMB 130 million" is labeled "B-RZ-JE I-RZ-JE I-RZ-JE I-RZ-JE E-RZ-JE", and finally "Round A" is labeled "B-RZ-LC E-RZ-LC"; all other content is labeled "O". The event type in the example is financing, denoted "RZ" in the tags; the entity types involved are time, company name, amount, and round, denoted "SJ", "GS", "JE", "LC" respectively when annotating; the entity position is denoted with {B, I, E, S}. A sketch of this labeling rule follows.
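The labeling rule can be written compactly as below; the helper names are hypothetical, and entity spans are assumed to be given as character offsets.

```python
def bies_tags(entity_len, event, ent_type):
    """Generate the composite {B,I,E,S}-{event}-{entity type} labels of
    step (3), e.g. bies_tags(4, "RZ", "GS") returns
    ["B-RZ-GS", "I-RZ-GS", "I-RZ-GS", "E-RZ-GS"]."""
    if entity_len == 1:
        return [f"S-{event}-{ent_type}"]
    inside = [f"I-{event}-{ent_type}"] * (entity_len - 2)
    return [f"B-{event}-{ent_type}"] + inside + [f"E-{event}-{ent_type}"]

def label_sentence(sentence, entities):
    """Tag every character of a sentence; `entities` is a list of
    (start, end_exclusive, event, ent_type) spans, and every character
    outside a span receives the "O" tag."""
    tags = ["O"] * len(sentence)
    for start, end, event, ent_type in entities:
        tags[start:end] = bies_tags(end - start, event, ent_type)
    return list(zip(sentence, tags))
```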
(4) Constructing the neural network structure model:
The embodiment uses a Bi-LSTM network. A sentence $X = (w_1, w_2, \dots, w_t, \dots, w_n)$ of length $n$ serves as the input of the neural network, where $w_t$ is the $t$-th word of the sentence and the corresponding word-vector sequence is $X = (x_1, x_2, \dots, x_n)$. The word vectors of the sentence are input one by one into a single-layer long short-term memory network to obtain each word $w_t$'s preceding-context feature vector $\overrightarrow{h_t}$. In this embodiment the LSTM network computes over the input word vectors as follows:

$$i_t = \delta(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \delta(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \delta(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\delta$ is the sigmoid function and $W_{(\cdot)}$ and $b_{(\cdot)}$ are parameters of the neural network. In the same way, the sentence is input in reverse into another long short-term memory network, computing each word $w_t$'s following-context feature vector $\overleftarrow{h_t}$.
Here the first LSTM network is the forward LSTM and the second is the backward LSTM. The two LSTMs are trained in the same way, and their parameters $W_{(\cdot)}$ and $b_{(\cdot)}$ carry the same meanings, but the parameters of the neurons are not shared, so the values of $W_{(\cdot)}$ and $b_{(\cdot)}$ differ between the two networks. For an input word vector $x_t$, the forward LSTM and the backward LSTM respectively yield the feature vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$; the forward and reverse feature vectors output by the two LSTMs are concatenated to obtain the word $w_t$'s contextual feature vector $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
Based on the contextual feature vector $h_t$, the predicted tag probabilities are computed by the Softmax layer:

$$y_t = W_y h_t + b_y, \qquad p_t^i = \frac{\exp(y_t^i)}{\sum_{k=1}^{N_t} \exp(y_t^k)}$$

where $W_y$ is the parameter matrix of the Softmax layer, $N_t$ is the number of all tags, $b_y$ is the bias of the linear layer, $i$ is the index of the tag, and $t$ is the index of the word in the sentence. A model sketch is given below.
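A minimal PyTorch sketch of the Bi-LSTM tagger described in this step, assuming character-level inputs; the embedding and hidden sizes are illustrative, not values disclosed by the patent.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Character embeddings feed a bidirectional LSTM; the forward and
    backward hidden states are concatenated into the contextual feature
    h_t, and a linear layer plus softmax yields per-character tag
    probabilities (y_t = W_y h_t + b_y)."""

    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True runs two LSTMs with unshared weights,
        # matching the forward/backward networks described above.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.linear = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embedding(token_ids)   # (batch, seq, emb_dim)
        h, _ = self.bilstm(x)           # (batch, seq, 2 * hidden_dim)
        logits = self.linear(h)         # (batch, seq, num_tags)
        return torch.log_softmax(logits, dim=-1)
```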
The objective function, i.e. the loss function, is defined as:

$$L = \max \sum_{j=1}^{|D|} \sum_{t=1}^{L_j} L_{jt}, \qquad L_{jt} = \big( I(O) + \alpha \cdot (1 - I(O)) \big) \cdot \log p\big( y_t^{(j)} \mid x_j, \Theta \big)$$

where $|D|$ is the size of the training set; $L_j$ is the length of the $j$-th sentence and $L_{jt}$ is the cross-entropy term of the $t$-th word in the $j$-th sentence; $t$ is the index of the word within the sentence and $j$ is the index of the sentence within the training set; $y_t^{(j)}$ is the annotated tag of the $t$-th word in the $j$-th sentence and $p(y_t^{(j)} \mid x_j, \Theta)$ is the probability predicted for that tag; $I(O)$ is a switching function that equals 1 when the annotated tag is 'O' and 0 otherwise; $\alpha$ is the weight of the annotated tag classes: the larger the weight, the greater its influence on classification and on the model; and O is the label of the "other" class. In this step, the class imbalance problem of the training data is alleviated by assigning different weights to the classes, as sketched below.
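A sketch of the weighted objective, continuing the PyTorch model above; tokens whose gold tag is not "O" are up-weighted, and alpha = 5.0 is an assumed value, since the patent does not disclose a concrete α.

```python
import torch

def weighted_tag_loss(log_probs, gold_tags, o_tag_id, alpha=5.0):
    """Weighted cross entropy over per-character tag predictions.

    log_probs: (batch, seq, num_tags) log-probabilities from the tagger.
    gold_tags: (batch, seq) integer tag ids.
    """
    # Negative log-likelihood of the gold tag at each position.
    nll = -log_probs.gather(-1, gold_tags.unsqueeze(-1)).squeeze(-1)
    # Weight 1 for 'O' positions, alpha for entity-bearing positions.
    weights = torch.where(gold_tags == o_tag_id,
                          torch.ones_like(nll),
                          torch.full_like(nll, alpha))
    return (weights * nll).mean()
```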
(5) Training the neural network model:
The parameters of the neural network are initialized randomly: the embedding weights and the parameters of the LSTM and the linear layer all receive random initial values. Different combinations of dropout rate, hidden-layer size, learning rate, and other hyperparameters are tried, and the training results are observed to obtain an optimized parameter combination; a compact training-loop sketch follows.
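A compact training-loop sketch under the same assumptions as above; train_loader and tag_vocab are hypothetical stand-ins for the batched annotated corpus and the tag inventory, and the learning rate and epoch count are illustrative.

```python
import torch

# Assumes BiLSTMTagger and weighted_tag_loss from the sketches above,
# a DataLoader `train_loader` yielding padded (token_ids, gold_tags)
# batches, and a dict `tag_vocab` mapping tag strings to ids.
model = BiLSTMTagger(vocab_size=5000, num_tags=len(tag_vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    for token_ids, gold_tags in train_loader:
        optimizer.zero_grad()
        log_probs = model(token_ids)
        loss = weighted_tag_loss(log_probs, gold_tags,
                                 o_tag_id=tag_vocab["O"])
        loss.backward()
        optimizer.step()
    # After each epoch, evaluate extraction accuracy on the test set and
    # stop once it lies within the preset threshold interval (step 5).
```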
(6) Event prediction on the text to be extracted: the text to be extracted is input into the trained neural network to obtain the annotation result for each character of the text. The meanings represented by the labels are then read out to form pieces of text information, and these pieces are spliced into a text sentence, yielding structured text information; alternatively, they are stored separately in a data structure.
In a preferred embodiment, the extracted results are stored in a data table; the corresponding entities, events, and other information are written into the table, through which the final result of event extraction can be identified at a glance. A decoding sketch that assembles per-character tags into records is given below.
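The splicing of per-character tag predictions into structured records can be sketched as follows; the function name and the record format are assumptions.

```python
def decode_entities(chars, tags):
    """Assemble predicted per-character tags into (event, entity_type,
    text) triples, e.g. the tags B-RZ-SJ / I-RZ-SJ / I-RZ-SJ / E-RZ-SJ
    over "7月23日" yield ("RZ", "SJ", "7月23日")."""
    records, buf, current = [], [], None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buf, current = [], None
            continue
        boundary, event, ent_type = tag.split("-")
        if boundary in ("B", "S"):
            buf, current = [ch], (event, ent_type)
        elif current == (event, ent_type):
            buf.append(ch)   # I or E continuing the open entity
        if boundary in ("E", "S") and current is not None:
            records.append((event, ent_type, "".join(buf)))
            buf, current = [], None
    return records
```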
Referring to Table 1 below, the results of news event extraction are displayed in a data table; the basic content of a news event can be understood very clearly through the table.
Table 1. News event extraction result data table

Company name | Time | Event | Round | Amount
---|---|---|---|---
** Technology | 2017-12-20 | Financing | Round A | RMB 100 million
**** Education | July 23 | Financing | Round A | RMB 130 million
* Xiaomei | 2017-08-13 | Financing | … | RMB 30 million
In the annotation-based event extraction method provided by the invention, data processing first produces the annotation labels of the event entities; a neural network is then trained on the annotated data, and the structured event extraction results are finally obtained. A new annotation scheme is proposed and combined with a neural network method to perform event extraction; a corpus of the news domain is constructed, and the task of event extraction is converted into a classification and sequence labeling task, ultimately into a task similar to named entity recognition (NER). On the other hand, through the treatment of the loss function in this method, regulating the training weight of each event class against event class imbalance, the influence of class imbalance on the prediction result can be effectively reduced.
Those skilled in the art will readily understand that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (6)
1. An event extraction method based on annotation, characterized by comprising the following steps:
(1) Constructing a corpus: taking the content to be extracted as corpus material, the set of all material forming the corpus for event extraction, and segmenting the corpus into sentences; classifying the sentences of the corpus and filtering out the sentences containing entities and events;
(2) Randomly dividing the corpus text into two data sets, a training set and a test set, respectively;
(3) Annotating the corpus: entities in a sentence are annotated in the form {boundary position}-{event}-{entity}: the entity position information is one of {B (entity begin), I (entity inside), E (entity end), S (single-character entity)}, the event type is encoded according to the relation types predefined for the corpus, and the entity type information identifies the entity class; all other parts of the sentence are marked with the label "O";
(4) Using a Bi-LSTM as the neural network model, inputting the annotated corpus in the forward and reverse directions into the two LSTM networks of the Bi-LSTM respectively, obtaining forward and reverse feature vectors, and concatenating the two feature vectors into a contextual feature vector; computing the predicted tag probabilities from the contextual feature vector, and establishing the loss function from the tag probabilities;
(5) Initializing the network parameters and inputting the training data to optimize the parameters of the neural network model; evaluating the neural network model on the test set, and judging that the neural network model has converged when the extraction accuracy on the test set lies within a preset threshold interval;
(6) Inputting the text to be extracted into the trained neural network model and predicting the tag of each character of the sequence; splicing the text according to the tag predictions to obtain the structured event extraction result.
2. The event extraction method based on annotation according to claim 1, characterized in that the entity refers to a thing that exists objectively and can be distinguished from other things, and an event refers to a change of things or states occurring in a specific time period and composed of one or more actions in which one or more roles participate.
3. The event extraction method based on annotation according to claim 1 or 2, characterized in that, in step (4), the method for obtaining the contextual feature vector comprises:
(4.1) for an input sequence $S = (x_1, x_2, \dots, x_j, \dots, x_n)$ of length $n$, inputting the word vectors one by one into the forward LSTM network to obtain the forward feature vectors $\overrightarrow{h_t}$;
(4.2) for the input sequence $S = (x_1, x_2, \dots, x_j, \dots, x_n)$, inputting the word vectors one by one into the reverse LSTM network to obtain the reverse feature vectors $\overleftarrow{h_t}$;
(4.3) concatenating the forward and reverse feature vectors to obtain the contextual feature vector of the sequence S, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$;
(4.4) based on the contextual feature vector, computing the predicted tag probability through the normalization layer (Softmax) of the neural network:

$$y_t = W_y h_t + b_y, \qquad p_t^i = \frac{\exp(y_t^i)}{\sum_{k=1}^{N_t} \exp(y_t^k)}$$

wherein $W_y$ is the parameter matrix of the Softmax layer, $N_t$ is the number of all tags, $b_y$ is the bias of the linear layer, and $i$ is the index of the tag;
(4.5) defining the loss function as:

$$L = \max \sum_{j=1}^{|D|} \sum_{t=1}^{L_j} L_{jt}, \qquad L_{jt} = \big( I(O) + \alpha \cdot (1 - I(O)) \big) \cdot \log p\big( y_t^{(j)} \mid x_j, \Theta \big)$$

wherein $|D|$ is the size of the training set, $L_j$ is the length of the sentence, $y_t^{(j)}$ denotes the tag of the $t$-th word in the $j$-th sentence, $p(y_t^{(j)} \mid x_j, \Theta)$ denotes the predicted probability of that tag, $\alpha$ is the weight of the annotated tag classes, and $I(O)$ is a switching function expressing the relationship between the objective function and the tag 'O', the label with which step (3) marks the other parts of a sentence.
4. The event extraction method based on annotation according to claim 1 or 2, characterized in that, in step (1), some sentences containing no entity or event are also added to the corpus text to obtain a sentence-level corpus text, so as to increase the noise seen by the neural network model and improve the generalization ability of the neural network.
5. The event extraction method based on annotation according to claim 4, characterized in that, in step (1), the proportion of sentences containing no event in the corpus text is at most 10%.
6. The event extraction method based on annotation according to claim 1 or 2, characterized in that the prediction results are stored in a data table to obtain structured text information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811400437.9A CN109635280A (en) | 2018-11-22 | 2018-11-22 | Event extraction method based on annotation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811400437.9A CN109635280A (en) | 2018-11-22 | 2018-11-22 | Event extraction method based on annotation
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635280A true CN109635280A (en) | 2019-04-16 |
Family
ID=66068948
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811400437.9A Pending CN109635280A (en) | Event extraction method based on annotation | 2018-11-22 | 2018-11-22
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635280A (en) |
-
2018
- 2018-11-22 CN CN201811400437.9A patent/CN109635280A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095928A (en) * | 2016-06-12 | 2016-11-09 | 国家计算机网络与信息安全管理中心 | A kind of event type recognition methods and device |
CN107122416A (en) * | 2017-03-31 | 2017-09-01 | 北京大学 | A kind of Chinese event abstracting method |
CN107797993A (en) * | 2017-11-13 | 2018-03-13 | 成都蓝景信息技术有限公司 | A kind of event extraction method based on sequence labelling |
CN108304911A (en) * | 2018-01-09 | 2018-07-20 | 中国科学院自动化研究所 | Knowledge Extraction Method and system based on Memory Neural Networks and equipment |
CN108519976A (en) * | 2018-04-04 | 2018-09-11 | 郑州大学 | The method for generating extensive sentiment dictionary based on neural network |
Non-Patent Citations (1)
Title |
---|
QIAN ZHONG ET AL.: "Uncertainty and negation scope detection based on bidirectional LSTM networks" (基于双向LSTM网络的不确定和否定作用范围识别), 《软件学报》 (Journal of Software) *
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110060247A (en) * | 2019-04-18 | 2019-07-26 | 深圳市深视创新科技有限公司 | Cope with the robust deep neural network learning method of sample marking error |
CN110134959B (en) * | 2019-05-15 | 2023-10-20 | 第四范式(北京)技术有限公司 | Named entity recognition model training method and equipment, and information extraction method and equipment |
CN110134959A (en) * | 2019-05-15 | 2019-08-16 | 第四范式(北京)技术有限公司 | Named Entity Extraction Model training method and equipment, information extraction method and equipment |
CN110209721A (en) * | 2019-06-04 | 2019-09-06 | 南方科技大学 | Judgement document transfers method, apparatus, server and storage medium |
CN110489514A (en) * | 2019-07-23 | 2019-11-22 | 成都数联铭品科技有限公司 | Promote system and method, the event extraction method and system of event extraction annotating efficiency |
CN110489514B (en) * | 2019-07-23 | 2023-05-23 | 成都数联铭品科技有限公司 | System and method for improving event extraction labeling efficiency, event extraction method and system |
CN110569506A (en) * | 2019-09-05 | 2019-12-13 | 清华大学 | Medical named entity recognition method based on medical dictionary |
CN110765265A (en) * | 2019-09-06 | 2020-02-07 | 平安科技(深圳)有限公司 | Information classification extraction method and device, computer equipment and storage medium |
CN110765265B (en) * | 2019-09-06 | 2023-04-11 | 平安科技(深圳)有限公司 | Information classification extraction method and device, computer equipment and storage medium |
CN111159336A (en) * | 2019-12-20 | 2020-05-15 | 银江股份有限公司 | Semi-supervised judicial entity and event combined extraction method |
CN111159336B (en) * | 2019-12-20 | 2023-09-12 | 银江技术股份有限公司 | Semi-supervised judicial entity and event combined extraction method |
CN111353306A (en) * | 2020-02-22 | 2020-06-30 | 杭州电子科技大学 | Entity relationship and dependency Tree-LSTM-based combined event extraction method |
CN113392967A (en) * | 2020-03-11 | 2021-09-14 | 富士通株式会社 | Training method of domain confrontation neural network |
US11880397B2 (en) | 2020-03-20 | 2024-01-23 | Beijing Baidu Netcom Science Technology Co., Ltd. | Event argument extraction method, event argument extraction apparatus and electronic device |
CN111414482B (en) * | 2020-03-20 | 2024-02-20 | 北京百度网讯科技有限公司 | Event argument extraction method and device and electronic equipment |
CN111325020A (en) * | 2020-03-20 | 2020-06-23 | 北京百度网讯科技有限公司 | Event argument extraction method and device and electronic equipment |
CN111414482A (en) * | 2020-03-20 | 2020-07-14 | 北京百度网讯科技有限公司 | Event argument extraction method and device and electronic equipment |
CN111325020B (en) * | 2020-03-20 | 2023-03-31 | 北京百度网讯科技有限公司 | Event argument extraction method and device and electronic equipment |
CN111460831B (en) * | 2020-03-27 | 2024-04-19 | 科大讯飞股份有限公司 | Event determination method, related device and readable storage medium |
CN111460831A (en) * | 2020-03-27 | 2020-07-28 | 科大讯飞股份有限公司 | Event determination method, related device and readable storage medium |
CN111581387A (en) * | 2020-05-09 | 2020-08-25 | 电子科技大学 | Entity relation joint extraction method based on loss optimization |
CN111581387B (en) * | 2020-05-09 | 2022-10-11 | 电子科技大学 | Entity relation joint extraction method based on loss optimization |
CN111694924B (en) * | 2020-06-17 | 2023-05-26 | 合肥中科类脑智能技术有限公司 | Event extraction method and system |
CN111694924A (en) * | 2020-06-17 | 2020-09-22 | 合肥中科类脑智能技术有限公司 | Event extraction method and system |
CN111985237A (en) * | 2020-06-29 | 2020-11-24 | 联想(北京)有限公司 | Entity extraction method, device and equipment |
CN111881303A (en) * | 2020-07-28 | 2020-11-03 | 内蒙古众城信息科技有限公司 | Graph network structure method for classifying urban heterogeneous nodes |
WO2021139239A1 (en) * | 2020-07-28 | 2021-07-15 | 平安科技(深圳)有限公司 | Mechanism entity extraction method, system and device based on multiple training targets |
CN111859903A (en) * | 2020-07-30 | 2020-10-30 | 苏州思必驰信息科技有限公司 | Event co-fingering model training method and event co-fingering resolution method |
CN111859903B (en) * | 2020-07-30 | 2024-01-12 | 思必驰科技股份有限公司 | Event same-index model training method and event same-index resolution method |
CN111950199A (en) * | 2020-08-11 | 2020-11-17 | 杭州叙简科技股份有限公司 | Earthquake data structured automation method based on earthquake news event |
CN112052646A (en) * | 2020-08-27 | 2020-12-08 | 安徽聚戎科技信息咨询有限公司 | Text data labeling method |
CN112052646B (en) * | 2020-08-27 | 2024-03-29 | 安徽聚戎科技信息咨询有限公司 | Text data labeling method |
CN112036183A (en) * | 2020-08-31 | 2020-12-04 | 湖南星汉数智科技有限公司 | Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium |
CN112036183B (en) * | 2020-08-31 | 2024-02-02 | 湖南星汉数智科技有限公司 | Word segmentation method, device, computer device and computer storage medium based on BiLSTM network model and CRF model |
CN112269901A (en) * | 2020-09-14 | 2021-01-26 | 合肥中科类脑智能技术有限公司 | Fault distinguishing and reasoning method based on knowledge graph |
CN112269901B (en) * | 2020-09-14 | 2021-11-05 | 合肥中科类脑智能技术有限公司 | Fault distinguishing and reasoning method based on knowledge graph |
CN112257441A (en) * | 2020-09-15 | 2021-01-22 | 浙江大学 | Named entity identification enhancement method based on counterfactual generation |
CN112257441B (en) * | 2020-09-15 | 2024-04-05 | 浙江大学 | Named entity recognition enhancement method based on counterfactual generation |
CN112269949B (en) * | 2020-10-19 | 2023-09-22 | 杭州叙简科技股份有限公司 | Information structuring method based on accident disaster news |
CN112269949A (en) * | 2020-10-19 | 2021-01-26 | 杭州叙简科技股份有限公司 | Information structuring method based on accident disaster news |
CN112257449B (en) * | 2020-11-13 | 2023-01-03 | 腾讯科技(深圳)有限公司 | Named entity recognition method and device, computer equipment and storage medium |
CN112257449A (en) * | 2020-11-13 | 2021-01-22 | 腾讯科技(深圳)有限公司 | Named entity recognition method and device, computer equipment and storage medium |
CN113157978B (en) * | 2021-01-15 | 2023-03-28 | 浪潮云信息技术股份公司 | Data label establishing method and device |
CN113157978A (en) * | 2021-01-15 | 2021-07-23 | 浪潮云信息技术股份公司 | Data label establishing method and device |
CN116245139B (en) * | 2023-04-23 | 2023-07-07 | 中国人民解放军国防科技大学 | Training method and device for graph neural network model, event detection method and device |
CN116245139A (en) * | 2023-04-23 | 2023-06-09 | 中国人民解放军国防科技大学 | Training method and device for graph neural network model, event detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635280A (en) | Event extraction method based on annotation | |
Er et al. | Attention pooling-based convolutional neural network for sentence modelling | |
Swathi et al. | An optimal deep learning-based LSTM for stock price prediction using twitter sentiment analysis | |
CN109460473B (en) | Electronic medical record multi-label classification method based on symptom extraction and feature representation | |
CN108090049B (en) | Multi-document abstract automatic extraction method and system based on sentence vectors | |
Jang et al. | Recurrent neural network-based semantic variational autoencoder for sequence-to-sequence learning | |
CN104408153B (en) | A kind of short text Hash learning method based on more granularity topic models | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
Dashtipour et al. | Exploiting deep learning for Persian sentiment analysis | |
CN109325112B (en) | A kind of across language sentiment analysis method and apparatus based on emoji | |
CN107608956A (en) | A kind of reader's mood forecast of distribution algorithm based on CNN GRNN | |
CN112434535B (en) | Element extraction method, device, equipment and storage medium based on multiple models | |
CN111966917A (en) | Event detection and summarization method based on pre-training language model | |
CN113254610B (en) | Multi-round conversation generation method for patent consultation | |
Firdaus et al. | A multi-task hierarchical approach for intent detection and slot filling | |
CN111222318A (en) | Trigger word recognition method based on two-channel bidirectional LSTM-CRF network | |
CN110046356A (en) | Label is embedded in the application study in the classification of microblogging text mood multi-tag | |
CN116127056A (en) | Medical dialogue abstracting method with multi-level characteristic enhancement | |
CN112966117A (en) | Entity linking method | |
CN114936266A (en) | Multi-modal fusion rumor early detection method and system based on gating mechanism | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
Cao et al. | Knowledge guided short-text classification for healthcare applications | |
CN115906835B (en) | Chinese question text representation learning method based on clustering and contrast learning | |
He et al. | Distant supervised relation extraction via long short term memory networks with sentence embedding | |
CN113901172A (en) | Case-related microblog evaluation object extraction method based on keyword structure codes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||