CN110348018A - Method for completing simple event extraction using partial learning - Google Patents

Method for completing simple event extraction using partial learning

Info

Publication number
CN110348018A
CN110348018A
Authority
CN
China
Prior art keywords
model
mark
input
crf
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910642480.4A
Other languages
Chinese (zh)
Inventor
陈文亮
王铭涛
杨耀晟
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201910642480.4A priority Critical patent/CN110348018A/en
Publication of CN110348018A publication Critical patent/CN110348018A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for completing simple event extraction using partial learning. The method comprises an annotation guideline construction process: according to the three classes under the framework (verb-object, double verb, other), a specific event definition is given. Beneficial effects of the invention: it attempts to solve the missing-label and wrong-label problems present in data acquired by distant supervision, improving the model's recognition performance for named entities.

Description

Method for completing simple event extraction using partial learning
Technical field
The present invention relates to the field of simple event extraction, and in particular to a method for completing simple event extraction using partial learning.
Background art
A simple event is defined as an event in which a verb and its object are directly connected, used to describe a scene, for example: playing basketball, playing soccer, having breakfast, making a phone call. We convert the simple event extraction problem into a named entity recognition problem, identifying instances of predefined event argument classes from sentences.
The named entity recognition task has made good progress through many years of research. The main research challenges at present are: in different domains and different applications, novel entity classes frequently appear, and it is difficult to rapidly build a high-performance system for them. When building a recognition system for a new entity class, annotated corpora are usually needed to train the model, yet it is difficult at that point to formulate detailed and accurate entity annotation guidelines, and annotating data is time-consuming and laborious. In addition, domain adaptation is also a prominent problem: the annotation performance of an entity recognition system drops sharply on text from a new domain.
At present, common entity recognition methods can be roughly divided into: 1) rule- and dictionary-based methods; 2) methods based on traditional machine learning models; 3) methods based on deep learning. On the basis of these three, some systems are also built as hybrids of them.
Existing related art:
1. Data construction:
Expert annotation: the data annotators are experts in the relevant field or the people who formulated the annotation guidelines, thereby obtaining high-quality labeled data.
Crowdsourced annotation: crowdsourcing is a distributed problem-solving and production model in which the data and annotation guidelines are given to laypeople, who annotate after brief training; the labeled data is finally returned to the crowdsourcing publisher. "Traps" are often planted in the process, and rewards are given afterwards according to the laypeople's annotation performance.
Distant supervision: assuming a small amount of manually labeled data and an entity vocabulary are available at the start, the distant supervision method matches the vocabulary against a large unlabeled corpus, and the matched strings are taken as correct annotations.
2. Entity recognition methods based on deep learning:
The most common model is the BiLSTM-CRF model. The model is a chain structure composed of an embedding layer (representing the input characters or words as vectors), a bidirectional LSTM layer (modeling the whole sentence on top of the vector representations to extract hidden representations), a linear layer (the mapping between characters and labels) and a final CRF layer (the mapping between adjacent labels). Experimental results show that BiLSTM-CRF achieves good performance, reaching or exceeding feature-rich CRF models. On the feature side, the model does not need elaborate feature engineering; very good results can be reached using only word vectors and character vectors.
The traditional technology has the following technical problems:
1. Data construction:
1) Expert annotators are generally few in number and annotate slowly; large-scale annotated corpora cannot be obtained, which cannot satisfy practical application demands.
2) Crowdsourced annotators have little experience with the data domain; detailed annotation guidelines must be formulated before annotation, and a period of training is required. Different annotators understand the guidelines and the corpus differently and have different annotation habits, causing a large number of inconsistent or erroneous labels in the annotation results and thus poor-quality labeled data.
Example:
Annotator 1: The packaging is tight, and it was delivered without any bumps.
Annotator 2: The { packaging@EVENT } is tight, and it was delivered without any bumps.
In the context of this sentence, "packaging" does not express a simple event, so this is an example of inconsistent annotation.
3) Distant supervision is limited by the scale and quality of the seed resources already built, and many out-of-vocabulary resources are easily missed. Data construction depends too heavily on the matching criteria and algorithms, so the data obtained by distant supervision has two problems: missing labels and wrong labels.
Example 1: I like Beyond's { No Longer Hesitate@SONG } and Goodbye Ideal. [missing label]
Example 2: I { no longer hesitate@SONG } and went straight to the station. [wrong label]
In Example 1, "Goodbye Ideal" is also a song, but because it is not in the vocabulary its label is missing. In Example 2, "no longer hesitate" is not a song title, so this is a wrong label.
4) The annotation guidelines used for labeling must be closely combined with the actual task and data, and can only be finalized through continuous improvement. At present there are almost no event annotation guidelines oriented to the e-commerce domain.
2. Named entity recognition models based on neural networks:
Neural network models are now widely used in many natural language processing tasks and have made considerable progress compared with traditional models. But they also expose many disadvantages:
1) Data problem: the good results of neural network models are built on the basis of big data; compared with traditional machine learning algorithms, neural networks need more data. The final model performance is largely determined by the data provided, so data quality is particularly important.
2) Interpretability is weak: there are no usable features with which to explain the predicted results.
3) The computational cost is often higher than that of traditional algorithms; as training data grows and network depth increases, more computing resources are needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for completing simple event extraction using partial learning. The event extraction problem is first converted into a named entity recognition problem. Then, based on the rich event resources of the e-commerce domain, a definition of simple events is given, and detailed entity annotation guidelines are produced through continuous iteration according to the actual annotation situation. Using small-scale expert annotation and large-scale crowdsourced annotation, an event resource list is extracted, and the distant supervision method is then used to annotate a large-scale unlabeled corpus. Partial learning is used to try to solve the missing-label and wrong-label problems present in the data acquired by distant supervision, thereby improving the neural-network-based entity recognition model.
In order to solve the above technical problems, the present invention provides a method for completing simple event extraction using partial learning, comprising:
Annotation guideline construction process:
According to the three classes under the framework (verb-object, double verb, other), give a specific event definition.
On this basis, give examples that meet the definition according to the actual corpus, and give notes for the places where ambiguity exists.
Building the guidelines requires continuous iteration and continuous improvement according to the actual situation, finally forming a well-organized, intuitive and clear document.
Distant supervision corpus construction process:
First, obtain the simple event definition and the annotation guidelines.
Recruit annotators and train them according to the guidelines, then obtain a certain scale of manually labeled data; extract the entities in this part of the data and build an entity vocabulary.
Match the entity vocabulary against large-scale unlabeled text to obtain a distant supervision dataset. This part of the data contains a certain amount of noise.
The goal is to use the above two parts of data together as training data and reasonably train a simple event recognition model with good performance.
Recognition model based on BiLSTM-CRF:
The BiLSTM-CRF model treats the recognition task as a sequence labeling task: the model input is a Chinese character sequence and the output is a label sequence. BiLSTM-CRF has achieved good results in named entity recognition tasks. When element annotation is converted to sequence labeling, BIEO labels are used, where B-XX denotes the first character of element XX, E-XX denotes the last character of the element, the other characters of the element are labeled I-XX, and all non-element characters are labeled O.
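The BIEO conversion can be sketched as follows (the span-tuple input format is an assumption for illustration):

```python
def spans_to_bieo(length, spans):
    """Convert element spans (start, end_exclusive, type) into one BIEO tag per character."""
    tags = ["O"] * length                      # non-element characters are all O
    for start, end, typ in spans:
        tags[start] = f"B-{typ}"               # first character of element `typ`
        for k in range(start + 1, end - 1):
            tags[k] = f"I-{typ}"               # other characters of the element
        if end - start > 1:
            tags[end - 1] = f"E-{typ}"         # last character of the element
    return tags

print(spans_to_bieo(5, [(1, 4, "XX")]))  # → ['O', 'B-XX', 'I-XX', 'E-XX', 'O']
```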
In the BiLSTM-CRF model, for an input character sequence, neuron features are first constructed by a bidirectional LSTM; these features are then combined and fed into the CRF layer for label prediction. The whole model is divided into three major parts: 1) word vector representation: the input character string is represented as word vectors, i.e. the discrete input is converted into low-dimensional neuron input; 2) feature extraction: the word vectors are converted into neuron features by the bidirectional LSTM and a linear transformation; 3) entity labeling: the features are fed into the CRF layer, and the labeling module produces the entity labels.
Word vector representation: the discrete input characters are converted into low-dimensional neuron input by a neural representation layer. A look-up table is used, in which the vector representation of each character is stored. The initial values of the vectors can be initialized by random numbers, or pre-trained with a tool on a large unlabeled corpus. During model training, all the values of the vectors are treated as model parameters and optimized together with the other parameters in the iterative process. Given the character sequence of a Chinese sentence, the corresponding word vector representations are obtained by table look-up.
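A minimal look-up-table sketch (the toy vocabulary, dimension, and random initialization are illustrative; in practice the table is a trained parameter matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"我": 0, "打": 1, "篮": 2, "球": 3}   # character → row index (toy vocabulary)
dim = 4
lookup = rng.normal(size=(len(vocab), dim))    # one vector per character; optimized during training

def embed(sentence):
    """Map a character string to its sequence of word vectors by table look-up."""
    return lookup[[vocab[ch] for ch in sentence]]

print(embed("打球").shape)  # → (2, 4)
```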
Feature extraction: based on the input word vector sequence, we extract features o_1, ..., o_n through the bidirectional LSTM and a linear layer; these features will be used by the CRF entity labeling module. LSTM (long short-term memory) is a kind of recurrent neural network that can model natural language sentences well. We splice the features extracted by the bidirectional LSTM over the forward and reverse directions of the sentence to obtain the hidden-layer representation of each character: h_i = [h_i^fw ; h_i^bw].
o_i is calculated by the following equation: o_i = W h_i + b,
where W and b are model parameters. The above formula maps each character onto the labels, and the final sequence is composed of labels from the label set.
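The splice-and-map step can be sketched with numpy standing in for the trained BiLSTM outputs (the shapes are illustrative assumptions):

```python
import numpy as np

def label_scores(h_fwd, h_bwd, W, b):
    """h_i = [h_i^fw ; h_i^bw], then o_i = W h_i + b: one score per label per character."""
    h = np.concatenate([h_fwd, h_bwd], axis=-1)  # splice forward and reverse LSTM features
    return h @ W + b                             # linear layer maps each character onto the label set

T, d, labels = 3, 2, 5           # 3 characters, hidden size 2 per direction, 5 labels
rng = np.random.default_rng(1)
o = label_scores(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                 rng.normal(size=(2 * d, labels)), np.zeros(labels))
print(o.shape)  # → (3, 5)
```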
Entity labeling: finally, a CRF layer is used for decoding, enabling the model to learn the dependencies between labels.
The decoding formula is as follows: the score of a label sequence y for input x is s(x, y) = Σ_i (A_{y_{i-1}, y_i} + o_{i, y_i}), where A is the label transition matrix, and decoding selects y* = argmax_{y ∈ Y_x} s(x, y).
In parameter training, the log-likelihood is used to compute the loss value. The probability of the manually annotated sequence y* is: p(y*|x) = exp(s(x, y*)) / Σ_{y ∈ Y_x} exp(s(x, y)).
The loss value is: loss = -log p(y*|x).
The training optimization objective is to minimize this loss value.
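The CRF objective can be sketched with the standard forward algorithm (numpy, toy shapes; the transition matrix A and emission scores o_i follow the formulas in the text):

```python
import numpy as np

def crf_nll(emissions, transitions, tags):
    """-log p(tags | x) for a linear-chain CRF: loss = log Z - s(x, y*)."""
    T, L = emissions.shape
    # Score of the gold path: sum of emission and transition scores along it.
    gold = emissions[0, tags[0]]
    for t in range(1, T):
        gold += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # Forward algorithm: log Z, the log-sum of scores over all label paths.
    alpha = emissions[0].copy()
    for t in range(1, T):
        scores = alpha[:, None] + transitions + emissions[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))  # stable log-sum-exp per label
    m = alpha.max()
    log_z = m + np.log(np.exp(alpha - m).sum())
    return log_z - gold

# Two labels, one character, all scores equal: p(y*) = 1/2, so the loss is log 2.
print(round(float(crf_nll(np.zeros((1, 2)), np.zeros((2, 2)), [0])), 4))  # → 0.6931
```

Minimizing this quantity over the training set is the optimization objective stated above.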
Annotator based on partial learning:
The basic idea of partial learning is to convert the incompletely annotated sentences in the partially labeled data into multi-path annotated sentences and to improve the above CRF-layer optimization objective accordingly; the BiLSTM-CRF model is used as the basic model.
In one of the embodiments, the probability of a completely annotated sentence is defined as follows: p(y|X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y')),
where X is the input sentence word vector sequence and Y_X denotes the set of all legal paths corresponding to X.
In one of the embodiments, the probability of an incompletely annotated sentence is defined as follows: p(D|X) = Σ_{y ∈ D} p(y|X),
where D denotes the multi-path annotated sequence; that is, the conditional probability of one training example is the sum of the probabilities of all the paths included in the multi-path annotation. Parameter estimation can then be completed in the same way as in the baseline system.
In one of the embodiments, the loss function is defined as follows:
Loss(θ, X, D) = -log p(D|X).
In one of the embodiments, specifically, if a character has no specified label in the partial annotation, the label of that character is marked as UKN, indicating that every label is possible.
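Under the assumption that the CRF scoring above applies, the multi-path objective -log p(D|X) can be sketched as a constrained forward pass, where a UKN position allows every label (numpy; the mask format and toy shapes are illustrative):

```python
import numpy as np

NEG = -1e9  # stands in for -inf on disallowed labels

def logsumexp(a, axis=None):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def partial_nll(emissions, transitions, allowed):
    """-log p(D|X): marginalize over every path consistent with the partial annotation.

    allowed: (T, L) boolean mask; a UKN position allows all L labels (all True).
    """
    T, L = emissions.shape
    mask = np.where(allowed, 0.0, NEG)
    num = emissions[0] + mask[0]      # forward pass restricted to the allowed labels
    den = emissions[0].copy()         # unrestricted forward pass over all paths (log Z)
    for t in range(1, T):
        step = transitions + emissions[t][None, :]
        num = logsumexp(num[:, None] + step, axis=0) + mask[t]
        den = logsumexp(den[:, None] + step, axis=0)
    return float(logsumexp(den) - logsumexp(num))

# One character, two labels, equal scores: restricting to one label gives p(D|X) = 1/2.
print(round(partial_nll(np.zeros((1, 2)), np.zeros((2, 2)),
                        np.array([[True, False]])), 4))  # → 0.6931
```

When every position is UKN the loss is zero (all paths are consistent), so fully unannotated text contributes nothing, as expected of the multi-path objective.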
In one of the embodiments, on this basis, an optimization objective function based on multi-path annotated sentences is designed, so that the partially labeled data can be used efficiently for model training.
In one of the embodiments, the model mainly has two states: training and prediction. During training, labeled data must be input and the model's parameters are constantly updated so that, under the optimization objective, the output annotation results agree with the true values as much as possible. This requires continuous loop iterations to update the parameters so that the loss value in the above formula keeps decreasing, allowing the model to learn better parameters. The other state is prediction: the input during prediction is unlabeled data, and the trained model is used. Parameters are not updated in this process; the output of the model is taken as the final prediction result.
In one of the embodiments:
A computer device comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, the processor implementing the steps of any one of the above methods when executing the program.
A computer-readable storage medium on which a computer program is stored, the program implementing the steps of any one of the above methods when executed by a processor.
A processor for running a program, wherein the program, when run, executes any one of the above methods.
Beneficial effects of the present invention:
The method attempts to solve the missing-label and wrong-label problems present in the data acquired by distant supervision, improving the model's recognition performance for named entities.
Detailed description of the invention
Fig. 1 is a schematic diagram of the BiLSTM-CRF model, widely used in entity recognition tasks, as employed in the method of the present invention for completing simple event extraction using partial learning.
Specific embodiment
The present invention will be further explained below with reference to the attached drawings and specific embodiments, so that those skilled in the art can better understand and practice the present invention; the illustrated embodiments, however, are not a limitation of the invention.
To summarize the purpose of this patent: partial learning is used to reduce the missing-label problem present in the data obtained by distant supervision. To accomplish this, we define simple event annotation guidelines oriented to the e-commerce domain, obtain distant supervision data from crowdsourced annotated data, and add partial learning to the deep-learning-based named entity recognition method to improve event recognition performance.
Annotation guideline construction process:
1. According to the three classes under the framework (verb-object, double verb, other), give a specific event definition.
2. On this basis, give examples that meet the definition according to the actual corpus, and give notes for the places where ambiguity exists.
3. Building the guidelines requires continuous iteration and continuous improvement according to the actual situation, finally forming a well-organized, intuitive and clear document.
Distant supervision corpus construction process:
1. First, obtain the simple event definition and the annotation guidelines.
2. Recruit annotators and train them according to the guidelines, then obtain a certain scale of manually labeled data; extract the entities in this part of the data and build an entity vocabulary.
3. Match the entity vocabulary from step 2 against large-scale unlabeled text to obtain a distant supervision dataset. This part of the data contains a certain amount of noise.
Our goal is to use the above two parts of data together as training data and reasonably train a simple event recognition model with good performance.
Recognition model based on BiLSTM-CRF:
The BiLSTM-CRF model treats the recognition task as a sequence labeling task: the model input is a Chinese character sequence and the output is a label sequence. BiLSTM-CRF has achieved good results in named entity recognition tasks, so we choose BiLSTM-CRF as the benchmark model. BIEO labels are used when element annotation is converted to sequence labeling, where B-XX denotes the first character of element XX, E-XX denotes the last character of the element, the other characters of the element are labeled I-XX, and all non-element characters are labeled O.
In the BiLSTM-CRF model, for an input character sequence, we first construct neuron features by a bidirectional LSTM, then combine these features and feed them into the CRF layer for label prediction. The whole model is divided into three major parts: 1) word vector representation: the input character string is represented as word vectors, i.e. the discrete input is converted into low-dimensional neuron input; 2) feature extraction: the word vectors are converted into neuron features by the bidirectional LSTM and a linear transformation; 3) entity labeling: the features are fed into the CRF layer, and the labeling module produces the entity labels.
Word vector representation: the discrete input characters are converted into low-dimensional neuron input by a neural representation layer. We use a look-up table in which the vector representation of each character is stored. The initial values of the vectors can be initialized by random numbers, or pre-trained with a tool on a large unlabeled corpus. During model training, all the values of the vectors are treated as model parameters and optimized together with the other parameters in the iterative process. Given the character sequence of a Chinese sentence, we obtain the corresponding word vector representations by table look-up.
Feature extraction: based on the input word vector sequence, we extract features o_1, ..., o_n through the bidirectional LSTM and a linear layer; these features will be used by the CRF entity labeling module. LSTM (long short-term memory) is a kind of recurrent neural network that can model natural language sentences well. We splice the features extracted by the bidirectional LSTM over the forward and reverse directions of the sentence to obtain the hidden-layer representation of each character: h_i = [h_i^fw ; h_i^bw].
o_i is calculated by the following equation: o_i = W h_i + b,
where W and b are model parameters. The above formula maps each character onto the labels, and the final sequence is composed of labels from the label set.
Entity labeling: finally, a CRF layer is used for decoding, enabling the model to learn the dependencies between labels.
The decoding formula is as follows: the score of a label sequence y for input x is s(x, y) = Σ_i (A_{y_{i-1}, y_i} + o_{i, y_i}), where A is the label transition matrix, and decoding selects y* = argmax_{y ∈ Y_x} s(x, y).
In parameter training, we use the log-likelihood to compute the loss value. The probability of the manually annotated sequence y* is: p(y*|x) = exp(s(x, y*)) / Σ_{y ∈ Y_x} exp(s(x, y)).
The loss value is: loss = -log p(y*|x).
The training optimization objective is to minimize this loss value.
Annotator based on partial learning:
The basic idea of partial learning is to convert the incompletely annotated sentences in the partially labeled data into multi-path annotated sentences and to improve the above CRF-layer optimization objective. Specifically, if a character has no specified label in the partial annotation, the label of that character is marked as UKN, indicating that every label is possible. On this basis, we design an optimization objective function based on multi-path annotated sentences, so that the partially labeled data can be used efficiently for model training. We use the BiLSTM-CRF model described in the previous section as the basic model. The probability of a completely annotated sentence is defined as follows: p(y|X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y')),
where X is the input sentence word vector sequence and Y_X denotes the set of all legal paths corresponding to X. The probability of an incompletely annotated sentence is defined as follows: p(D|X) = Σ_{y ∈ D} p(y|X),
where D denotes the multi-path annotated sequence; that is, the conditional probability of one training example is the sum of the probabilities of all the paths included in the multi-path annotation. Parameter estimation can then be completed in the same way as in the baseline system, where the loss function is defined as follows:
Loss(θ, X, D) = -log p(D|X)
Some supplements to the above scheme:
The model mainly has two states: training and prediction. During training, labeled data must be input and the model's parameters are constantly updated. Under our optimization objective, the output annotation results should agree with the true values as much as possible. This requires continuous loop iterations to update the parameters so that the loss value in the above formula keeps decreasing, allowing the model to learn better parameters. The other state is prediction: the input during prediction is unlabeled data, and the trained model is used. Parameters are not updated in this process; the output of the model is taken as the final prediction result.
The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within the protection scope of the present invention. The protection scope of the present invention is defined by the claims.

Claims (10)

1. A method for completing simple event extraction using partial learning, characterized by comprising:
an annotation guideline construction process:
according to the three classes under the framework (verb-object, double verb, other), giving a specific event definition;
on this basis, giving examples that meet the definition according to the actual corpus, and giving notes for the places where ambiguity exists;
building the guidelines through continuous iteration and continuous improvement according to the actual situation, finally forming a well-organized, intuitive and clear document;
a distant supervision corpus construction process:
first obtaining the simple event definition and the annotation guidelines;
recruiting annotators and training them according to the guidelines, then obtaining a certain scale of manually labeled data, extracting the entities in this part of the data, and building an entity vocabulary;
matching the entity vocabulary against large-scale unlabeled text to obtain a distant supervision dataset, this part of the data containing a certain amount of noise;
the goal being to use the above two parts of data together as training data and reasonably train a simple event recognition model with good performance;
a recognition model based on BiLSTM-CRF:
the BiLSTM-CRF model treats the recognition task as a sequence labeling task: the model input is a Chinese character sequence and the output is a label sequence; BiLSTM-CRF has achieved good results in named entity recognition tasks; BIEO labels are used when element annotation is converted to sequence labeling, where B-XX denotes the first character of element XX, E-XX denotes the last character of the element, the other characters of the element are labeled I-XX, and all non-element characters are labeled O;
in the BiLSTM-CRF model, for an input character sequence, neuron features are first constructed by a bidirectional LSTM, and these features are then combined and fed into the CRF layer for label prediction; the whole model is divided into three major parts: 1) word vector representation: the input character string is represented as word vectors, i.e. the discrete input is converted into low-dimensional neuron input; 2) feature extraction: the word vectors are converted into neuron features by the bidirectional LSTM and a linear transformation; 3) entity labeling: the features are fed into the CRF layer, and the labeling module produces the entity labels;
word vector representation: the discrete input characters are converted into low-dimensional neuron input by a neural representation layer; a look-up table is used, in which the vector representation of each character is stored; the initial values of the vectors can be initialized by random numbers, or pre-trained with a tool on a large unlabeled corpus; during model training, all the values of the vectors are treated as model parameters and optimized together with the other parameters in the iterative process; given the character sequence of a Chinese sentence, the corresponding word vector representations are obtained by table look-up;
feature extraction: based on the input word vector sequence, features o_1, ..., o_n are extracted through the bidirectional LSTM and a linear layer; these features will be used by the CRF entity labeling module; LSTM (long short-term memory) is a kind of recurrent neural network that can model natural language sentences well; the features extracted by the bidirectional LSTM over the forward and reverse directions of the sentence are spliced to obtain the hidden-layer representation of each character: h_i = [h_i^fw ; h_i^bw];
o_i is calculated by the following equation: o_i = W h_i + b,
where W and b are model parameters; the above formula maps each character onto the labels, and the final sequence is composed of labels from the label set;
entity labeling: finally, a CRF layer is used for decoding, enabling the model to learn the dependencies between labels;
the decoding formula is as follows: the score of a label sequence y for input x is s(x, y) = Σ_i (A_{y_{i-1}, y_i} + o_{i, y_i}), where A is the label transition matrix, and decoding selects y* = argmax_{y ∈ Y_x} s(x, y);
in parameter training, the log-likelihood is used to compute the loss value; the probability of the manually annotated sequence y* is: p(y*|x) = exp(s(x, y*)) / Σ_{y ∈ Y_x} exp(s(x, y));
the loss value is: loss = -log p(y*|x);
the training optimization objective is to minimize this loss value;
an annotator based on partial learning:
the basic idea of partial learning being to convert the incompletely annotated sentences in the partially labeled data into multi-path annotated sentences and to improve the above CRF-layer optimization objective, with the BiLSTM-CRF model used as the basic model.
2. The method for completing simple event extraction using partial learning according to claim 1, characterized in that the probability of a completely annotated sentence is defined as follows: p(y|X) = exp(s(X, y)) / Σ_{y' ∈ Y_X} exp(s(X, y')),
where X is the input sentence word vector sequence and Y_X denotes the set of all legal paths corresponding to X.
3. The method for completing simple event extraction using partial learning according to claim 1, characterized in that the probability of an incompletely annotated sentence is defined as follows: p(D|X) = Σ_{y ∈ D} p(y|X),
where D denotes the multi-path annotated sequence; that is, the conditional probability of one training example is the sum of the probabilities of all the paths included in the multi-path annotation; parameter estimation can then be completed in the same way as in the baseline system.
4. The method for completing simple event extraction using partial learning according to claim 1, characterized in that the loss function is defined as follows:
Loss(θ, X, D) = -log p(D|X).
5. The method for completing simple event extraction using partial learning according to claim 1, characterized in that, specifically, if a character has no specified label in the partial annotation, the label of that character is marked as UKN, indicating that every label is possible.
6. The method for completing simple event extraction using partial learning according to claim 1, characterized in that, on this basis, an optimization objective function based on multi-path annotated sentences is designed, so that the partially labeled data can be used efficiently for model training.
7. The method for completing simple event extraction using partial learning according to claim 1, characterized in that the model has two main states: training and prediction. During training, annotated data must be input and the model continually updates its parameters; under the optimization objective, the output annotations are made as consistent as possible with the ground truth. This requires updating the parameters through repeated loop iterations so that the loss value in the above formula keeps decreasing and the model can learn better parameters. The other state is prediction, in which the input is unannotated data and the trained model is used; no parameters are updated in this process, and the model's output is taken as the final prediction result.
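The two states can be illustrated with a deliberately tiny stand-in model (one parameter, squared loss), just to show the pattern of iterative parameter updates during training versus frozen parameters at prediction; this is not the patented BiLSTM-CRF model:

```python
def train(data, lr=0.1, epochs=100):
    # training state: loop repeatedly over the annotated data and
    # update the parameter so the loss (w*x - y)^2 keeps decreasing
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient step
    return w

def predict(w, xs):
    # prediction state: parameters are frozen; the model output
    # is taken directly as the final result
    return [w * x for x in xs]

data = [(1.0, 2.0), (2.0, 4.0)]  # invented data following y = 2x
w = train(data)
preds = predict(w, [3.0])
```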
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the method of any one of claims 1 to 7.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, executes the method of any one of claims 1 to 7.
CN201910642480.4A 2019-07-16 2019-07-16 The method for completing simple event extraction using part study Pending CN110348018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910642480.4A CN110348018A (en) 2019-07-16 2019-07-16 The method for completing simple event extraction using part study

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910642480.4A CN110348018A (en) 2019-07-16 2019-07-16 The method for completing simple event extraction using part study

Publications (1)

Publication Number Publication Date
CN110348018A true CN110348018A (en) 2019-10-18

Family

ID=68174810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910642480.4A Pending CN110348018A (en) 2019-07-16 2019-07-16 The method for completing simple event extraction using part study

Country Status (1)

Country Link
CN (1) CN110348018A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699689A (en) * 2014-01-09 2014-04-02 百度在线网络技术(北京)有限公司 Method and device for establishing event repository
CN104572958A (en) * 2014-12-29 2015-04-29 中国科学院计算机网络信息中心 Event extraction based sensitive information monitoring method
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
CN107797993A (en) * 2017-11-13 2018-03-13 成都蓝景信息技术有限公司 A kind of event extraction method based on sequence labelling
CN108628970A (en) * 2018-04-17 2018-10-09 大连理工大学 A kind of biomedical event joint abstracting method based on new marking mode
CN110633409A (en) * 2018-06-20 2019-12-31 上海财经大学 Rule and deep learning fused automobile news event extraction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YAOSHENG YANG et al.: "Distantly Supervised NER with Partial Annotation Learning and Reinforcement Learning", Proceedings of the 27th International Conference on Computational Linguistics *
MENG, YANHUA: "Event Construction and a Study of Resultative-Object Sentences in Modern Chinese", Beijing: Beijing Language and Culture University Press, 30 September 2016 *
SHEN, SI et al.: "Research on an automatic entity extraction model for food safety events based on deep learning", Computer Engineering Applications Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611802A (en) * 2020-05-21 2020-09-01 苏州大学 Multi-field entity identification method
CN111611802B (en) * 2020-05-21 2021-08-31 苏州大学 Multi-field entity identification method

Similar Documents

Publication Publication Date Title
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
Agirre et al. Multilingual central repository version 3.0: upgrading a very large lexical knowledge base
CN110377903A (en) A kind of Sentence-level entity and relationship combine abstracting method
CN110459282A (en) Sequence labelling model training method, electronic health record processing method and relevant apparatus
CN107748757A (en) A kind of answering method of knowledge based collection of illustrative plates
CN108228564B (en) Named entity recognition method, device and readable storage medium for counterlearning on crowdsourced data
CN112800239B (en) Training method of intention recognition model, and intention recognition method and device
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN110008467A (en) A kind of interdependent syntactic analysis method of Burmese based on transfer learning
CN108021557A (en) Irregular entity recognition method based on deep learning
CN111611802B (en) Multi-field entity identification method
CN110427478A (en) A kind of the question and answer searching method and system of knowledge based map
CN110647620A (en) Knowledge graph representation learning method based on confidence hyperplane and dictionary information
Wang et al. Fg-t2m: Fine-grained text-driven human motion generation via diffusion model
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
Cao et al. Research progress of zero-shot learning beyond computer vision
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
Su et al. CSS-LM: A contrastive framework for semi-supervised fine-tuning of pre-trained language models
CN110348018A (en) The method for completing simple event extraction using part study
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN117521792A (en) Knowledge graph construction method based on man-machine cooperation type information extraction labeling tool
Yan et al. Grape diseases and pests named entity recognition based on BiLSTM-CRF
US11710168B2 (en) System and method for scalable tag learning in e-commerce via lifelong learning
Huang et al. Prompt-based self-training framework for few-shot named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191018