CN110968661A - Event extraction method and system, computer readable storage medium and electronic device - Google Patents
- Publication number
- CN110968661A (application CN202010141127.0A)
- Authority
- CN
- China
- Prior art keywords
- event
- model
- labeling
- event extraction
- labeled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an event extraction method and system, a computer-readable storage medium, and an electronic device, wherein the method comprises the following steps: step 1, manually annotating the prepared corpus, including labeling each element of an event and labeling the event-characterizing phrase that triggers the event; step 2, performing sequence labeling on the annotated corpus, wherein the event-characterizing phrases are labeled with their event types; step 3, inputting the data processed in step 2 into a model for training to obtain an event extraction model; and step 4, inputting the text to be extracted into the event extraction model and outputting a recognition result. The invention provides a novel event extraction method that reduces work originally completed by two models to work that a single model can complete, greatly saving resources and time.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular to an event extraction method and system, a computer-readable storage medium, and an electronic device capable of reducing the computing resources required for event extraction.
Background
In the field of knowledge graphs, an event is an occurrence or a change of state, involving one or more actions by one or more participants, that takes place at a specific time point or time period and within a specific geographical area. Event extraction refers to extracting the event information a user is interested in from natural language text and presenting it in a structured form: what person or organization did what, when, and where.
BERT (Bidirectional Encoder Representations from Transformers) is a large-scale pre-trained language model released by Google, based on bidirectional Transformers. It extracts text features efficiently and can be applied to a wide range of NLP tasks; on release, the pre-trained model set new state-of-the-art records on 11 NLP tasks. The model can be understood as a language encoder that converts an input sentence or paragraph into feature vectors.
Currently, event extraction is generally performed in the following stages: first, a training corpus is prepared; second, a text classification model judges whether the corpus (at document or sentence level) contains an event, and of which type, i.e., event classification; finally, an event element extraction model extracts the event elements from the event-bearing corpus, completing the whole extraction process.
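The conventional two-stage pipeline described above can be sketched as follows; `classify_event` and `extract_elements` are hypothetical stand-ins for the two separately trained models, not code from the patent:

```python
# Sketch of the conventional two-model event extraction pipeline.
# Both stages would normally be separate deep models (e.g. two BERT
# instances); trivial rule-based stand-ins are used here for illustration.

def classify_event(text):
    # Stage 1: event classification — does the text contain an event,
    # and of which type? (hypothetical keyword rule)
    return "FRAUD" if "inflated" in text else None

def extract_elements(text, event_type):
    # Stage 2: event element extraction, run only on event-bearing text.
    return {"type": event_type, "subject": text.split()[0]}

def two_stage_extract(text):
    event_type = classify_event(text)          # first model
    if event_type is None:
        return None                            # no event found
    return extract_elements(text, event_type)  # second model

print(two_stage_extract("AA inflated total profit by 80 million yuan"))
# -> {'type': 'FRAUD', 'subject': 'AA'}
```

Because each stage is a full-size model, both must be resident in memory at once; this duplicated cost is what the invention targets.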
As the above description shows, two models are required to complete event extraction: an event classification model and an event element extraction model. Deep learning models generally require large amounts of computing resources, such as CPU time, memory, GPU time, and GPU video memory; in particular, pre-trained language models such as BERT are known to consume large amounts of video memory. At present, performing an event extraction task requires running two BERT models simultaneously, consuming substantial resources and time.
Disclosure of Invention
The present invention aims to overcome the drawback of the prior art that event extraction requires large amounts of resources and time, and provides an event extraction method and system, a computer-readable storage medium, and an electronic device that reduce the computing resources and time consumed by event extraction.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
in one aspect, an embodiment of the present invention provides an event extraction method, including the following steps:
step 1, manually annotating the prepared text corpus, including labeling each element of an event and labeling the event-characterizing phrase that triggers the event;
step 2, performing sequence labeling on the annotated corpus, wherein the event-characterizing phrases are labeled with their event types;
step 3, inputting the data processed in step 2 into a model for training to obtain an event extraction model;
and step 4, inputting the text to be extracted into the event extraction model and outputting a recognition result.
In this method, the event elements and the event-characterizing phrases are annotated together, and sequence labeling is then applied; during sequence labeling, however, the event-characterizing phrases are not given ordinary element tags but event-type tags. An event extraction model trained on data processed in this way can identify the event type and extract the event elements simultaneously, so a single model accomplishes the tasks that required two models in the prior art, greatly reducing both computing resource consumption and event extraction time.
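A minimal illustration of this idea (tag names are illustrative, not taken verbatim from the patent): merging the element tags and the event-type tags into one label space lets a single sequence labeler predict both in one pass.

```python
# One unified tag set: BIEO-style element tags plus one trigger tag per
# event type. A single sequence-labeling model over this label space
# predicts event elements and the event type simultaneously.

ELEMENT_TAGS = ["B_Sub", "I", "E", "B_TIME", "I_TIME", "E_TIME", "O"]
EVENT_TYPE_TAGS = ["E_FRAUD", "E_EVADE"]  # one tag per event type
LABEL_SPACE = ELEMENT_TAGS + EVENT_TYPE_TAGS

def decode(tokens, tags):
    """Recover (event type, element starts) from one predicted tag sequence."""
    event_type = next((t for t in tags if t in EVENT_TYPE_TAGS), None)
    elements = [tok for tok, tag in zip(tokens, tags) if tag.startswith("B_")]
    return event_type, elements

print(decode(["AA", "inflated", "profit"], ["B_Sub", "E_FRAUD", "E_FRAUD"]))
# -> ('E_FRAUD', ['AA'])
```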
In step 2, a BIEO sequence labeling method is adopted to label the annotated corpus.
The model in step 3 is a BERT model or a BiLSTM+CRF model.
In step 4, when the recognition result output by the event extraction model contains two or more event types, the first event type prevails.
On the other hand, an embodiment of the present invention further provides an event extraction system, including:
the corpus annotation module is used for annotating the prepared text corpus in two passes, wherein the first pass comprises labeling each element of an event and labeling the event-characterizing phrase that triggers the event, and the second pass performs sequence labeling on the corpus from the first pass, with the event-characterizing phrases labeled by their event types;
the model training module is used for inputting the twice-annotated data into a model for training to obtain an event extraction model;
and the event extraction model is used for performing event element recognition and event type recognition on the input text to be extracted and outputting a recognition result.
The corpus annotation module labels the once-annotated corpus using a BIEO sequence labeling method.
The model used in the model training module is a BERT model or a BiLSTM+CRF model.
In still another aspect, the present invention also provides a computer-readable storage medium including computer-readable instructions, which, when executed, cause a processor to perform the operations of the method described in the present invention.
In another aspect, an embodiment of the present invention also provides an electronic device, including: a memory storing program instructions; and the processor is connected with the memory and executes the program instructions in the memory to realize the steps of the method in the embodiment of the invention.
Compared with the prior art, the present invention offers a novel event extraction method in which events can be extracted by a single model; relative to approaches that need two models, this greatly reduces resource consumption and speeds up event extraction. In addition, comparative tests show that the method also noticeably improves the accuracy of the extraction results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of an event extraction method described in the embodiment.
FIG. 2 is a diagram illustrating corpora labeled manually in the embodiment.
Fig. 3 is a schematic diagram of an event extraction system according to an embodiment.
Fig. 4 is a block diagram showing the components of the electronic apparatus described in the embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present embodiment schematically provides an event extraction method, including the following steps:
Step 1: manually annotate the prepared text corpus, including labeling each element of an event and labeling the event-characterizing phrase that triggers the event.
Step 2: process the annotated corpus into structured data, i.e., perform sequence labeling, wherein the event-characterizing phrases in the corpus are labeled with their event types.
BIEO is a widely used sequence labeling scheme, so it is adopted in this step for the data conversion. BIEO is a common tag representation for sequence labeling, used mainly in named entity recognition (NER). Sequence labeling, in brief, means: given a sequence, assign a tag to each element of the sequence. Usually the sequence is a sentence and the elements are the characters (or words) of the sentence. B = begin, I = intermediate, and E = end mark the beginning, middle, and end of an entity span respectively, while O = other marks everything that matches none of the three. For example, to annotate an address in a sentence, the characters of the address are tagged B, I, …, E in order, and every other character is tagged O.
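As a sketch, character-level BIEO tagging of a single span can be implemented as follows (the sentence and span are illustrative, not the patent's original example):

```python
def bieo_tags(sentence, span):
    """Tag each character of `sentence`: B/I/E inside `span`, O elsewhere."""
    tags = ["O"] * len(sentence)
    start = sentence.find(span)
    if start >= 0 and len(span) >= 2:
        tags[start] = "B"                                  # begin
        tags[start + len(span) - 1] = "E"                  # end
        for i in range(start + 1, start + len(span) - 1):
            tags[i] = "I"                                  # intermediate
    return tags

# Tagging the address "Paris" in a sample sentence:
print(bieo_tags("in Paris.", "Paris"))
# -> ['O', 'O', 'O', 'B', 'I', 'I', 'I', 'E', 'O']
```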
However, in this step not all data in the corpus is BIEO-converted: the event-characterizing phrases are instead converted in an EVENT_type manner. For example, for the FRAUD event type, the phrase is converted into "EVENT_FRAUD" or "E_FRAUD" tags.
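A sketch of this conversion, assuming character-level tags as in the worked example later in the document (the helper name and sentence are hypothetical):

```python
def event_type_tags(sentence, trigger, event_type):
    """Tag every character of the trigger phrase with E_<TYPE>, others O.

    Event elements would additionally receive BIEO tags; they are omitted
    here to show only the EVENT_type conversion."""
    tags = ["O"] * len(sentence)
    start = sentence.find(trigger)
    if start >= 0:
        for i in range(start, start + len(trigger)):
            tags[i] = f"E_{event_type}"
    return tags

print(event_type_tags("AA faked deals", "faked deals", "FRAUD"))
# -> ['O', 'O', 'O'] followed by 'E_FRAUD' eleven times
```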
Step 3: input the processed structured data into a model for training to obtain an event extraction model. If the tag the model predicts for the first event-characterizing phrase matches the annotated corpus, the event type prediction is considered correct.
Any model usable for sequence labeling can be chosen here; for example, a BERT model or a BiLSTM+CRF model can be used, these two currently giving the best results.
Step 4: input the text to be extracted into the trained event extraction model for prediction. The model identifies the event elements of the input according to tag patterns such as BE, BIE, BIIE, B and the like, and identifies the event type according to the EVENT_type tags.
During testing, the parameters of the trained model may be adjusted using the test samples; in practical application, the document to be extracted is input directly into the model and the recognition result is output.
In this way, the two original models are reduced to a single model that accomplishes event extraction.
Test examples
Referring to fig. 2, take as an example the text corpus "AA Technology inflated its total profit by 80 million yuan by means of fictitious customers and forged contracts, inflated its bank deposits by about 2.18 million yuan, and inflated its prepaid engineering funds by 31,000 yuan, leading to false records in its 2015 annual report".
First, annotate manually: label each element of the event, e.g., "AA Technology" as the event subject and "2015" as the time of occurrence; at the same time, label "inflated its total profit by 80 million yuan by means of forged contracts and the like" as the event-characterizing phrase, with the event type "financial fraud".
The manually annotated text corpus is then processed into structured data, giving "O|O|B_Sub|I|E|O|…|O|E_FRAUD|E_FRAUD|…|E_FRAUD|O|…|B_TIME|I_TIME|…|E_TIME|…|O". Each character (including punctuation) corresponds to one tag, and "|" is the separator. For example, "AA Technology" corresponds to "B_Sub|I|E", the event-characterizing phrase corresponds to a run of "E_FRAUD" tags (one per character), and "2015" corresponds to "B_TIME|I_TIME|…|E_TIME".
It should be noted that although all event elements are converted, different entity types receive different tag representations. Likewise, different event types use different tags, such as E_FRAUD for the fraud type and E_EVADE for the evasion type.
Finally, the structured data obtained by this processing is input into a model and trained to obtain the event extraction model.
Once the trained event extraction model is obtained, event extraction is performed on the text.
The event extraction model takes the event type with the largest output probability. Tags are assigned per character: assuming the event-characterizing phrase has 10 characters, the ideal output is "E_1" repeated ten times, i.e., "E_1|E_1|E_1|E_1|E_1|E_1|E_1|E_1|E_1|E_1". In practice the model may not achieve this perfect labeling, and several event tags may appear in one output, in which case the first one prevails. For example, given the output "E_1|O|O|E_2|O|O|O|O", the first tag E_1 precedes the E_2 in the fifth position, so the result is taken to be an E_1 event. Experiments using the BERT model show that this rule works well.
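The first-tag rule described above can be sketched as:

```python
def predicted_event_type(tags, type_tags=("E_1", "E_2")):
    """Return the first event-type tag in the model output; when several
    event tags appear, the first prevails (None if no event tag)."""
    for tag in tags:
        if tag in type_tags:
            return tag
    return None

# Mixed output as in the example above: the first event tag wins.
print(predicted_event_type(["E_1", "O", "O", "E_2", "O", "O", "O", "O"]))
# -> E_1
```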
In addition, testing shows that the invention not only greatly improves event extraction efficiency but also improves the accuracy of the extraction results. As shown in the table below, on the same test data set the accuracy of event extraction using the method of the invention is significantly higher than that of the conventional method.
| | Method of the invention | Conventional method |
|---|---|---|
| Event classification average F1 | 85% | 76% |
| Event element extraction average F1 | 55.64% | 53.33% |
Note: the F1 value, F1 = 2 × precision × recall / (precision + recall), is used to evaluate the merits of the different algorithms.
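For reference, the F1 value in the note above is the standard harmonic mean of precision and recall (this code is illustrative, not from the patent):

```python
def f1_score(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.8 and recall 0.6:
print(round(f1_score(0.8, 0.6), 4))  # -> 0.6857
```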
Referring to fig. 3, an event extraction system is also provided in the embodiment of the present invention, and includes a corpus tagging module, a model training module, and an event extraction model.
The corpus annotation module annotates the prepared text corpus in two passes. The first pass comprises labeling each element of an event and labeling the event-characterizing phrase that triggers the event; the second pass performs sequence labeling on the corpus from the first pass, with the event-characterizing phrases labeled by their event types. In the second pass, the event-characterizing phrases receive event-type tags such as "EVENT_<event type>" or "E_<event type>", while the characters outside the event-characterizing phrases (including Chinese characters, punctuation marks, English letters, and digits) receive sequence tags, e.g., BIEO tags.
The model training module inputs the twice-annotated data into a model for training, yielding the event extraction model. The model used here is a BERT model or a BiLSTM+CRF model, preferably BERT.
The event extraction model performs event element recognition and event type recognition on the input text to be extracted and outputs a recognition result. The model takes the event type with the largest output probability; when the output recognition result contains two or more event types, the first event type prevails.
The event extraction system is based on the same inventive concept as the aforementioned event extraction method; for anything not covered in the description of the system, refer to the description of the method, and vice versa.
As shown in fig. 4, the present embodiment also provides an electronic device, which may include a processor 51 and a memory 52, wherein the memory 52 is coupled to the processor 51. It is noted that the figure is exemplary and that other types of structures may be used in addition to or in place of the structure to implement text labeling, data conversion, communication, or other functionality.
As shown in fig. 4, the electronic device may further include: an input unit 53, a display unit 54, and a power supply 55. It is to be noted that the electronic device does not necessarily have to comprise all the components shown in fig. 4. Furthermore, the electronic device may also comprise components not shown in fig. 4, reference being made to the prior art.
The processor 51, also sometimes referred to as a controller or operational control, may comprise a microprocessor or other processor device and/or logic device, the processor 51 receiving input and controlling operation of the various components of the electronic device.
The memory 52 may be one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable devices, and may store the configuration information of the processor 51, the instructions executed by the processor 51, the recorded table data, and other information. The processor 51 may execute a program stored in the memory 52 to realize information storage or processing, or the like. In one embodiment, a buffer memory, i.e., a buffer, is also included in the memory 52 to store the intermediate information.
The input unit 53 is for example used to provide the processor 51 with text data to be annotated. The display unit 54 is used for displaying various results in the process, such as input text data, the converted multi-dimensional vector, the calculated distance value, etc., and may be, for example, an LCD display, but the present invention is not limited thereto. The power supply 55 is used to provide power to the electronic device.
Embodiments of the present invention further provide computer-readable instructions which, when executed in an electronic device, cause the electronic device to execute the operation steps of the method of the present invention.
Embodiments of the present invention further provide a storage medium storing computer-readable instructions, where the computer-readable instructions cause an electronic device to execute the operation steps included in the method of the present invention.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
1. An event extraction method, comprising the steps of:
step 1, manually annotating the prepared text corpus, including labeling each element of an event and labeling the event-characterizing phrase that triggers the event;
step 2, performing sequence labeling on the annotated corpus, wherein the event-characterizing phrases are labeled with their event types;
step 3, inputting the data processed in step 2 into a model for training to obtain an event extraction model;
and step 4, inputting the text to be extracted into the event extraction model and outputting a recognition result.
2. The event extraction method according to claim 1, wherein in step 2 a BIEO sequence labeling method is adopted to label the annotated corpus.
3. The event extraction method according to claim 1, wherein the model in step 3 is a BERT model or a BiLSTM+CRF model.
4. The event extraction method according to claim 1, wherein in step 4, when the recognition result output by the event extraction model contains two or more event types, the first event type prevails.
5. An event extraction system, comprising:
the corpus annotation module is used for annotating the prepared text corpus in two passes, wherein the first pass comprises labeling each element of an event and labeling the event-characterizing phrase that triggers the event, and the second pass performs sequence labeling on the corpus from the first pass, with the event-characterizing phrases labeled by their event types;
the model training module is used for inputting the twice-annotated data into a model for training to obtain an event extraction model;
and the event extraction model is used for performing event element recognition and event type recognition on the input text to be extracted and outputting a recognition result.
6. The event extraction system according to claim 5, wherein the corpus annotation module labels the once-annotated corpus using a BIEO sequence labeling method.
7. The event extraction system according to claim 5, wherein the model used in the model training module is a BERT model or a BiLSTM+CRF model.
8. A computer readable storage medium comprising computer readable instructions that, when executed, cause a processor to perform the operations of the method of any of claims 1-4.
9. An electronic device, comprising:
a memory storing program instructions;
a processor coupled to the memory and executing the program instructions in the memory to implement the steps of the method of any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010141127.0A CN110968661A (en) | 2020-03-04 | 2020-03-04 | Event extraction method and system, computer readable storage medium and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010141127.0A CN110968661A (en) | 2020-03-04 | 2020-03-04 | Event extraction method and system, computer readable storage medium and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110968661A true CN110968661A (en) | 2020-04-07 |
Family
ID=70038222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010141127.0A Pending CN110968661A (en) | 2020-03-04 | 2020-03-04 | Event extraction method and system, computer readable storage medium and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110968661A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931481A (en) * | 2020-07-03 | 2020-11-13 | 北京新联财通咨询有限公司 | Text emotion recognition method and device, storage medium and computer equipment |
CN112800762A (en) * | 2021-01-25 | 2021-05-14 | 上海犀语科技有限公司 | Element content extraction method for processing text with format style |
CN113342935A (en) * | 2021-06-04 | 2021-09-03 | 北京捷通华声科技股份有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582949A (en) * | 2018-09-14 | 2019-04-05 | 阿里巴巴集团控股有限公司 | Event element abstracting method, calculates equipment and storage medium at device |
CN109670172A (en) * | 2018-12-06 | 2019-04-23 | 桂林电子科技大学 | A kind of scenic spot anomalous event abstracting method based on complex neural network |
US20190278754A1 (en) * | 2012-05-18 | 2019-09-12 | Splunk Inc. | Query handling for field searchable raw machine data using a field searchable datastore and an inverted index |
CN110297913A (en) * | 2019-06-12 | 2019-10-01 | 中电科大数据研究院有限公司 | A kind of electronic government documents entity abstracting method |
CN110489514A (en) * | 2019-07-23 | 2019-11-22 | 成都数联铭品科技有限公司 | Promote system and method, the event extraction method and system of event extraction annotating efficiency |
CN110633409A (en) * | 2018-06-20 | 2019-12-31 | 上海财经大学 | Rule and deep learning fused automobile news event extraction method |
CN110765774A (en) * | 2019-10-08 | 2020-02-07 | 北京三快在线科技有限公司 | Training method and device of information extraction model and information extraction method and device |
- 2020-03-04: Application CN202010141127.0A filed in China (CN); status Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931481A (en) * | 2020-07-03 | 2020-11-13 | 北京新联财通咨询有限公司 | Text emotion recognition method and device, storage medium and computer equipment |
CN112800762A (en) * | 2021-01-25 | 2021-05-14 | 上海犀语科技有限公司 | Element content extraction method for processing text with formatting styles |
CN113342935A (en) * | 2021-06-04 | 2021-09-03 | 北京捷通华声科技股份有限公司 | Semantic recognition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107729309B (en) | Deep learning-based Chinese semantic analysis method and device | |
CN110765763A (en) | Error correction method and device for speech recognition text, computer equipment and storage medium | |
CN110597997B (en) | Military scenario text event extraction corpus iterative construction method and device | |
CN110580308B (en) | Information auditing method and device, electronic equipment and storage medium | |
CN110968661A (en) | Event extraction method and system, computer readable storage medium and electronic device | |
US20220147814A1 (en) | Task specific processing of regulatory content | |
CN110502742B (en) | Complex entity extraction method, device, medium and system | |
CN111062217A (en) | Language information processing method and device, storage medium and electronic equipment | |
CN111339260A (en) | Fine-grained sentiment analysis method based on BERT and a question-answering (QA) approach | |
CN111274829A (en) | Sequence labeling method using cross-language information | |
CN113204967A (en) | Resume named entity identification method and system | |
CN112597366A (en) | Encoder-Decoder-based event extraction method | |
CN115759119A (en) | Financial text sentiment analysis method, system, medium and equipment | |
CN109635289B (en) | Entry classification method and audit information extraction method | |
Khorjuvenkar et al. | Parts of speech tagging for Konkani language | |
CN113553853B (en) | Named entity recognition method and device, computer equipment and storage medium | |
CN115688703A (en) | Specific field text error correction method, storage medium and device | |
Lai et al. | Web information extraction based on hidden Markov model | |
CN113688233A (en) | Text understanding method for semantic search of knowledge graph | |
CN112133308A (en) | Method and device for multi-label classification of speech recognition text | |
CN111341404A (en) | Electronic medical record data set analysis method and system based on the ERNIE model | |
CN111259650A (en) | Automatic text generation method based on a label-sequence generative adversarial model | |
CN117494806B (en) | Relation extraction method, system and medium based on knowledge graph and large language model | |
CN115618968B (en) | New idea discovery method and device, electronic device and storage medium | |
CN111783471B (en) | Semantic recognition method, device, equipment and storage medium for natural language |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200407 |