CN112541341A - Text event element extraction method - Google Patents
- Publication number
- CN112541341A CN112541341A CN202011510822.6A CN202011510822A CN112541341A CN 112541341 A CN112541341 A CN 112541341A CN 202011510822 A CN202011510822 A CN 202011510822A CN 112541341 A CN112541341 A CN 112541341A
- Authority
- CN
- China
- Prior art keywords
- event
- bert model
- sequence
- text
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a text event element extraction method in the field of computer technology. The text is input into a trained first sequence-labeling BERT model to obtain a plurality of trigger words; the trigger words, together with the text in which they occur, are then input into a trained second sequence-labeling BERT model to obtain the event elements corresponding to each trigger word and to generate an event element set. The method improves both the applicability and the accuracy of event extraction.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a text event element extraction method.
Background
Event element extraction is one of the basic tasks in natural language processing and an important subtask of information extraction. It aims to identify the events that occur in a piece of text and the elements of each event — for example, the trigger word together with event elements such as the event subject, the event object, the time, and the place.
Existing event element extraction schemes mainly rely on hand-crafted trigger words and machine-learning classifiers, casting the event element extraction process as a classification problem.
Disclosure of Invention
In order to overcome the defects of the prior art, an embodiment of the present invention provides a text event element extraction method comprising the following steps:
inputting the text into a trained first sequence-labeling BERT model to obtain a plurality of trigger words;
and inputting the plurality of trigger words, together with the text in which they occur, into a trained second sequence-labeling BERT model to obtain the event elements corresponding to the plurality of trigger words and generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
Preferably, after generating the set of event elements, the method further comprises:
obtaining the syntactic dependency relationship between each trigger word and each event element using a Language Technology Platform (LTP) model;
and determining, according to the syntactic dependency relationship, whether each event element is correct.
Preferably, determining whether each event element is correct according to the syntactic dependency relationship includes:
when the syntactic dependency relationship shows that the element is the subject of a subject-predicate relation, manually verifying whether the corresponding event subject in the event element set is indeed the event subject in the text, and filtering out the event element if not.
Preferably, determining whether each event element is correct according to the syntactic dependency relationship further includes:
when the syntactic dependency relationship shows that the element is the object of a verb-object relation, manually verifying whether the corresponding event object in the event element set is indeed the event object in the text, and filtering out the event element if not.
Preferably, the training process of the first sequence-labeling BERT model includes:
inputting a plurality of sentence-level texts carrying trigger-word labels into a sequence-labeling BERT model as training data, and training it to obtain the trained first sequence-labeling BERT model.
Preferably, the training process of the second sequence-labeling BERT model includes:
adding a conditional random field (CRF) layer to the trained sequence-labeling BERT model to obtain the trained second sequence-labeling BERT model.
Preferably, the training process of the second sequence-labeling BERT model further includes:
inputting a plurality of sentence-level texts carrying event-element labels into a sequence-labeling BERT model as training data, and training it to obtain the trained second sequence-labeling BERT model.
The text event element extraction method provided by the embodiment of the invention has the following beneficial effects:
the method has the advantages that the trigger words are predicted by marking the BERT model through the trained first sequence, the event elements are predicted by marking the BERT model through the trained second sequence, the method is suitable for linguistic data of various sources, and the accuracy rate of extracting the event elements is high.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The text event element extraction method provided by the embodiment of the invention comprises the following steps:
S101, inputting the text into the trained first sequence-labeling BERT model to obtain a plurality of trigger words.
The first sequence-labeling BERT model uses the encoder of the Transformer architecture. The Transformer is built on an attention mechanism and can learn the contextual relationships between the words in a text. The original Transformer comprises two separate structures: an encoder, which receives the text as input, and a decoder, which predicts the result for the task.
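The scaled dot-product attention at the heart of the Transformer encoder can be sketched in a few lines of plain Python. This is illustrative only — the function names are not from the patent, and a real BERT encoder adds learned projections, multiple heads, and layer stacking:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries/keys/values: lists of equal-length float vectors.
    Returns one output vector per query: a softmax-weighted mix of values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

Because every query attends over every key, each token's output vector mixes in information from the whole sentence — this is how the encoder learns context.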
S102, inputting the plurality of trigger words, together with the text in which they occur, into the trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
The specific process for extracting the event elements comprises the following steps:
the embedding layer of the second sequence labeling BERT model converts an input text into three embedding characteristics of sub-word embedding, position embedding and segmentation embedding, the position of a trigger word in the sub-word embedding characteristics is replaced by 1, and the coding layer constructs a vector representation representing the semantics of each character to be classified based on the semantic vector of the sub-word output by the embedding layer. And the output layer finally inputs the vector representation corresponding to each word into a full-connection layer for multi-classification, and the class with the highest probability is taken as the classification mark of the word.
Optionally, after generating the set of event elements, the method further comprises:
obtaining the syntactic dependency relationship between each trigger word and each event element using a Language Technology Platform (LTP) model;
and determining, according to the syntactic dependency relationship, whether each event element is correct.
Optionally, determining whether each event element is correct according to the syntactic dependency relationship includes:
when the syntactic dependency relationship shows that the element is the subject of a subject-predicate relation, manually verifying whether the corresponding event subject among the event elements is indeed the event subject in the text, and filtering out the event element if not.
Optionally, determining whether each event element is correct according to the syntactic dependency relationship further includes:
when the syntactic dependency relationship shows that the element is the object of a verb-object relation, manually verifying whether the corresponding event object among the event elements is indeed the event object in the text, and filtering out the event element if not.
As a specific embodiment, for the text "the Northern Army massed 100 aircraft in Iraq", the first sequence-labeling BERT model predicts "massed" as the trigger word, and the second sequence-labeling BERT model predicts "the Northern Army" as the event subject and "100 aircraft" as the event object. The LTP model then determines that "the Northern Army" stands in a subject-predicate relation to "massed" and that "aircraft" stands in a verb-object relation to it, confirming "the Northern Army" as the event subject and "100 aircraft" as the event object.
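The dependency-based filtering step above can be sketched as follows. This is plain Python and not from the patent; the role names, the relation labels `SBV` (subject-predicate) and `VOB` (verb-object), and the function name are illustrative assumptions based on common LTP conventions:

```python
def filter_event_elements(elements, dependencies):
    """Keep only event elements whose syntactic relation to the trigger
    matches the expected role: subjects must stand in a subject-predicate
    (SBV) relation, objects in a verb-object (VOB) relation.

    elements: dict mapping role -> word, e.g. {"subject": ..., "object": ...}
    dependencies: dict mapping word -> relation label from the parser.
    """
    expected = {"subject": "SBV", "object": "VOB"}
    kept = {}
    for role, word in elements.items():
        required = expected.get(role)
        # Roles without a dependency constraint (e.g. time, place) pass through.
        if required is None or dependencies.get(word) == required:
            kept[role] = word
    return kept
```

An element whose parsed relation contradicts its predicted role (e.g. a supposed object that the parser sees as an attribute) is dropped, which is the manual-verification filter the description calls for.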
Optionally, the training process of the first sequence-labeling BERT model includes:
inputting a plurality of sentence-level texts carrying trigger-word labels into the sequence-labeling BERT model as training data, and training it to obtain the trained first sequence-labeling BERT model.
Optionally, the training process of the second sequence labeling BERT model includes:
and adding a CRF layer of the conditional random field CRF model to the trained sequence label BERT model to obtain a trained second sequence label BERT model.
Wherein the CRF layer of the conditional random field CRF model is used to learn the relationships between different labels, rather than making independent predictions.
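The effect of the CRF layer can be illustrated with a minimal Viterbi decoder: it picks the label sequence that maximizes emission scores plus learned label-transition scores, rather than each token's label in isolation. This plain-Python sketch is not the patent's implementation (in practice a library CRF layer, e.g. the third-party `pytorch-crf` package, would be used):

```python
def viterbi_decode(emissions, transitions, labels):
    """Best label sequence under per-token emission scores plus
    label-to-label transition scores, as a CRF layer would compute it.

    emissions: one dict (label -> score) per token, from the encoder.
    transitions: dict ((prev_label, label) -> score); missing pairs score 0.
    """
    # Initialise with the first token's emission scores.
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for emit in emissions[1:]:
        nxt = {}
        for lab in labels:
            # Choose the best previous label leading into this label.
            prev = max(labels,
                       key=lambda p: best[p][0] + transitions.get((p, lab), 0.0))
            score = best[prev][0] + transitions.get((prev, lab), 0.0) + emit[lab]
            nxt[lab] = (score, best[prev][1] + [lab])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]
```

A negative transition score for an implausible label pair (e.g. `O` directly followed by an inside tag) steers the decoder away from sequences that per-token argmax would happily produce.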
Optionally, the training process of the second sequence-labeling BERT model further includes:
inputting a plurality of sentence-level texts carrying event-element labels into the sequence-labeling BERT model as training data, and training it to obtain the trained second sequence-labeling BERT model.
As a specific embodiment, the labeled data is converted into sequence-labeling format using IOB tags: the tag I marks a character inside a text block, the tag O marks a character outside any text block, and the tag B marks the first character of a text block that immediately follows another block of the same type.
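Decoding an IOB tag sequence back into labelled spans can be sketched as follows (illustrative only; the function name and tag set are not from the patent):

```python
def iob_to_spans(tokens, tags):
    """Group character-level IOB tags back into labelled text spans.

    B-<type> opens a new span, I-<type> continues the current one,
    and O closes it. Returns a list of (text, type) pairs.
    """
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous span before opening a new one
                spans.append(("".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O", or a stray I- tag with no open span
            if current:
                spans.append(("".join(current), label))
            current, label = [], None
    if current:
        spans.append(("".join(current), label))
    return spans
```

This is the inverse of the labeling step: the model emits one tag per character, and the decoder reassembles contiguous B/I runs into trigger-word and event-element spans.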
According to the text event element extraction method provided by the embodiment of the invention, the text is input into the trained first sequence-labeling BERT model to obtain a plurality of trigger words, and the trigger words together with the text in which they occur are input into the trained second sequence-labeling BERT model to obtain the corresponding event elements and generate an event element set, thereby improving both the applicability and the accuracy of event element extraction.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the related features of the method and apparatus described above may refer to one another.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In addition, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (7)
1. A text event element extraction method is characterized by comprising the following steps:
inputting the text into a trained first sequence-labeling BERT model to obtain a plurality of trigger words;
and inputting the plurality of trigger words, together with the text in which they occur, into a trained second sequence-labeling BERT model to obtain the event elements corresponding to the plurality of trigger words and generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
2. The textual event element extraction method of claim 1, wherein after generating the set of event elements, the method further comprises:
obtaining the syntactic dependency relationship between each trigger word and each event element using a Language Technology Platform (LTP) model;
and determining, according to the syntactic dependency relationship, whether each event element is correct.
3. The method of claim 2, wherein determining whether each event element is correct according to the syntactic dependency relationship comprises:
when the syntactic dependency relationship shows that the element is the subject of a subject-predicate relation, manually verifying whether the corresponding event subject among the event elements is indeed the event subject in the text, and filtering out the event element if not.
4. The text event element extraction method of claim 2, wherein determining whether each event element is correct according to the syntactic dependency relationship further comprises:
when the syntactic dependency relationship shows that the element is the object of a verb-object relation, manually verifying whether the corresponding event object among the event elements is indeed the event object in the text, and filtering out the event element if not.
5. The text event element extraction method of claim 1, wherein the training process of the first sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying trigger-word labels into a sequence-labeling BERT model as training data, and training it to obtain the trained first sequence-labeling BERT model.
6. The text event element extraction method of claim 1, wherein the training process of the second sequence-labeling BERT model comprises:
adding a conditional random field (CRF) layer to the trained sequence-labeling BERT model to obtain the trained second sequence-labeling BERT model.
7. The text event element extraction method of claim 6, wherein the training process of the second sequence-labeling BERT model further comprises:
inputting a plurality of sentence-level texts carrying event-element labels into a sequence-labeling BERT model as training data, and training it to obtain the trained second sequence-labeling BERT model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011510822.6A CN112541341A (en) | 2020-12-18 | 2020-12-18 | Text event element extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112541341A true CN112541341A (en) | 2021-03-23 |
Family
ID=75019132
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011510822.6A Pending CN112541341A (en) | 2020-12-18 | 2020-12-18 | Text event element extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112541341A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254628A (en) * | 2021-05-18 | 2021-08-13 | 北京中科智加科技有限公司 | Event relation determining method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104679850A (en) * | 2015-02-13 | 2015-06-03 | 深圳市华傲数据技术有限公司 | Address structuring method and device |
CN107122416A (en) * | 2017-03-31 | 2017-09-01 | 北京大学 | A kind of Chinese event abstracting method |
CN111222317A (en) * | 2019-10-16 | 2020-06-02 | 平安科技(深圳)有限公司 | Sequence labeling method, system and computer equipment |
CN111651986A (en) * | 2020-04-28 | 2020-09-11 | 银江股份有限公司 | Event keyword extraction method, device, equipment and medium |
CN111881299A (en) * | 2020-08-07 | 2020-11-03 | 哈尔滨商业大学 | Outlier event detection and identification method based on duplicate neural network |
CN112084381A (en) * | 2020-09-11 | 2020-12-15 | 广东电网有限责任公司 | Event extraction method, system, storage medium and equipment |
CN112084746A (en) * | 2020-09-11 | 2020-12-15 | 广东电网有限责任公司 | Entity identification method, system, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||