CN112541341A - Text event element extraction method - Google Patents

Text event element extraction method

Info

Publication number
CN112541341A
Authority
CN
China
Prior art keywords
event
bert model
sequence
text
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011510822.6A
Other languages
Chinese (zh)
Inventor
Su Huaquan
Zhou Fangfang
Liao Peng
Cai Xiong
Yi Shimin
Peng Zewu
Yang Qiuyong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202011510822.6A
Publication of CN112541341A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text event element extraction method, relating to the field of computer technology. A text is input into a trained first sequence-labeling BERT model to obtain a plurality of trigger words; the trigger words and the text in which they appear are then input into a trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and to generate an event element set. The method improves both the applicability and the accuracy of event extraction.

Description

Text event element extraction method
Technical Field
The invention relates to the technical field of computers, in particular to a text event element extraction method.
Background
Event element extraction is one of the basic tasks in the field of natural language processing and an important subtask of information extraction. It aims to extract the most important event information from a text; concretely, the work is to identify the events that occur in a passage and the various elements of those events. For example, the trigger word and the event elements of a text are extracted, where the event elements include the event subject, the event object, the time, and the place.
Existing event element extraction schemes mainly rely on user-defined trigger words and extract event elements with machine learning methods, converting the extraction process into a classification problem.
Disclosure of Invention
To overcome the shortcomings of the prior art, an embodiment of the present invention provides a text event element extraction method, comprising the following steps:
inputting a text into a trained first sequence-labeling BERT model to obtain a plurality of trigger words;
and inputting the plurality of trigger words and the text in which they appear into a trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and to generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
Preferably, after generating the set of event elements, the method further comprises:
obtaining the syntactic dependency relationship between the trigger word and each event element by using a Language Technology Platform (LTP) model;
and judging, according to the syntactic dependency relationship, whether each event element is correct.
Preferably, judging whether each event element is correct according to the syntactic dependency relationship comprises:
when the event element is the subject of a subject-predicate (SBV) relation, manually judging whether the corresponding event subject in the event element set is truly the event subject of the text, and filtering out the event element if it is not.
Preferably, judging whether each event element is correct according to the syntactic dependency relationship further comprises:
when the event element is the object of a verb-object (VOB) relation, manually judging whether the corresponding event object in the event element set is truly the event object of the text, and filtering out the event element if it is not.
Preferably, the training process of the first sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying trigger-word labels as training data into a sequence-labeling BERT model, and training it to obtain the trained first sequence-labeling BERT model.
Preferably, the training process of the second sequence-labeling BERT model comprises:
adding a conditional random field (CRF) layer to the trained sequence-labeling BERT model to obtain the trained second sequence-labeling BERT model.
Preferably, the training process of the sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying event-element labels as training data into a sequence-labeling BERT model, and training it to obtain the trained second sequence-labeling BERT model.
The text event element extraction method provided by the embodiment of the invention has the following beneficial effects:
trigger words are predicted by the trained first sequence-labeling BERT model and event elements are predicted by the trained second sequence-labeling BERT model, so the method is applicable to corpora from various sources and extracts event elements with high accuracy.
Detailed Description
The present invention will be described in detail with reference to the following embodiments.
The text event element extraction method provided by the embodiment of the invention comprises the following steps:
s101, marking the first sequence trained by text input to a BERT model to obtain a plurality of trigger words.
The first sequence-labeling BERT model uses the encoder structure of the Transformer model. The Transformer is based on an attention mechanism and can learn the contextual relationships between the words of a text. The original Transformer comprises two independent structures: an encoder, which receives text as input, and a decoder, which predicts the output of the task.
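As an illustrative sketch (not part of the patent text), trigger-word prediction with a fine-tuned sequence-labeling BERT model could look as follows; the HuggingFace transformers library and the checkpoint path "path/to/first-trigger-bert" are assumptions, since the patent does not name a toolkit or checkpoint.
```python
# Minimal sketch: predicting trigger words with a token-classification BERT model.
# Assumes the HuggingFace `transformers` library; "path/to/first-trigger-bert"
# is a hypothetical placeholder for a fine-tuned checkpoint.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("path/to/first-trigger-bert")
model = BertForTokenClassification.from_pretrained("path/to/first-trigger-bert")
model.eval()

def predict_triggers(text: str) -> list[str]:
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**enc).logits          # (1, seq_len, num_labels)
    label_ids = logits.argmax(dim=-1)[0].tolist()
    id2label = model.config.id2label

    # Merge consecutive non-"O" tokens into trigger-word spans.
    triggers, span = [], None
    for (start, end), lid in zip(offsets, label_ids):
        if start == end:                      # special tokens ([CLS], [SEP])
            continue
        if id2label[lid] != "O":
            span = (span[0], end) if span else (start, end)
        elif span:
            triggers.append(text[span[0]:span[1]])
            span = None
    if span:
        triggers.append(text[span[0]:span[1]])
    return triggers
```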
And S102, inputting the plurality of trigger words and the text in which they appear into the trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and to generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
The specific process of extracting the event elements is as follows:
The embedding layer of the second sequence-labeling BERT model converts the input text into three embedding features: sub-word embeddings, position embeddings, and segment embeddings, with the positions of the trigger words in the sub-word embedding features set to 1. Based on the sub-word semantic vectors output by the embedding layer, the encoding layer builds a vector representation of the semantics of each character to be classified. The output layer feeds the vector representation of each character into a fully connected layer for multi-class classification and takes the class with the highest probability as the character's label.
Optionally, after generating the event element set, the method further comprises:
obtaining the syntactic dependency relationship between the trigger word and each event element by using a Language Technology Platform (LTP) model;
and judging, according to the syntactic dependency relationship, whether each event element is correct.
Optionally, judging whether each event element is correct according to the syntactic dependency relationship comprises:
when the event element is the subject of a subject-predicate (SBV) relation, manually judging whether the corresponding event subject among the event elements is truly the event subject of the text, and filtering out the event element if it is not.
Optionally, judging whether each event element is correct according to the syntactic dependency relationship further comprises:
when the event element is the object of a verb-object (VOB) relation, manually judging whether the corresponding event object among the event elements is truly the event object of the text, and filtering out the event element if it is not.
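As an illustrative sketch (the patent does not prescribe an implementation), the dependency check can be run with the open-source LTP toolkit (https://github.com/HIT-SCIR/ltp), which the "Language Technology Platform (LTP) model" presumably denotes. The API below follows LTP 4.0/4.1 (`seg`/`dep`); newer releases expose a `pipeline` interface instead.
```python
# Sketch: collect the SBV/VOB dependents of the trigger word with LTP.
from ltp import LTP

ltp = LTP()  # downloads/loads the default pretrained model

def trigger_dependencies(sentence: str, trigger: str):
    seg, hidden = ltp.seg([sentence])
    words = seg[0]
    # dep: list of (word_index, head_index, relation) triples, 1-based, 0 = root
    dep = ltp.dep(hidden)[0]
    results = []
    for i, head, rel in dep:
        head_word = words[head - 1] if head > 0 else "ROOT"
        if head_word == trigger and rel in ("SBV", "VOB"):
            results.append((words[i - 1], rel))
    return results

# "SBV" (subject-verb) dependents of the trigger are event-subject candidates;
# "VOB" (verb-object) dependents are event-object candidates. Elements whose
# expected relation is absent are flagged for manual review and filtering.
```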
As a specific embodiment, for the text "the Northern Army amassed 100 aircraft in Iraq", the trained first sequence-labeling BERT model predicts "amassed" as the trigger word, and the trained second sequence-labeling BERT model predicts "the Northern Army" as the event subject and "100 aircraft" as the event object; the LTP model then determines that the syntactic dependency between "amassed" and "the Northern Army" is a subject-predicate relation and that the dependency between "amassed" and "aircraft" is a verb-object relation, so "the Northern Army" is confirmed as the event subject and "100 aircraft" as the event object.
Optionally, the training process of the first sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying trigger-word labels as training data into the sequence-labeling BERT model, and training it to obtain the trained first sequence-labeling BERT model.
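A minimal sketch of this training step, assuming a token-classification BERT fine-tuned on the labeled sentences: `train_loader` is an assumed DataLoader yielding batches with input_ids, attention_mask, and per-token `labels` (IOB tag ids, with -100 on special tokens, the usual convention); the patent names neither a framework nor hyperparameters.
```python
# Sketch: fine-tuning a token-classification BERT on trigger-word labels.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=3  # O / I / B tags for trigger words
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()

for epoch in range(3):
    for batch in train_loader:          # assumed DataLoader, see lead-in
        loss = model(**batch).loss      # token-level cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```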
Optionally, the training process of the second sequence-labeling BERT model comprises:
adding a conditional random field (CRF) layer to the trained sequence-labeling BERT model to obtain the trained second sequence-labeling BERT model.
The CRF layer learns the dependencies between different labels instead of predicting each label independently.
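The following sketch shows one way to stack a CRF layer on top of the sequence-labeling BERT model, here using the third-party pytorch-crf package as one possible CRF implementation; the patent does not prescribe a library or architecture details.
```python
# Sketch: BERT encoder + linear emission layer + CRF for sequence labeling.
import torch
import torch.nn as nn
from torchcrf import CRF                    # pip install pytorch-crf
from transformers import BertModel

class BertCrfTagger(nn.Module):
    def __init__(self, num_labels: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Viterbi decoding yields the most likely tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)
```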
Optionally, the training process of the sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying event-element labels as training data into the sequence-labeling BERT model, and training it to obtain the trained second sequence-labeling BERT model.
As a specific embodiment, the labeled data is converted into a sequence-labeling format using IOB tags: the tag I marks a character inside a text block, the tag O marks a character outside any text block, and the tag B marks the first character of a text block that immediately follows another text block of the same type.
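A short sketch of this conversion follows (IOB-1 as described above: B is only used when a block directly follows a block of the same type). The input span format, character offsets plus a role label, is an assumed representation for illustration.
```python
# Sketch: convert character-span annotations into IOB-1 tags.
def to_iob(text: str, spans: list[tuple[int, int, str]]) -> list[str]:
    tags = ["O"] * len(text)
    prev_end, prev_role = -1, None
    for start, end, role in sorted(spans):
        # B only if this block touches a preceding block of the same type.
        first_tag = "B" if (start == prev_end and role == prev_role) else "I"
        tags[start] = f"{first_tag}-{role}"
        for i in range(start + 1, end):
            tags[i] = f"I-{role}"
        prev_end, prev_role = end, role
    return tags

# Hypothetical example: to_iob("北军集结飞机", [(0, 2, "SUB"), (4, 6, "OBJ")])
# -> ['I-SUB', 'I-SUB', 'O', 'O', 'I-OBJ', 'I-OBJ']
```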
According to the text event element extraction method provided by the embodiment of the invention, a text is input into the trained first sequence-labeling BERT model to obtain a plurality of trigger words, and the trigger words together with the text in which they appear are input into the trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and to generate an event element set, which improves both the applicability and the accuracy of event element extraction.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be appreciated that the related features of the methods and apparatuses described above may refer to one another.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In addition, the memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (7)

1. A text event element extraction method, characterized by comprising the following steps:
inputting a text into a trained first sequence-labeling BERT model to obtain a plurality of trigger words;
and inputting the plurality of trigger words and the text in which they appear into a trained second sequence-labeling BERT model to obtain the event elements corresponding to the trigger words and to generate an event element set, wherein the event elements comprise an event subject, an event object, a time, and a place.
2. The text event element extraction method of claim 1, wherein after generating the event element set, the method further comprises:
obtaining the syntactic dependency relationship between the trigger word and each event element by using a Language Technology Platform (LTP) model;
and judging, according to the syntactic dependency relationship, whether each event element is correct.
3. The text event element extraction method of claim 2, wherein judging whether each event element is correct according to the syntactic dependency relationship comprises:
when the event element is the subject of a subject-predicate (SBV) relation, manually judging whether the corresponding event subject among the event elements is truly the event subject of the text, and filtering out the event element if it is not.
4. The text event element extraction method of claim 2, wherein judging whether each event element is correct according to the syntactic dependency relationship further comprises:
when the event element is the object of a verb-object (VOB) relation, manually judging whether the corresponding event object among the event elements is truly the event object of the text, and filtering out the event element if it is not.
5. The text event element extraction method of claim 1, wherein the training process of the first sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying trigger-word labels as training data into a sequence-labeling BERT model, and training it to obtain the trained first sequence-labeling BERT model.
6. The text event element extraction method of claim 1, wherein the training process of the second sequence-labeling BERT model comprises:
adding a conditional random field (CRF) layer to the trained sequence-labeling BERT model to obtain the trained second sequence-labeling BERT model.
7. The text event element extraction method of claim 6, wherein the training process of the sequence-labeling BERT model comprises:
inputting a plurality of sentence-level texts carrying event-element labels as training data into a sequence-labeling BERT model, and training it to obtain the trained second sequence-labeling BERT model.
CN202011510822.6A 2020-12-18 2020-12-18 Text event element extraction method Pending CN112541341A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011510822.6A CN112541341A (en) 2020-12-18 2020-12-18 Text event element extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011510822.6A CN112541341A (en) 2020-12-18 2020-12-18 Text event element extraction method

Publications (1)

Publication Number Publication Date
CN112541341A (en) 2021-03-23

Family

ID=75019132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011510822.6A Pending CN112541341A (en) 2020-12-18 2020-12-18 Text event element extraction method

Country Status (1)

Country Link
CN (1) CN112541341A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254628A * 2021-05-18 2021-08-13 Beijing Zhongke Zhijia Technology Co., Ltd. Event relation determining method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679850A * 2015-02-13 2015-06-03 Shenzhen Huaao Data Technology Co., Ltd. Address structuring method and device
CN107122416A * 2017-03-31 2017-09-01 Peking University A Chinese event extraction method
CN111222317A * 2019-10-16 2020-06-02 Ping An Technology (Shenzhen) Co., Ltd. Sequence labeling method, system and computer equipment
CN111651986A * 2020-04-28 2020-09-11 Enjoyor Co., Ltd. Event keyword extraction method, device, equipment and medium
CN111881299A * 2020-08-07 2020-11-03 Harbin University of Commerce Outlier event detection and identification method based on replicator neural network
CN112084381A * 2020-09-11 2020-12-15 Guangdong Power Grid Co., Ltd. Event extraction method, system, storage medium and equipment
CN112084746A * 2020-09-11 2020-12-15 Guangdong Power Grid Co., Ltd. Entity identification method, system, storage medium and equipment


Similar Documents

Publication Publication Date Title
US20170308790A1 (en) Text classification by ranking with convolutional neural networks
CN111738016A (en) Multi-intention recognition method and related equipment
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN112560504B (en) Method, electronic equipment and computer readable medium for extracting information in form document
CN110569330A (en) text labeling system, device, equipment and medium based on intelligent word selection
CN108205524B (en) Text data processing method and device
CN116152843B (en) Category identification method, device and storage medium for contract template to be filled-in content
CN112101526A (en) Knowledge distillation-based model training method and device
CN113221555A (en) Keyword identification method, device and equipment based on multitask model
CN112287100A (en) Text recognition method, spelling error correction method and voice recognition method
Moeng et al. Canonical and surface morphological segmentation for nguni languages
CN113222022A (en) Webpage classification identification method and device
CN112395412A (en) Text classification method, device and computer readable medium
CN116228383A (en) Risk prediction method and device, storage medium and electronic equipment
Shi et al. A brief survey of relation extraction based on distant supervision
CN112541341A (en) Text event element extraction method
CN117787226A (en) Label generation model training method and device, electronic equipment and storage medium
CN111062204B (en) Text punctuation use error identification method and device based on machine learning
CN113051910A (en) Method and device for predicting emotion of character role
CN115204164B (en) Method, system and storage medium for identifying communication sensitive information of power system
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN115640810A (en) Method, system and storage medium for identifying communication sensitive information of power system
CN111461330B (en) Multilingual knowledge base construction method and system based on multilingual resume
CN112256841B (en) Text matching and countermeasure text recognition method, device and equipment
CN116150308A (en) Training method of recognition model, recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination