CN110727803A - Text event extraction method and device - Google Patents


Info

Publication number
CN110727803A
Authority
CN
China
Prior art keywords
text
events
event
entity
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910959652.0A
Other languages
Chinese (zh)
Inventor
罗华刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN201910959652.0A
Publication of CN110727803A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text event extraction method and a text event extraction device, wherein the method comprises the following steps: cleaning and word segmentation are carried out on the text to be processed; performing dependency syntax analysis on the text after word segmentation to obtain sentence components of each sentence in the text; performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text to extract events in the text; and constructing the map of the event in the form of entity-relation-entity according to the syntactic structure. In the invention, the text information is analyzed from the syntax structure, the event extraction does not need to set a specified mode, the universality is strong, and the text information can be effectively utilized by text completion and entity standardization, so that the event extraction is more reasonable and effective.

Description

Text event extraction method and device
Technical Field
The invention relates to the field of text processing, in particular to a text event extraction method and device.
Background
Event extraction refers to presenting unstructured text that contains event information in a structured form. Extracting events from large volumes of text allows the text data to be analyzed with methods designed for structured data. For example, extracting fire events from fire news reports makes it easier to study fires and to prevent them before they occur.
Existing text event extraction methods generally fall into two types: event extraction based on pattern matching and event extraction based on traditional machine learning.
The event extraction method based on pattern matching is to identify and extract events from texts by defining a series of patterns. The event extraction method based on the traditional machine learning is characterized in that the event extraction problem is converted into a classification problem, and event classification and event argument identification are realized through a traditional classification algorithm.
For the event extraction method based on pattern matching, the patterns are defined from domain knowledge by specifying the context of the event arguments to be extracted. Defining patterns therefore depends on expert knowledge and must be done manually, at high labor and time cost. In addition, because the patterns are fixed, the system ports poorly: migrating from one domain to another requires rebuilding the patterns. Even within the same domain, the patterns may no longer be applicable as technology evolves over time.
Machine-learning-based methods do not depend on the content and format of the corpus, but they require a large-scale labeled corpus, and labeling a corpus manually is time-consuming and labor-intensive. The quality of the corpus directly affects the quality of event extraction, and event arguments still need to be defined.
Disclosure of Invention
The embodiment of the invention provides a text event extraction method and a text event extraction device, which are used for at least solving the problem that the universality is lacked because a specified mode needs to be set when text events are extracted in the related technology.
According to an embodiment of the present invention, there is provided a text event extraction method including: cleaning and word segmentation are carried out on the text to be processed; performing dependency syntax analysis on the text after word segmentation to obtain sentence components of each sentence in the text; performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text to extract events in the text; and constructing the map of the event in the form of entity-relation-entity according to the syntactic structure.
Optionally, before constructing the graph of the event in an entity-relationship-entity form according to a syntactic structure, the method further includes: comparing the extracted events with the events in the database for similarity, and determining that two events are the same event when their similarity exceeds a set threshold; and merging events that are the same, and storing events that are not as new events in the database.
Optionally, after constructing the graph of the event in the text in an entity-relationship-entity form according to a syntactic structure, the method further includes: and visually displaying the map of the event.
Optionally, the method further comprises: if the event is an existing event in the database, the map of the event can be displayed on its own or displayed after being merged with the existing event in the database.
Optionally, before the text to be processed is cleaned and word-segmented, the method further includes: and acquiring the text to be processed.
According to another embodiment of the present invention, there is provided a text event extraction apparatus including: the word segmentation module is used for cleaning and segmenting the text to be processed; the syntactic analysis module is used for carrying out dependency syntactic analysis on the text after word segmentation so as to obtain sentence components of each sentence in the text; the supplement module is used for performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text so as to extract events in the text; and the construction module is used for constructing the map of the event in the form of entity-relation-entity according to the syntactic structure.
Optionally, the apparatus further comprises: the comparison module is used for comparing the extracted events with the events in the database for similarity, and determining that two events are the same event when their similarity exceeds a set threshold; and the merging module is used for merging events that are the same and storing events that are not as new events in the database.
Optionally, the apparatus further comprises: and the display module is used for visually displaying the map of the event.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiment of the invention, the text information is analyzed from the syntax structure, and the event extraction does not need to set a specified mode, so that the method has strong universality. Text completion and entity standardization can effectively utilize text information, so that event extraction is more reasonable and effective.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a text event extraction method according to an embodiment of the invention;
FIG. 2 is a flow diagram of a method of text event extraction according to an alternative embodiment of the invention;
FIG. 3 is a text event map presentation diagram according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a text event extraction device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text event extraction device according to an alternative embodiment of the invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a text event extraction method is provided, and fig. 1 is a flowchart of a method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step S102, cleaning and word segmentation are carried out on the text to be processed;
step S104, performing dependency syntax analysis on the text after word segmentation to obtain sentence components of each sentence in the text;
step S106, performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text to extract events in the text;
step S108, constructing a map of the event in an entity-relation-entity form according to the syntactic structure.
Before step S102 in this embodiment, the method may further include: acquiring the text to be processed.
Before step S108 in this embodiment, the method may further include: comparing the extracted events with the events in the database for similarity, and determining that two events are the same event when their similarity exceeds a set threshold; and merging events that are the same, and storing events that are not as new events in the database.
After step S108 of this embodiment, the method may further include: visually displaying the map of the event.
In this embodiment, the method may further include: if the event is an existing event in the database, the map of the event can be displayed independently or after being combined with the existing event in the database.
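The ordering of steps S102 to S108 and of the optional steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the class name EventPipeline, its field names, and the idea of passing each step in as a callable are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EventPipeline:
    """Hypothetical wiring of the steps; every stage is supplied by the caller."""
    clean_and_segment: Callable[[str], Any]         # step S102
    parse: Callable[[Any], Any]                     # step S104
    supplement_and_normalize: Callable[[Any], Any]  # step S106
    build_map: Callable[[Any], Any]                 # step S108
    integrate: Optional[Callable[[Any], Any]] = None  # optional, runs before S108
    display: Optional[Callable[[Any], None]] = None   # optional, runs after S108

    def run(self, text: str) -> Any:
        tokens = self.clean_and_segment(text)
        components = self.parse(tokens)
        events = self.supplement_and_normalize(components)
        if self.integrate is not None:
            # Similarity comparison and merging with stored events.
            events = self.integrate(events)
        event_map = self.build_map(events)
        if self.display is not None:
            self.display(event_map)
        return event_map
```

Keeping the optional steps as None by default mirrors the text: comparison with the database and visual display may be added, but the four core steps run either way.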
In order to facilitate understanding of the technical solutions provided by the embodiments of the present invention, the following detailed description is given with reference to embodiments of specific application scenarios.
The embodiment provides a text event extraction and visualization method. In the embodiment, the event information can be comprehensively and accurately extracted.
As shown in fig. 2, the present embodiment mainly includes the following steps:
step S201, text data to be processed is acquired.
Step S202, cleaning and word segmentation are carried out on the text data.
The text to be processed is cleaned, for example by removing illegal characters and unifying punctuation, and the cleaned text is then segmented into words.
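A minimal sketch of this cleaning step, assuming that "illegal characters" means Unicode control and format characters and that "unifying punctuation" means mapping full-width marks to ASCII. The mapping table, the function names, and the regex tokenizer are assumptions; a real system for Chinese text would use a dedicated word segmenter in place of segment().

```python
import re
import unicodedata

# Hypothetical mapping from full-width punctuation to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":",
    "（": "(", "）": ")", "“": '"', "”": '"',
})


def clean_text(raw: str) -> str:
    """Drop control/format characters, unify punctuation, collapse whitespace."""
    kept = "".join(
        ch for ch in raw
        # Unicode categories starting with "C" are control, format, etc.
        # Newlines and tabs are kept here so that words stay separated.
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    unified = kept.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", unified).strip()


def segment(text: str) -> list:
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


cleaned = clean_text("A fire\x00 broke out，  4 died。")
tokens = segment(cleaned)
```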
Step S203, dependency parsing.
Dependency syntactic analysis and labeling are performed on all sentences of the text to obtain the subject, predicate, and object components of each sentence, and optionally the attributive, adverbial, and complement components as well.
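How sentence components might be read off a dependency parse can be sketched as follows. The parse itself is written by hand here; the Token structure and the relation labels (SBV for subject, VOB for object, HED for the root, ATT, ADV, CMP for attributive, adverbial, complement) are assumptions modeled on label sets used by some Chinese dependency parsers, not something the patent specifies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    idx: int   # 1-based position in the sentence
    text: str
    head: int  # idx of the head token, 0 for the root
    rel: str   # dependency label (assumed label set, see lead-in)


def sentence_components(tokens: List[Token]) -> Dict[str, object]:
    """Collect the core and supplementary components of one parsed sentence."""
    by_idx = {t.idx: t for t in tokens}
    root = next(t for t in tokens if t.head == 0)
    comp = {"subject": None, "predicate": root.text, "object": None,
            "attributive": [], "adverbial": [], "complement": []}
    core = {"SBV": "subject", "VOB": "object"}
    extra = {"ATT": "attributive", "ADV": "adverbial", "CMP": "complement"}
    for t in tokens:
        if t.head == root.idx and t.rel in core:
            comp[core[t.rel]] = t.text
        elif t.rel in extra:
            # Store each modifier together with the word it modifies.
            comp[extra[t.rel]].append((t.text, by_idx[t.head].text))
    return comp


# Hand-written parse of "accident caused 4 deaths".
parsed = [
    Token(1, "accident", 2, "SBV"),
    Token(2, "caused", 0, "HED"),
    Token(3, "4", 4, "ATT"),
    Token(4, "deaths", 2, "VOB"),
]
components = sentence_components(parsed)
```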
Step S204, text supplement and entity normalization.
According to the knowledge base and the context of the text, the text is supplemented and its entities are normalized according to certain rules, including but not limited to: attributive expansion, subject-predicate-object completion, and adverbial attribution.
For example, for the sentence "the cause of the accident is still under investigation", the attributive is expanded according to the rules. "Accident" is the attributive of "cause", so the phrase can be expanded into "the accident has a cause".
For example, for the sentence "the accident caused 3 injuries and 4 deaths", the subject, predicate, and object can be completed according to the context. If the first clause is: accident (subject) caused (predicate) 3 injuries (object), then the second clause should be completed according to the context as: accident (subject) caused (predicate) 4 deaths (object).
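The subject-predicate-object completion in this example can be sketched as a rule that fills a clause's missing subject or predicate from the clause before it. This is one simple rule under stated assumptions: the dict layout and the function name complete_clauses are hypothetical, and the patent does not limit completion to this rule.

```python
from typing import Dict, List, Optional

Clause = Dict[str, Optional[str]]


def complete_clauses(clauses: List[Clause]) -> List[Clause]:
    """Fill a missing subject or predicate from the preceding clause."""
    completed: List[Clause] = []
    previous: Clause = {}
    for clause in clauses:
        filled = dict(clause)
        for role in ("subject", "predicate"):
            if filled.get(role) is None and previous.get(role) is not None:
                filled[role] = previous[role]
        completed.append(filled)
        # The completed clause becomes the context for the next one.
        previous = filled
    return completed


clauses = [
    {"subject": "accident", "predicate": "caused", "object": "3 injuries"},
    {"subject": None, "predicate": None, "object": "4 deaths"},
]
completed = complete_clauses(clauses)
```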
For example, for the sentence "the Guangyuan Road Squadron of the Weibin Brigade of the Baoji Fire Detachment immediately arrived at the scene to carry out the rescue … the squadron immediately dispatched the satellite fire station as reinforcement", the entity involved can be normalized according to the context. The later "squadron" should be unified to "Guangyuan Road Squadron of the Weibin Brigade of the Baoji Fire Detachment" according to the context.
For example, for the sentences "A fire broke out in the Zhongsen community. The accident caused 4 deaths", the entities involved can be unified according to the context and rules. Here "accident" refers to the fire, so "accident" is replaced with "fire".
For example, for the sentence "In the early morning of March 5, a fire broke out in the Zhongsen community", the adverbial can be attributed according to the rules. "In the early morning of March 5" is a time adverbial and should be taken as an attribute of "broke out".
The knowledge base mainly provides world knowledge, for example for unifying different surface forms of the same entity (such as an abbreviation and the full name of "Beijing university") into one canonical name. Context-dependent rules include, but are not limited to, the methods or rules mentioned in the above examples.
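Entity normalization along these lines can be sketched with two rules: an alias table lookup, and a context rule that resolves a short mention to an earlier, longer entity that ends with it. The alias table contents, the suffix-match rule, the function name normalize_entities, and the English entity strings are all illustrative assumptions.

```python
from typing import Dict, List


def normalize_entities(mentions: List[str], aliases: Dict[str, str]) -> List[str]:
    """Map each mention, in document order, to a canonical entity name."""
    seen: List[str] = []
    normalized: List[str] = []
    for mention in mentions:
        canonical = aliases.get(mention, mention)
        if canonical == mention:
            # Context rule: a short mention inherits the most recent longer
            # entity whose name ends with it (case-insensitive).
            for earlier in reversed(seen):
                if earlier != mention and earlier.lower().endswith(mention.lower()):
                    canonical = earlier
                    break
        normalized.append(canonical)
        if canonical not in seen:
            seen.append(canonical)
    return normalized


# Hypothetical alias entry: "accident" refers to the fire in this document.
ALIASES = {"accident": "fire"}
mentions = ["Guangyuan Road Squadron", "squadron", "fire", "accident"]
normalized = normalize_entities(mentions, ALIASES)
```

A suffix match is deliberately crude; it only stands in for whatever context rule an implementation would actually use.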
Step S205, archive integration
The extracted events are compared with the events in the database for similarity, mainly on the basis of time, place, and what happened. If the similarity to the most similar stored event exceeds a given threshold, the two are considered the same event. If the same event exists, the events are merged; if not, the extracted event is stored in the database as a new event.
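The comparison and merge logic can be sketched as below. The patent does not say how similarity is computed, so the choices here are assumptions: events are dicts with "time", "place", and "what" fields, field similarity uses the standard-library difflib.SequenceMatcher, the overall score is the unweighted mean, the threshold is 0.8, and merging simply pools a "details" list.

```python
from difflib import SequenceMatcher
from typing import Dict, List

FIELDS = ("time", "place", "what")


def field_similarity(a, b) -> float:
    """String similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def event_similarity(e1: Dict, e2: Dict) -> float:
    """Unweighted mean of the per-field similarities (an assumed scoring rule)."""
    return sum(field_similarity(e1.get(f), e2.get(f)) for f in FIELDS) / len(FIELDS)


def integrate(event: Dict, database: List[Dict], threshold: float = 0.8) -> Dict:
    """Merge into the most similar stored event, or store as a new event."""
    best, best_score = None, 0.0
    for stored in database:
        score = event_similarity(event, stored)
        if score > best_score:
            best, best_score = stored, score
    if best is not None and best_score > threshold:
        best.setdefault("details", []).extend(event.get("details", []))
        return best
    database.append(event)
    return event


database = [{"time": "March 5", "place": "Zhongsen community", "what": "fire",
             "details": ["4 deaths"]}]
integrate({"time": "March 5", "place": "Zhongsen community", "what": "fire",
           "details": ["cause under investigation"]}, database)
integrate({"time": "June 1", "place": "river bridge", "what": "collision"}, database)
```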
Step S206, visual display.
According to the syntactic structure, the map of the whole event is constructed in the form of entity-relation-entity and visualized. If the event already exists in the database, either individual display or merged display can be selected.
For example, consider the text "In the early morning of March 5, a fire broke out in the Zhongsen community of a certain city. The accident caused 3 injuries and 4 deaths. The cause of the accident is still under investigation". An event map of this text is shown in FIG. 3.
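An entity-relation-entity map for this example can be sketched as a list of triples plus a text rendering for display. The triples are hand-written guesses at what the earlier steps would produce; they are not taken from FIG. 3, and the relation names and function names are hypothetical. The rendering uses Graphviz DOT syntax as one possible display format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity, relation, entity)


def build_event_map(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group the triples into an adjacency list keyed by the head entity."""
    event_map = defaultdict(list)
    for head, relation, tail in triples:
        event_map[head].append((relation, tail))
    return dict(event_map)


def to_dot(triples: List[Triple]) -> str:
    """Render the triples as a Graphviz DOT digraph with labeled edges."""
    lines = ["digraph event {"]
    for head, relation, tail in triples:
        lines.append(f'  "{head}" -> "{tail}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical triples after completion, normalization, and attribution.
EXAMPLE_TRIPLES: List[Triple] = [
    ("fire", "occurred in", "Zhongsen community"),
    ("fire", "occurred at", "early morning of March 5"),
    ("fire", "caused", "3 injuries"),
    ("fire", "caused", "4 deaths"),
    ("fire", "has", "cause"),
    ("cause", "status", "under investigation"),
]
event_map = build_event_map(EXAMPLE_TRIPLES)
dot_text = to_dot(EXAMPLE_TRIPLES)
```

Because "accident" was normalized to "fire", all three sentences attach to a single node, which is what lets the map show the event as one connected structure.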
Compared with conventional methods, the event extraction in this embodiment does not require a predefined pattern and is therefore highly general. The method does not need a large-scale corpus for training, and it can extract the information in the text completely.
In the above-described embodiment of the present invention, the text data is analyzed from its syntactic structure without setting a predefined pattern. Text completion and entity normalization make effective use of the text information, so that event extraction is more reasonable and effective. Using historical event information also gives a more comprehensive view of how the whole event developed.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a text event extraction device is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram illustrating a structure of a text event extracting apparatus according to an embodiment of the present invention, which includes a segmentation module 10, a parsing module 20, a supplementation module 30, and a construction module 40, as shown in fig. 4.
The word segmentation module 10 is used for cleaning and segmenting the text to be processed.
The syntactic analysis module 20 is configured to perform dependency syntactic analysis on the segmented text to obtain the sentence components of each sentence in the text.
The supplement module 30 is used for performing component supplement and entity normalization on the sentences according to the knowledge base and the context of the text, so as to extract the events in the text.
The construction module 40 is used for constructing the map of the event in the form of entity-relation-entity according to the syntactic structure.
Fig. 5 is a block diagram showing a structure of a text event extracting apparatus according to an alternative embodiment of the present invention, which includes a comparison module 50 and a presentation module 60 in addition to all the modules shown in fig. 4, as shown in fig. 5.
The comparison module 50 is configured to compare the extracted events with the events in the database for similarity, and to determine that two events are the same event when their similarity exceeds a set threshold; the merging module is used for merging events that are the same and storing events that are not as new events in the database.
The display module 60 is used for visually displaying the map of the event.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A text event extraction method is characterized by comprising the following steps:
cleaning and word segmentation are carried out on the text to be processed;
performing dependency syntax analysis on the text after word segmentation to obtain sentence components of each sentence in the text;
performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text to extract events in the text;
and constructing the map of the event in the form of entity-relation-entity according to the syntactic structure.
2. The method of claim 1, further comprising, prior to constructing the graph of the event in an entity-relationship-entity form according to a syntactic structure:
comparing the similarity of the extracted events with the events in the database, and determining the events to be the same events when the similarity exceeds a set threshold;
and merging events that are the same, and storing events that are not as new events in the database.
3. The method of claim 1, after constructing the graph of the events in the text in an entity-relationship-entity form according to a syntactic structure, further comprising:
and visually displaying the map of the event.
4. The method of claim 2, further comprising:
if the event is an existing event in the database, the map of the event can be displayed on its own or displayed after being merged with the existing event in the database.
5. The method of claim 1, further comprising, prior to cleansing and tokenizing the text to be processed:
and acquiring the text to be processed.
6. A text event extraction device, comprising:
the word segmentation module is used for cleaning and segmenting the text to be processed;
the syntactic analysis module is used for carrying out dependency syntactic analysis on the text after word segmentation so as to obtain sentence components of each sentence in the text;
the supplement module is used for performing component supplement and entity normalization on sentences according to a knowledge base and the context of the text so as to extract events in the text;
and the construction module is used for constructing the map of the event in the form of entity-relation-entity according to the syntactic structure.
7. The apparatus of claim 6, further comprising:
the comparison module is used for comparing the similarity of the extracted events with the events in the database, and when the similarity exceeds a set threshold, the same events are determined;
and the merging module is used for merging events that are the same and storing events that are not as new events in the database.
8. The apparatus of claim 6, further comprising:
and the display module is used for visually displaying the map of the event.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
CN201910959652.0A 2019-10-10 2019-10-10 Text event extraction method and device Pending CN110727803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910959652.0A CN110727803A (en) 2019-10-10 2019-10-10 Text event extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910959652.0A CN110727803A (en) 2019-10-10 2019-10-10 Text event extraction method and device

Publications (1)

Publication Number Publication Date
CN110727803A true CN110727803A (en) 2020-01-24

Family

ID=69219988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910959652.0A Pending CN110727803A (en) 2019-10-10 2019-10-10 Text event extraction method and device

Country Status (1)

Country Link
CN (1) CN110727803A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100324A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Knowledge graph automatic check iteration method based on greedy entity link
CN113190674A (en) * 2021-05-08 2021-07-30 上海明略人工智能(集团)有限公司 Method, device, electronic equipment and readable storage medium for generating event context
CN114064937A (en) * 2022-01-14 2022-02-18 云孚科技(北京)有限公司 Method and system for automatically constructing case map
CN114880491A (en) * 2022-07-08 2022-08-09 云孚科技(北京)有限公司 Method and system for automatically constructing case map

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298635A (en) * 2011-09-13 2011-12-28 苏州大学 Method and system for fusing event information
CN109947897A (en) * 2019-03-15 2019-06-28 南京邮电大学 Judicial case event tree constructs system and method
CN110110870A (en) * 2019-06-05 2019-08-09 厦门邑通软件科技有限公司 A kind of equipment fault intelligent control method based on event graphical spectrum technology
CN110188191A (en) * 2019-04-08 2019-08-30 北京邮电大学 A kind of entity relationship map construction method and system for Web Community's text

Similar Documents

Publication Publication Date Title
CN110727803A (en) Text event extraction method and device
CN106649742B (en) Database maintenance method and device
CN110874531B (en) Topic analysis method and device and storage medium
CN108509477B (en) Method for recognizing semantics, electronic device and computer readable storage medium
US10503830B2 (en) Natural language processing with adaptable rules based on user inputs
US10824816B2 (en) Semantic parsing method and apparatus
CN112417846B (en) Text automatic generation method and device, electronic equipment and storage medium
US11556812B2 (en) Method and device for acquiring data model in knowledge graph, and medium
KR20220064016A (en) Method for extracting construction safety accident based data mining using big data
CN110825839B (en) Association relation analysis method for targets in text information
CN111079408B (en) Language identification method, device, equipment and storage medium
CN110765235A (en) Training data generation method and device, terminal and readable medium
CN111125355A (en) Information processing method and related equipment
CN111428503A (en) Method and device for identifying and processing same-name person
CN111966792A (en) Text processing method and device, electronic equipment and readable storage medium
CN111177401A (en) Power grid free text knowledge extraction method
US20200401767A1 (en) Summary evaluation device, method, program, and storage medium
KR102206742B1 (en) Method and apparatus for representing lexical knowledge graph from natural language text
CN109300550B (en) Medical data relation mining method and device
CN114969385B (en) Knowledge graph optimization method and device based on document attribute assignment entity weight
Ishii et al. Causal network construction to support understanding of news
CN115481239A (en) Social governance document abstract extraction method and device and electronic equipment
CN112148838B (en) Service source object extraction method and device
US20120144294A1 (en) Assisting document creation
CN114238654A (en) Knowledge graph construction method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200124
