CN110727803A - Text event extraction method and device - Google Patents
- Publication number
- CN110727803A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/33—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a text event extraction method and device. The method comprises: cleaning and segmenting the text to be processed into words; performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text; supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract the events in the text; and constructing a graph of the events in entity-relation-entity form according to the syntactic structure. Because the text is analyzed from its syntactic structure, event extraction does not require predefined patterns and is therefore broadly applicable; text completion and entity normalization make effective use of the information in the text, so that event extraction is more reasonable and effective.
Description
Technical Field
The invention relates to the field of text processing, in particular to a text event extraction method and device.
Background
Event extraction refers to presenting unstructured text that contains event information in a structured form. Once events have been extracted from large volumes of text, the text data can be analyzed with methods designed for structured data. For example, extracting fire events from news reports about fires makes fires easier to study, which in turn helps prevent them before they occur. Existing text event extraction methods generally fall into two types: methods based on pattern matching and methods based on traditional machine learning.
The event extraction method based on pattern matching identifies and extracts events from text by defining a series of patterns. The event extraction method based on traditional machine learning converts the event extraction problem into a classification problem, and performs event classification and event argument identification with traditional classification algorithms.
In the pattern-matching method, a pattern is defined by specifying the context of the event arguments to be extracted. However, pattern definition depends on domain and expert knowledge and must be done manually, so labor and time costs are high. In addition, because the patterns are fixed, such a system is poorly portable: migrating from one domain to another requires the patterns to be rebuilt. Even within the same domain, the patterns may cease to apply as the technology evolves over time.
Machine-learning-based methods do not depend on the content and format of the corpus, but they require the corpus to be labeled on a large scale, and manual labeling is time-consuming and labor-intensive. The quality of the corpus directly affects the quality of event extraction, and event arguments still have to be defined.
Disclosure of Invention
The embodiments of the invention provide a text event extraction method and device, which at least solve the problem in the related art that text event extraction lacks generality because predefined patterns must be set.
According to an embodiment of the present invention, there is provided a text event extraction method including: cleaning and segmenting the text to be processed into words; performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text; supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract events in the text; and constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
Optionally, before constructing the graph of the events in entity-relation-entity form according to the syntactic structure, the method further includes: comparing the similarity between the extracted events and the events in a database, and determining that two events are the same event when the similarity exceeds a set threshold; and merging events that are the same, and storing events that differ in the database as new events.
Optionally, after constructing the graph of the events in the text in entity-relation-entity form according to the syntactic structure, the method further includes: visually displaying the graph of the events.
Optionally, the method further includes: if an event already exists in the database, displaying the graph of the event either on its own or after merging it with the existing event in the database.
Optionally, before the text to be processed is cleaned and segmented, the method further includes: acquiring the text to be processed.
According to another embodiment of the present invention, there is provided a text event extraction device including: a word segmentation module, used for cleaning and segmenting the text to be processed; a syntactic analysis module, used for performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text; a supplement module, used for supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract events in the text; and a construction module, used for constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
Optionally, the device further includes: a comparison module, used for comparing the similarity between the extracted events and the events in a database, and determining that two events are the same event when the similarity exceeds a set threshold; and a merging module, used for merging events that are the same and storing events that differ in the database as new events.
Optionally, the device further includes: a display module, used for visually displaying the graph of the events.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiments of the invention, the text is analyzed from its syntactic structure, so event extraction does not require predefined patterns and is broadly applicable. Text completion and entity normalization make effective use of the information in the text, so that event extraction is more reasonable and effective.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a text event extraction method according to an embodiment of the invention;
FIG. 2 is a flow diagram of a method of text event extraction according to an alternative embodiment of the invention;
FIG. 3 is a text event map presentation diagram according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a text event extraction device according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a text event extraction device according to an alternative embodiment of the invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a text event extraction method is provided. FIG. 1 is a flowchart of the method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102, cleaning and segmenting the text to be processed into words;
Step S104, performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text;
Step S106, supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract events in the text;
Step S108, constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
Before step S102 in this embodiment, the method may further include: acquiring the text to be processed.
Before step S108 in this embodiment, the method may further include: comparing the similarity between the extracted events and the events in a database, and determining that two events are the same event when the similarity exceeds a set threshold; and merging events that are the same, and storing events that differ in the database as new events.
After step S108 of this embodiment, the method may further include: visually displaying the graph of the events.
In this embodiment, the method may further include: if an event already exists in the database, displaying the graph of the event either on its own or after merging it with the existing event in the database.
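By way of non-limiting illustration, the order of steps S102 to S108 can be sketched in Python as a chain of four caller-supplied stages. The function `extract_events` and the `toy_*` stand-ins are hypothetical names introduced only for this sketch; the stand-ins do no real linguistic analysis and merely allow the wiring to be run.

```python
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Components = Dict[str, str]
Triple = Tuple[str, str, str]


def extract_events(
    text: str,
    clean_and_segment: Callable[[str], List[Tokens]],
    parse: Callable[[Tokens], Components],
    complete_and_normalize: Callable[[List[Components]], List[Components]],
    build_graph: Callable[[List[Components]], List[Triple]],
) -> List[Triple]:
    """Chain the four steps; each stage is supplied by the caller."""
    sentences = clean_and_segment(text)               # step S102
    parsed = [parse(tokens) for tokens in sentences]  # step S104
    events = complete_and_normalize(parsed)           # step S106
    return build_graph(events)                        # step S108


# Toy stand-ins (assumptions): none of them is a real analyzer.
def toy_segment(text: str) -> List[Tokens]:
    return [s.split() for s in text.split(".") if s.strip()]


def toy_parse(tokens: Tokens) -> Components:
    # Pretends every sentence has the shape "subject predicate object...".
    return {"subject": tokens[0], "predicate": tokens[1],
            "object": " ".join(tokens[2:])}


def toy_complete(parsed: List[Components]) -> List[Components]:
    return parsed  # placeholder: no completion or normalization is performed


def toy_graph(events: List[Components]) -> List[Triple]:
    return [(e["subject"], e["predicate"], e["object"]) for e in events]


triples = extract_events(
    "fire destroyed warehouse. accident injured 3 people.",
    toy_segment, toy_parse, toy_complete, toy_graph,
)
print(triples)
```

Passing the stages in as callables keeps the sketch independent of any particular segmenter or parser, which the embodiment does not name.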
In order to facilitate understanding of the technical solutions provided by the embodiments of the present invention, the following detailed description is given with reference to embodiments of specific application scenarios.
The embodiment provides a text event extraction and visualization method. In the embodiment, the event information can be comprehensively and accurately extracted.
As shown in fig. 2, the present embodiment mainly includes the following steps:
Step S201, acquiring the text data to be processed.
Step S202, cleaning and segmenting the text data.
The text to be processed is cleaned, for example by removing illegal characters and unifying punctuation, and the cleaned text is then segmented into words.
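The cleaning described above might be sketched as follows, using only the Python standard library. The punctuation table, the definition of "illegal characters" as Unicode control and format characters, and the regular-expression tokenizer are all assumptions made for illustration; the embodiment does not specify them, and a real Chinese word segmenter would replace `segment`.

```python
import re
import unicodedata
from typing import List

# Assumed mapping of full-width / typographic punctuation to ASCII forms.
PUNCT_MAP = str.maketrans({
    "\u3002": ".",   # ideographic full stop
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def clean_text(text: str) -> str:
    # NFKC folds full-width letters, digits and most full-width punctuation.
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(PUNCT_MAP)
    # Treat control/format characters as "illegal", keeping newline and tab.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> List[str]:
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


raw = "A fire\u200b broke out\uff0c\x00 4 people died\u3002"
cleaned = clean_text(raw)
tokens = segment(cleaned)
print(cleaned)
print(tokens)
```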
Step S203, dependency syntax analysis.
Dependency syntax analysis and labeling are performed on all sentences of the text to obtain the subject, predicate, and object components of each sentence and, optionally, the attributive, adverbial, and complement components.
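As an illustration of how sentence components could be read off a dependency parse, the sketch below maps labeled arcs to component roles. The arcs are written by hand rather than produced by a parser, and the relation labels (`nsubj`, `obj`, `root`, and so on, in the style of Universal Dependencies) are an assumption, since the embodiment does not name a label set. `Arc`, `ROLE_OF`, and `sentence_components` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Arc(NamedTuple):
    word: str   # the dependent word
    head: int   # index of its head word, or -1 for the root
    rel: str    # dependency relation label


# Assumed mapping from relation labels to sentence components.
ROLE_OF = {
    "root": "predicate",
    "nsubj": "subject",
    "obj": "object",
    "nmod": "attributive",
    "amod": "attributive",
    "advmod": "adverbial",
    "obl": "adverbial",
}


def sentence_components(arcs: List[Arc]) -> Dict[str, List[str]]:
    components: Dict[str, List[str]] = {}
    for arc in arcs:
        role = ROLE_OF.get(arc.rel)
        if role is not None:  # labels outside the mapping are ignored
            components.setdefault(role, []).append(arc.word)
    return components


# Hand-written arcs for "accident caused four deaths".
parsed = [
    Arc("accident", 1, "nsubj"),
    Arc("caused", -1, "root"),
    Arc("four", 3, "nummod"),
    Arc("deaths", 1, "obj"),
]
print(sentence_components(parsed))
```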
Step S204, text supplementation and entity normalization.
According to the knowledge base and the context of the text, the text is supplemented and its entities are normalized according to certain rules, including but not limited to: attributive expansion, subject-predicate-object completion, and adverbial attribution.
For example, for the sentence "the cause of the accident is still under investigation", the attributive is expanded according to the rules. "Accident" is the attributive modifier of "cause", so the phrase can be expanded into "the accident has a cause".
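The attributive expansion in this example can be sketched as a rule over dependency arcs: a noun that modifies another noun yields a "has" relation from the modifier to its head. The arc tuples, the `nmod` label, and the choice of "has" as the relation name are assumptions for illustration, and the arcs are written by hand.

```python
from typing import List, Tuple

Arc = Tuple[str, int, str]      # (word, index of head word, relation label)
Triple = Tuple[str, str, str]   # (entity, relation, entity)

# Assumed label for a noun acting as attributive modifier of another noun.
ATTRIBUTIVE_RELS = {"nmod"}


def expand_attributives(arcs: List[Arc]) -> List[Triple]:
    triples: List[Triple] = []
    for word, head, rel in arcs:
        if rel in ATTRIBUTIVE_RELS and head >= 0:
            head_word = arcs[head][0]
            # "cause of the accident" becomes (accident, has, cause).
            triples.append((word, "has", head_word))
    return triples


# Hand-written arcs for "the cause of the accident is investigated"
# with function words omitted.
arcs = [
    ("cause", 2, "nsubj"),
    ("accident", 0, "nmod"),
    ("investigated", -1, "root"),
]
expanded = expand_attributives(arcs)
print(expanded)
```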
For example, for the sentence "the accident caused 3 injuries and 4 deaths", the subject, predicate, and object can be completed according to the context. If the preceding clause is analyzed as accident (subject) caused (predicate) 3 injuries (object), then the following clause should be completed from the context as accident (subject) caused (predicate) 4 deaths (object).
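One simple way to realize this kind of completion, shown here only as a sketch, is to let each clause inherit a missing subject or predicate from the clause before it. The dictionary layout and the function name `complete_components` are hypothetical, and inheriting from the immediately preceding clause is an assumed rule, not one stated in the embodiment.

```python
from typing import Dict, List

Clause = Dict[str, str]


def complete_components(clauses: List[Clause]) -> List[Clause]:
    completed: List[Clause] = []
    previous: Clause = {}
    for clause in clauses:
        filled = dict(clause)  # copy so the input is not modified
        for role in ("subject", "predicate"):
            # Fill a missing role from the preceding (already completed) clause.
            if not filled.get(role) and previous.get(role):
                filled[role] = previous[role]
        completed.append(filled)
        previous = filled
    return completed


clauses = [
    {"subject": "accident", "predicate": "caused", "object": "3 injuries"},
    {"object": "4 deaths"},  # subject and predicate are elided in the text
]
completed = complete_components(clauses)
print(completed[1])
```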
For example, for the sentence "the Guangyuan Road Squadron of the Weibin Battalion of the Baoji Fire Brigade immediately arrived at the scene to carry out the rescue ……, and the squadron immediately called in the satellite fire station as reinforcement", the entities involved can be normalized according to the context. The later mention "the squadron" should be unified with the full name "Guangyuan Road Squadron of the Weibin Battalion of the Baoji Fire Brigade".
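A minimal sketch of this context-based normalization follows, assuming that a short mention is resolved to the most recent earlier name that ends with it as a whole word. The suffix rule and the function name `resolve_short_mentions` are assumptions; the English renderings of the unit names (here with the head noun last, so that the suffix rule applies) are illustrative only.

```python
from typing import List


def resolve_short_mentions(mentions: List[str]) -> List[str]:
    seen_full: List[str] = []   # names that were not resolved to anything longer
    resolved: List[str] = []
    for mention in mentions:
        suffix = " " + mention.lower()
        # Look for the most recent earlier name ending with this mention.
        full = next(
            (name for name in reversed(seen_full)
             if name.lower().endswith(suffix)),
            mention,
        )
        resolved.append(full)
        if full == mention:
            seen_full.append(mention)
    return resolved


mentions = [
    "Baoji Fire Brigade Weibin Battalion Guangyuan Road Squadron",
    "satellite fire station",
    "Squadron",
]
resolved = resolve_short_mentions(mentions)
print(resolved[2])
```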
For example, for the sentences "A fire occurred in the Zhongsen community. The accident caused 4 deaths.", the entities involved can be unified according to the context and the rules. "Fire" and "accident" refer to the same thing here, so "accident" is replaced with "fire".
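The unification of a generic term with a more specific one can be sketched as replacing each generic word with the most recent specific event word seen earlier in the text. The two word sets passed in are hypothetical inputs; the embodiment does not say how such terms are identified.

```python
from typing import List, Optional, Set


def unify_event_terms(
    sentences: List[List[str]],
    generic: Set[str],
    specific: Set[str],
) -> List[List[str]]:
    last_specific: Optional[str] = None
    unified: List[List[str]] = []
    for tokens in sentences:
        new_tokens: List[str] = []
        for tok in tokens:
            if tok in specific:
                last_specific = tok          # remember the concrete event word
            if tok in generic and last_specific is not None:
                tok = last_specific          # e.g. "accident" becomes "fire"
            new_tokens.append(tok)
        unified.append(new_tokens)
    return unified


sentences = [
    ["fire", "occurred", "in", "community"],
    ["accident", "caused", "4", "deaths"],
]
unified = unify_event_terms(sentences, generic={"accident"}, specific={"fire"})
print(unified[1])
```

A generic term that appears before any specific term is left unchanged, since there is nothing yet to unify it with.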
For example, for the sentence "In the early morning of March 5, a fire occurred in the Zhongsen community", the adverbial can be attributed according to the rules. "In the early morning of March 5" is a temporal adverbial and should be treated as an attribute of the predicate "occurred".
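Adverbial attribution might be sketched as attaching each adverbial phrase to the predicate as a typed attribute. The keyword pattern used to recognize temporal phrases is a deliberately crude assumption (for instance, it would also match the modal verb "may"), and the names `TIME_PATTERN` and `attach_adverbials` are hypothetical.

```python
import re
from typing import Dict, List

# Crude, assumed pattern for temporal phrases (English day parts and months).
TIME_PATTERN = re.compile(
    r"\b(morning|afternoon|evening|night|january|february|march|april|may|"
    r"june|july|august|september|october|november|december)\b",
    re.IGNORECASE,
)


def attach_adverbials(predicate: str, adverbials: List[str]) -> Dict[str, object]:
    attributes: Dict[str, List[str]] = {}
    for phrase in adverbials:
        kind = "time" if TIME_PATTERN.search(phrase) else "other"
        attributes.setdefault(kind, []).append(phrase)
    # The adverbials become attributes of the predicate, not separate entities.
    return {"predicate": predicate, "attributes": attributes}


event = attach_adverbials("occurred", ["in the early morning of March 5"])
print(event)
```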
The knowledge base mainly provides world knowledge, for example for unifying an abbreviated or variant mention of an entity with its full name, such as "Beijing University". The context-dependent rules include, but are not limited to, the methods and rules mentioned in the examples above.
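In its simplest form, the knowledge-base lookup can be pictured as an alias table that maps variant mentions to one canonical entity name. The table entries below are hypothetical examples, and a real knowledge base would be far richer than a dictionary.

```python
from typing import Dict

# Hypothetical alias table: lower-cased variant mention -> canonical name.
KNOWLEDGE_BASE: Dict[str, str] = {
    "beida": "Beijing University",
    "peking university": "Beijing University",
}


def normalize_entity(mention: str, aliases: Dict[str, str]) -> str:
    key = mention.strip().lower()
    # Unknown mentions are returned unchanged (apart from surrounding spaces).
    return aliases.get(key, mention.strip())


print(normalize_entity(" Beida ", KNOWLEDGE_BASE))
print(normalize_entity("Zhongsen community", KNOWLEDGE_BASE))
```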
Step S205, archive integration.
The extracted events are compared for similarity with the events in the database, mainly on the basis of time, place, and what occurred. If the similarity of the most similar stored event exceeds a given threshold, the two are considered to be the same event. If the same event exists, the events are merged; otherwise, the extracted event is stored in the database as a new event.
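The comparison and merging logic can be sketched as below. The similarity measure (the mean `difflib.SequenceMatcher` ratio over the time, place, and occurrence fields), the threshold of 0.8, the in-memory list standing in for the database, and the `sources` field used to show merging are all assumptions; the embodiment does not fix any of them.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Event = Dict[str, object]
FIELDS = ("time", "place", "what")   # compared fields (assumed names)


def similarity(a: Event, b: Event) -> float:
    # Mean string similarity over the compared fields, in [0, 1].
    ratios = [SequenceMatcher(None, str(a[f]), str(b[f])).ratio() for f in FIELDS]
    return sum(ratios) / len(FIELDS)


def integrate(event: Event, database: List[Event], threshold: float = 0.8) -> Event:
    # Find the stored event most similar to the new one, if any.
    best = max(database, key=lambda e: similarity(event, e), default=None)
    if best is not None and similarity(event, best) > threshold:
        # Same event: merge by pooling the source lists.
        best.setdefault("sources", []).extend(event.get("sources", []))
        return best
    database.append(event)   # otherwise store it as a new event
    return event


db: List[Event] = [
    {"time": "2019-03-05", "place": "Zhongsen community", "what": "fire",
     "sources": ["report A"]},
]
same = {"time": "2019-03-05", "place": "Zhongsen community", "what": "fire",
        "sources": ["report B"]}
other = {"time": "2018-07-01", "place": "Harbor warehouse", "what": "flood",
         "sources": ["report C"]}
integrate(same, db)
integrate(other, db)
print(len(db), db[0]["sources"])
```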
Step S206, visual display.
According to the syntactic structure, a graph of the whole event is constructed in entity-relation-entity form and visualized. If the event already exists in the database, either separate display or merged display can be selected.
For example, consider the text "In the early morning of March 5, a fire occurred in the Forest community in a certain city. The accident caused 3 injuries and 4 deaths. The cause of the accident is still under investigation." The event graph of this text is shown in FIG. 3.
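To illustrate the entity-relation-entity form, the sketch below turns completed events into triples and serializes them as Graphviz DOT text that a graph viewer could render. It does not reproduce FIG. 3; the event dictionaries, the encoding of adverbial attributes as edges from the predicate, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def events_to_triples(events: List[Dict[str, object]]) -> List[Triple]:
    triples: List[Triple] = []
    for e in events:
        triples.append((str(e["subject"]), str(e["predicate"]), str(e["object"])))
        # One possible encoding: attributes hang off the predicate.
        for name, value in dict(e.get("attributes", {})).items():
            triples.append((str(e["predicate"]), name, str(value)))
    return triples


def to_dot(triples: List[Triple]) -> str:
    lines = ["digraph event {"]
    for head, rel, tail in triples:
        lines.append(f'  "{head}" -> "{tail}" [label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)


events = [
    {"subject": "fire", "predicate": "occurred in", "object": "community",
     "attributes": {"time": "early morning of March 5"}},
    {"subject": "fire", "predicate": "caused", "object": "3 injuries"},
    {"subject": "fire", "predicate": "caused", "object": "4 deaths"},
    {"subject": "fire", "predicate": "has", "object": "cause"},
]
triples = events_to_triples(events)
dot = to_dot(triples)
print(dot)
```

The resulting text can be saved to a `.dot` file and rendered with any Graphviz-compatible tool.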
Compared with conventional methods, event extraction in this embodiment does not require predefined patterns and is therefore highly general. The method does not need a large-scale corpus for training and can extract the information in the text completely.
In the above embodiment of the present invention, the text data is analyzed from its syntactic structure without setting predefined patterns. Completing the text and normalizing the entities makes effective use of the information in the text, so that event extraction is more reasonable and effective. Using historical event information makes it possible to follow the development of the whole event more comprehensively.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a text event extraction device is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 4 is a block diagram of the structure of a text event extraction device according to an embodiment of the present invention. As shown in FIG. 4, the device includes a word segmentation module 10, a syntactic analysis module 20, a supplement module 30, and a construction module 40.
The word segmentation module 10 is used for cleaning and segmenting the text to be processed.
The syntactic analysis module 20 is used for performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text.
The supplement module 30 is used for supplementing sentence components and normalizing entities according to the knowledge base and the context of the text, so as to extract the events in the text.
The construction module 40 is used for constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
FIG. 5 is a block diagram of the structure of a text event extraction device according to an alternative embodiment of the present invention. As shown in FIG. 5, in addition to all the modules shown in FIG. 4, the device includes a comparison module 50 and a display module 60.
The comparison module 50 is used for comparing the similarity between the extracted events and the events in the database, and determining that two events are the same event when the similarity exceeds a set threshold; a merging module is used for merging events that are the same and storing events that differ in the database as new events.
The display module 60 is used for visually displaying the graph of the events.
It should be noted that the above modules may be implemented in software or hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device, and may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein; the modules or steps may also be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A text event extraction method, characterized by comprising:
cleaning and segmenting the text to be processed into words;
performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text;
supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract events in the text;
constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
2. The method of claim 1, further comprising, before constructing the graph of the events in entity-relation-entity form according to the syntactic structure:
comparing the similarity between the extracted events and the events in a database, and determining that two events are the same event when the similarity exceeds a set threshold;
merging events that are the same, and storing events that differ in the database as new events.
3. The method of claim 1, further comprising, after constructing the graph of the events in the text in entity-relation-entity form according to the syntactic structure:
visually displaying the graph of the events.
4. The method of claim 2, further comprising:
if an event already exists in the database, displaying the graph of the event either on its own or after merging it with the existing event in the database.
5. The method of claim 1, further comprising, before cleaning and segmenting the text to be processed:
acquiring the text to be processed.
6. A text event extraction device, comprising:
a word segmentation module, used for cleaning and segmenting the text to be processed;
a syntactic analysis module, used for performing dependency syntax analysis on the segmented text to obtain the sentence components of each sentence in the text;
a supplement module, used for supplementing sentence components and normalizing entities according to a knowledge base and the context of the text, so as to extract events in the text;
a construction module, used for constructing a graph of the events in entity-relation-entity form according to the syntactic structure.
7. The apparatus of claim 6, further comprising:
a comparison module, used for comparing the similarity between the extracted events and the events in a database, and determining that two events are the same event when the similarity exceeds a set threshold;
a merging module, used for merging events that are the same and storing events that differ in the database as new events.
8. The apparatus of claim 6, further comprising:
a display module, used for visually displaying the graph of the events.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 5 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910959652.0A CN110727803A (en) | 2019-10-10 | 2019-10-10 | Text event extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110727803A (en) | 2020-01-24 |
Family
ID=69219988
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910959652.0A (Pending) CN110727803A (en) | 2019-10-10 | 2019-10-10 | Text event extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110727803A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100324A (en) * | 2020-08-28 | 2020-12-18 | 广州探迹科技有限公司 | Knowledge graph automatic check iteration method based on greedy entity link |
CN113190674A (en) * | 2021-05-08 | 2021-07-30 | 上海明略人工智能(集团)有限公司 | Method, device, electronic equipment and readable storage medium for generating event context |
CN114064937A (en) * | 2022-01-14 | 2022-02-18 | 云孚科技(北京)有限公司 | Method and system for automatically constructing case map |
CN114880491A (en) * | 2022-07-08 | 2022-08-09 | 云孚科技(北京)有限公司 | Method and system for automatically constructing case map |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298635A (en) * | 2011-09-13 | 2011-12-28 | 苏州大学 | Method and system for fusing event information |
CN109947897A (en) * | 2019-03-15 | 2019-06-28 | 南京邮电大学 | Judicial case event tree constructs system and method |
CN110110870A (en) * | 2019-06-05 | 2019-08-09 | 厦门邑通软件科技有限公司 | A kind of equipment fault intelligent control method based on event graphical spectrum technology |
CN110188191A (en) * | 2019-04-08 | 2019-08-30 | 北京邮电大学 | A kind of entity relationship map construction method and system for Web Community's text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110727803A (en) | Text event extraction method and device | |
CN106649742B (en) | Database maintenance method and device | |
CN110874531B (en) | Topic analysis method and device and storage medium | |
CN108509477B (en) | Method for recognizing semantics, electronic device and computer readable storage medium | |
US10503830B2 (en) | Natural language processing with adaptable rules based on user inputs | |
US10824816B2 (en) | Semantic parsing method and apparatus | |
CN112417846B (en) | Text automatic generation method and device, electronic equipment and storage medium | |
US11556812B2 (en) | Method and device for acquiring data model in knowledge graph, and medium | |
KR20220064016A (en) | Method for extracting construction safety accident based data mining using big data | |
CN110825839B (en) | Association relation analysis method for targets in text information | |
CN111079408B (en) | Language identification method, device, equipment and storage medium | |
CN110765235A (en) | Training data generation method and device, terminal and readable medium | |
CN111125355A (en) | Information processing method and related equipment | |
CN111428503A (en) | Method and device for identifying and processing same-name person | |
CN111966792A (en) | Text processing method and device, electronic equipment and readable storage medium | |
CN111177401A (en) | Power grid free text knowledge extraction method | |
US20200401767A1 (en) | Summary evaluation device, method, program, and storage medium | |
KR102206742B1 (en) | Method and apparatus for representing lexical knowledge graph from natural language text | |
CN109300550B (en) | Medical data relation mining method and device | |
CN114969385B (en) | Knowledge graph optimization method and device based on document attribute assignment entity weight | |
Ishii et al. | Causal network construction to support understanding of news | |
CN115481239A (en) | Social governance document abstract extraction method and device and electronic equipment | |
CN112148838B (en) | Service source object extraction method and device | |
US20120144294A1 (en) | Assisting document creation | |
CN114238654A (en) | Knowledge graph construction method and device and computer readable storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200124 |