CN117609518A - Hierarchical Chinese entity relation extraction method and system for centering structure - Google Patents


Publication number
CN117609518A
Authority
CN
China
Prior art keywords
entity
centering structure
sentence
level
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410065908.4A
Other languages
Chinese (zh)
Other versions
CN117609518B (en)
Inventor
甘丽新
涂伟
陈敏
曹瑛
毕文霞
饶志华
刘伟凯
刘斌
程琳
陈英玮
李蔚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Science and Technology Normal University
Original Assignee
Jiangxi Science and Technology Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Science and Technology Normal University
Priority to CN202410065908.4A
Publication of CN117609518A
Application granted
Publication of CN117609518B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a hierarchical Chinese entity relation extraction method and system for centering (attributive-head) structures. The method comprises: obtaining a plurality of pieces of humanities data from a target platform; performing entity recognition on each sentence contained in the preprocessed humanities data; performing centering structure recognition on each sentence in the resulting data set; if an entity exists in a centering structure, performing part-of-speech tagging on the nouns in that centering structure; replacing the centering structure in the sentence with its first-level entity to reconstruct a new sentence; performing feature extraction on the new sentence; and inputting the feature extraction results into a support vector machine for relation extraction, thereby obtaining implicit entity relations. The invention addresses the low extraction accuracy and poor extraction results of the prior art when extracting entity relations from complex long sentences.

Description

Hierarchical Chinese entity relation extraction method and system for centering structure
Technical Field
The invention relates to the technical field of natural language processing, in particular to a hierarchical Chinese entity relation extraction method and system for a centering structure.
Background
Entity relation extraction is an important research direction in natural language processing. In the information age, the generation and accumulation of large amounts of text data poses a major challenge, and extracting useful information from this data is an urgent problem. Entity relation extraction helps us better understand and use text data and supports a variety of application scenarios. With the continuing development of artificial intelligence, it has also become one of the basic technologies needed for tasks such as constructing knowledge graphs and developing intelligent question-answering systems. Analyzing and mining the relations among entities gives a better understanding of how things in the world are connected, which in turn advances artificial intelligence technology. Research on entity relation extraction is therefore significant: it promotes the development of natural language processing, supports other fields (such as tourism, medicine, finance and law), and has broad influence across many application scenarios.
In the big data era, a large amount of humanities information about Chinese tourist attractions, as well as news reports, has appeared on the Internet. Text in these fields tends to summarize comprehensively the relations between the people or organizations involved and the scenic spots, so it consists mostly of complex long sentences. Long Chinese sentences contain many modifiers, clauses and other complex structures, which make the text harder for a model to understand. Furthermore, a long sentence may contain multiple entities and relations whose positions and order vary, so the model needs more inference ability to identify them correctly. Long sentences therefore pose challenges for Chinese entity relation extraction and affect the performance and efficiency of entity relation extraction models, including their performance in specific domains.
However, existing Chinese entity relation extraction methods include no research specifically on long sentences. They combine all entities in a sentence pairwise to generate entity pairs, extract features for each pair from the whole sentence, and finally perform relation extraction using those features.
As a result, existing relation extraction methods feed many pseudo entity pairs into the relation classifier. In addition, the feature extraction stage ignores the fact that long sentences contain many details and modifiers: the lexical, syntactic and semantic features of each entity pair are extracted directly from the whole sentence, which easily picks up noise, and the same kinds of features are used for every entity pair, so they are not discriminative. This reduces the accuracy and performance of entity relation extraction and increases computing resource and time overhead.
Disclosure of Invention
Based on the above, the invention aims to provide a hierarchical Chinese entity relation extraction method and system for a centering structure, so as to solve the problems of lower extraction accuracy and poorer extraction effect in the prior art when extracting entity relations of complex long sentences.
In a first aspect, the present invention provides a hierarchical Chinese entity relation extraction method for a centering structure, applied to a natural language processing platform, where the method includes:
acquiring a plurality of pieces of humanities data from a target platform, and preprocessing the humanities data, wherein the preprocessing comprises word segmentation processing and part-of-speech tagging processing;
performing entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and screening out sentences with at least two entities to form a data set;
performing centering structure identification on each sentence in the data set, so that the part of the sentence that has a head-to-tail connection relationship and contains a preset word is taken as a centering structure, and judging whether an entity exists in the centering structure;
if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the part-of-speech tagging result comprises modifiers and modified words, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result;
replacing the centering structure in the sentence with the first-level entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the second-level entity;
and inputting the first-level and second-level feature extraction results into a support vector machine for relation extraction to obtain an implicit entity relation.
In summary, the above hierarchical Chinese entity relation extraction method for centering structures mainly addresses Chinese entity relation extraction for long sentences containing centering structures marked by the particle "的". Based on the sentence pattern and grammatical structure of the centering structure, each "的" centering structure fragment in a sentence is extracted as a whole, and entities are classified into levels according to their structural role in the sentence. Different strategies are used to generate entity pairs depending on the level of the entities, which optimizes both the number and the quality of the entity pairs fed into relation extraction. Moreover, entities at different levels differ greatly in their sensitivity to certain features, especially syntactic features, so a different feature extraction method is provided for each level and relation extraction is performed hierarchically. In first-level relation extraction in particular, the idea of "removing branches and leaves and keeping the trunk" is adopted: long sentences are reorganized and simplified, reducing sentence complexity so that more effective feature information is captured during feature extraction. The invention further provides an implicit relation reasoning rule that performs global reasoning over entities of different levels to obtain richer implicit relations.
In a preferred embodiment of the present invention, the step of, if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the part-of-speech tagging result comprises modifiers and modified words, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result includes:
defining an entity tagged as a modified word as a primary entity, and defining an entity tagged as a modifier as a secondary entity.
In a preferred embodiment of the present invention, the step of replacing the centering structure in the sentence with the first-level entity to reconstruct the new sentence includes:
deleting, from the fragment contained in each centering structure with the preset word in the sentence, all content other than the first-level entity, and inserting the retained first-level entity at the position of the centering structure in the sentence to obtain a new sentence.
In a preferred embodiment of the present invention, the step of performing first-level feature extraction on the new sentence includes:
combining all first-level entities in the reorganized sentence pairwise to generate entity pairs, and extracting lexical and syntactic features for each entity pair, wherein the extracted features comprise the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination and the nearest syntactically dependent verb.
In a preferred embodiment of the present invention, the step of performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity includes:
if one verb exists in the part of the fragment preceding the preset word, defining that verb as the verb feature of the entity pair;
if at least two verbs exist in the part of the fragment preceding the preset word, selecting the verb separated from the secondary entity by the fewest words as the verb feature.
In a preferred embodiment of the present invention, the step of inputting the first-level and second-level feature extraction results into the support vector machine for relation extraction to obtain the implicit entity relation includes:
for the primary entity $e_i$ in a centering structure fragment, acquiring every primary entity $e_j$ found in first-level relation extraction to have a relation with $e_i$, together with the relation type $r_k$;
for each secondary entity $e_{it}$ in the centering structure fragment, forming the entity pair $\langle e_{it}, e_j \rangle$ from the secondary entity $e_{it}$ and the primary entity $e_j$; if the relation of the entity pair $\langle e_i, e_j \rangle$ of the primary entity $e_i$ is $r_k$, inferring that the relation of $\langle e_{it}, e_j \rangle$ is also $r_k$.
In a preferred embodiment of the present invention, the step of acquiring a plurality of pieces of humanities data from a target platform and preprocessing the humanities data, wherein the preprocessing comprises word segmentation processing and part-of-speech tagging processing, includes:
performing word segmentation using the LTP-Cloud platform, then matching the segmented file against a person-name dictionary and a scenic-spot dictionary, and re-joining any dictionary word that was split during segmentation;
and performing part-of-speech tagging on the data file after word segmentation is complete.
In a second aspect, the present invention provides a hierarchical Chinese entity relation extraction system for a centering structure, applied to a natural language processing platform, the system comprising:
a humanities data acquisition module, used for acquiring a plurality of pieces of humanities data from a target platform and preprocessing the humanities data, wherein the preprocessing comprises word segmentation processing and part-of-speech tagging processing;
a data set construction module, used for performing entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and screening out sentences with at least two entities to form a data set;
a centering structure identification module, used for performing centering structure identification on each sentence in the data set, so that the part of the sentence that has a head-to-tail connection relationship and contains the preset word is taken as a centering structure, and judging whether an entity exists in the centering structure;
a part-of-speech tagging module, used for performing part-of-speech tagging on the nouns in the centering structure if an entity exists in the centering structure, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result, wherein the part-of-speech tagging result comprises modifiers and modified words;
a feature extraction module, used for replacing the centering structure in a sentence with the first-level entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the second-level entity;
and a relation extraction module, used for inputting the first-level and second-level feature extraction results into the support vector machine for relation extraction to obtain an implicit entity relation.
In a third aspect, the present invention provides a storage medium storing one or more programs that, when executed, implement the hierarchical Chinese entity relation extraction method for centering structures as described above.
Another aspect of the invention also provides a computer device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is used for realizing the hierarchical Chinese entity relation extraction method facing the centering structure when executing the computer program stored in the memory.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flowchart of a hierarchical Chinese entity relationship extraction method for centering structures according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of implicit relationship reasoning in a first embodiment of the invention;
FIG. 3 is a schematic structural diagram of a hierarchical Chinese entity relationship extraction system for centering structures according to a second embodiment of the present invention.
The invention will be further described in the following detailed description in conjunction with the above-described figures.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Referring to FIG. 1, a flowchart of a hierarchical Chinese entity relation extraction method for centering structures in a first embodiment of the present invention is shown. The method is applied to a natural language processing platform and includes steps S01 to S06, wherein:
Step S01: acquiring a plurality of pieces of humanities data from a target platform, and preprocessing the humanities data, wherein the preprocessing comprises word segmentation processing and part-of-speech tagging processing;
Illustratively, humanities data about Lushan is automatically crawled from the official Lushan website using the Scrapy toolkit in Python. To improve the accuracy of subsequent word segmentation, a dictionary of common person names is downloaded from the Sogou lexicon. In addition, the scenic spots related to Lushan are downloaded from the Mafengwo travel website to serve as a scenic-spot dictionary.
For example, a complex example sentence included in the humanities data is: "The International Seminar on Sustainable Tourism Development and Management at World Heritage Sites, sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the State Administration of Cultural Heritage, was grandly held at Lushan."
In addition, data preprocessing includes word segmentation, part-of-speech tagging, dependency syntax analysis and entity recognition. First, word segmentation is performed using the LTP-Cloud platform of Harbin Institute of Technology; the segmented file is then matched against the common person-name and scenic-spot dictionaries, and any dictionary word that was split is re-joined. Part-of-speech tagging is then performed on the segmented data file.
Illustratively, the word segmentation and part-of-speech tagging results of the example sentence are as follows:
(UNESCO World Heritage Centre ,/wp Chinese National Commission for UNESCO/n ,/wp Ministry of Construction/ni ,/wp State Administration of Cultural Heritage/n sponsored/v 的/u International Seminar on Sustainable Tourism Development and Management at World Heritage Sites/n at/p Lushan/ns grandly/a held/v ./wp).
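The dictionary-matching step of S01 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `merge_split_words`, the token list and the one-entry dictionary are hypothetical, and a real pipeline would take its tokens from the LTP-Cloud segmenter.

```python
def merge_split_words(tokens, dictionary):
    """Re-join adjacent tokens whose concatenation is a dictionary entry.

    Hypothetical helper: greedy longest match over consecutive tokens.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    merged, i = [], 0
    while i < len(tokens):
        candidate, best_end = tokens[i], i + 1
        joined, j = tokens[i], i + 1
        # Extend the current token with its successors while the result
        # could still be a dictionary entry; remember the longest hit.
        while j < len(tokens) and len(joined) + len(tokens[j]) <= max_len:
            joined += tokens[j]
            j += 1
            if joined in dictionary:
                candidate, best_end = joined, j
        merged.append(candidate)
        i = best_end
    return merged


# Assumed segmenter output in which a scenic-spot name was split.
tokens = ["庐山", "宾", "馆", "位于", "庐山"]
scenic_spot_dictionary = {"庐山宾馆"}
print(merge_split_words(tokens, scenic_spot_dictionary))
```

Greedy longest match is only one possible design choice; the source does not state how overlapping dictionary entries are resolved.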
Step S02: performing entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and screening out sentences with at least two entities to form a data set;
It should be noted that complex long sentences in practice are diverse and may concern tourism, business and so on. It is therefore necessary first to determine the domain of the data set being processed, then determine the entity types that may be involved, and then perform entity recognition on the sentences according to those entity types.
For example, the example sentence belongs to the tourism news domain, which mainly involves entity types such as scenic spots, places, people, organizations, works and activities. The LTP-Cloud platform is used to perform entity recognition on each sentence according to the determined entity types; sentences with two or more entities form the data set for entity relation extraction, and all entities are initialized as primary entities.
Illustratively, entity types (3): organization, location and activity.
Primary entities (6): UNESCO World Heritage Centre (organization), Chinese National Commission for UNESCO (organization), Ministry of Construction (organization), State Administration of Cultural Heritage (organization), International Seminar on Sustainable Tourism Development and Management at World Heritage Sites (activity) and Lushan (location).
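The screening rule of step S02 (keep sentences with at least two entities and initialize every entity as a primary entity) might look like the sketch below. The record layout (`text`, `entities`, `level`) and the function name `build_dataset` are assumptions made for illustration; entity recognition itself is taken as already done.

```python
def build_dataset(sentences):
    """Keep sentences with two or more entities; mark all entities as level 1.

    `sentences` is an assumed list of dicts with "text" and "entities" keys,
    where each entity is a dict with "name" and "type".
    """
    dataset = []
    for sentence in sentences:
        if len(sentence["entities"]) >= 2:
            # Every entity starts as a primary (first-level) entity; later
            # steps may demote modifiers to second-level entities.
            entities = [dict(entity, level=1) for entity in sentence["entities"]]
            dataset.append({"text": sentence["text"], "entities": entities})
    return dataset


sentences = [
    {"text": "s1", "entities": [{"name": "Lushan", "type": "location"}]},
    {"text": "s2", "entities": [{"name": "seminar", "type": "activity"},
                                {"name": "Lushan", "type": "location"}]},
]
dataset = build_dataset(sentences)
print([record["text"] for record in dataset])
```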
Step S03: performing centering structure identification on each sentence in the data set, so that the part of the sentence that has a head-to-tail connection relationship and contains a preset word is taken as a centering structure, and judging whether an entity exists in the centering structure;
In this embodiment, the preset word is "的", and the method mainly identifies centering structures containing "的". For each sentence in the data set, it is first judged whether the sentence contains "的". If so, the LTP-Cloud platform of Harbin Institute of Technology is used for centering structure identification: dependency syntax analysis finds the part with a head-to-tail connection relationship as the centering structure, the position in the whole sentence of the fragment contained in each centering structure is recorded, and the fragment contained in each "的" centering structure is stored as a whole.
Illustratively, the centering structure recognition result of the example sentence is: "the International Seminar on Sustainable Tourism Development and Management at World Heritage Sites sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the State Administration of Cultural Heritage".
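One way to read "the part with a head-to-tail connection relationship" is as a span of a dependency tree: the modifier subtree that governs "的" plus the noun it attaches to. The sketch below works on a hand-written parse and is only an assumption about how such a span could be located. The parse, the function name and the use of the `RAD` label for "的" are illustrative and not taken from the source.

```python
def find_centering_structures(parse, particle="的"):
    """Return (start, end) token spans, 1-based inclusive, of "的" structures.

    `parse` is an assumed list of (word, head, relation) triples, where
    `head` is the 1-based index of the governing token and 0 is the root.
    """
    children = {}
    for index, (_, head, _) in enumerate(parse, start=1):
        children.setdefault(head, []).append(index)

    def leftmost(node):
        # Smallest token index in the subtree rooted at `node`.
        stack, low = [node], node
        while stack:
            current = stack.pop()
            low = min(low, current)
            stack.extend(children.get(current, []))
        return low

    spans = []
    for index, (word, head, relation) in enumerate(parse, start=1):
        if word != particle or relation != "RAD" or head == 0:
            continue
        _, noun, modifier_relation = parse[head - 1]
        # The word governing the particle must modify a later noun via ATT.
        if modifier_relation == "ATT" and noun > index:
            spans.append((leftmost(head), noun))
    return spans


# Hand-written, hypothetical parse of "导游 张三 带 的 老年人 旅游团 访问 庐山".
parse = [
    ("导游", 2, "ATT"), ("张三", 3, "SBV"), ("带", 6, "ATT"), ("的", 3, "RAD"),
    ("老年人", 6, "ATT"), ("旅游团", 7, "SBV"), ("访问", 0, "HED"), ("庐山", 7, "VOB"),
]
print(find_centering_structures(parse))
```

The span runs from the first token of the modifier subtree to the modified noun, so the recorded position can later be used to replace the whole fragment.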
Step S04: if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the part-of-speech tagging result comprises modifiers and modified words, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result;
In this step, for each extracted fragment contained in a "的" centering structure, it is first judged whether the fragment contains an entity. If not, the fragment is discarded directly. If so, part-of-speech tagging is used to determine which nouns are modifiers and which are modified words, and the modifiers and modified words are extracted. The entities are then counted and classified according to their number: an entity tagged as a modified word is defined as a first-level entity, and an entity tagged as a modifier is defined as a second-level entity.
Specifically, if the fragment contains only one entity and that entity is the modified word, it is the central entity and is a first-level entity in the whole sentence;
if the fragment contains two or more entities, the first-level entity is extracted, and the entities serving as modifiers are changed to second-level entities.
Illustratively, the entity classification results are:
Entities (5): International Seminar on Sustainable Tourism Development and Management at World Heritage Sites (first-level entity), UNESCO World Heritage Centre (second-level entity), Chinese National Commission for UNESCO (second-level entity), Ministry of Construction (second-level entity) and State Administration of Cultural Heritage (second-level entity).
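The classification rule of step S04 can be written compactly. The sketch below assumes the modifier and modified-word roles have already been determined upstream; the role labels `"modified"` and `"modifier"` and the function name are hypothetical.

```python
def classify_fragment_entities(tagged_entities):
    """Split the entities of one fragment into first- and second-level.

    `tagged_entities` is an assumed list of (entity, role) pairs, where role
    is "modified" (head of the structure) or "modifier".
    Returns None when the fragment holds no entity, meaning it is discarded.
    """
    if not tagged_entities:
        return None
    levels = {"first_level": [], "second_level": []}
    for entity, role in tagged_entities:
        key = "first_level" if role == "modified" else "second_level"
        levels[key].append(entity)
    return levels


fragment = [
    ("UNESCO World Heritage Centre", "modifier"),
    ("Ministry of Construction", "modifier"),
    ("International Seminar", "modified"),
]
print(classify_fragment_entities(fragment))
```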
Step S05: replacing the centering structure in the sentence with the first-level entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the second-level entity;
It should be noted that sentence reorganization follows the idea of removing branches and leaves and keeping the trunk: for the fragment contained in each "的" centering structure in the sentence, the other modifying components are removed so that only the primary entity is retained, and the primary entity is then inserted at the corresponding position in the original sentence to form a new sentence. The complex long sentence is thereby simplified, which reduces noise and redundant information, lowers the complexity of the grammatical structure, and facilitates the subsequent first-level feature extraction.
Illustratively, the reorganized sentence is: "The International Seminar on Sustainable Tourism Development and Management at World Heritage Sites was grandly held at Lushan."
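The reorganization step can be sketched as a span replacement over the token list. This is an illustration only: the function name, the 0-based inclusive span convention and the short Chinese sentence are assumptions, not taken from the source.

```python
def rebuild_sentence(tokens, structures):
    """Replace each centering-structure span with its first-level entity.

    `structures` is an assumed list of (start, end, primary_entity) triples
    with 0-based inclusive token indices; spans must not overlap.
    """
    by_start = {start: (end, primary) for start, end, primary in structures}
    rebuilt, i = [], 0
    while i < len(tokens):
        if i in by_start:
            end, primary = by_start[i]
            rebuilt.append(primary)  # keep the trunk, drop the modifiers
            i = end + 1
        else:
            rebuilt.append(tokens[i])
            i += 1
    return rebuilt


# Hypothetical sentence: "the senior tour group led by guide Zhang San visits Lushan".
tokens = ["导游", "张三", "带", "的", "老年人", "旅游团", "访问", "庐山"]
structures = [(0, 5, "老年人旅游团")]
print(rebuild_sentence(tokens, structures))
```

Working on token indices rather than raw strings avoids accidental replacement when the same text occurs more than once in a sentence.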
Further, in first-level feature extraction, all first-level entities in the reorganized sentence are first combined pairwise to generate entity pairs. For each entity pair, lexical and syntactic features are extracted, specifically: the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination and the nearest syntactically dependent verb.
For example, a first-level entity pair is generated from the first-level entities in the reorganized sentence, and lexical and syntactic features are then extracted for it. The first-level entity pair is: <International Seminar on Sustainable Tourism Development and Management at World Heritage Sites, Lushan>.
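Pairwise generation and the purely lexical features can be sketched as below. The two dependency-based features (dependency relation combination and nearest dependent verb) are left out because they need a parser. The function name, the feature keys and the assumption that each entity appears once as a single token are all hypothetical.

```python
from itertools import combinations


def first_level_pairs(tokens, entities):
    """Generate first-level entity pairs with simple lexical features.

    `entities` is an assumed dict mapping an entity token to its type.
    Each entity is assumed to occur exactly once in `tokens`.
    """
    positions = {token: i for i, token in enumerate(tokens) if token in entities}
    ordered = sorted(positions, key=positions.get)
    pairs = []
    for first, second in combinations(ordered, 2):
        start, end = positions[first], positions[second]
        pairs.append({
            "pair": (first, second),
            "type_combination": (entities[first], entities[second]),
            "distance": end - start - 1,        # tokens between the entities
            "context": tokens[start + 1:end],   # words between the entities
        })
    return pairs


tokens = ["seminar", "at", "Lushan", "grandly", "held"]
entities = {"seminar": "activity", "Lushan": "location"}
print(first_level_pairs(tokens, entities))
```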
Second-level extraction is performed mainly within the fragments of "的" centering structures that contain secondary entities. Each "的" centering structure fragment in a sentence has exactly one primary entity and one or more secondary entities. Each secondary entity in the fragment is therefore paired only with the primary entity of that fragment; entity pairs formed from a secondary entity and other entities elsewhere in the sentence are not considered, which reduces relation extraction on pseudo entity pairs. The fragments contained in centering structures are generally short, their syntactic structure is relatively simple, and they usually contain one or more verbs; the type of a second-level entity relation is mainly determined by the verbs in the fragment. The verbs in the fragment are therefore extracted as key features for entity relation extraction. The verb feature is extracted as follows:
if one verb exists in the segment preceding "的", that verb is used as the verb feature of the entity pair;
if several verbs exist in the segment preceding "的", the nearest verb following the secondary entity is selected as the verb feature.
Illustratively, the 4 secondary entities in "the International Seminar on Sustainable Tourism Development and Management at World Heritage Sites sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the State Administration of Cultural Heritage" are each paired with the primary entity to generate second-level entity pairs: <UNESCO World Heritage Centre, International Seminar on Sustainable Tourism Development and Management at World Heritage Sites>, <Chinese National Commission for UNESCO, International Seminar on Sustainable Tourism Development and Management at World Heritage Sites>, <Ministry of Construction, International Seminar on Sustainable Tourism Development and Management at World Heritage Sites>, <State Administration of Cultural Heritage, International Seminar on Sustainable Tourism Development and Management at World Heritage Sites>. The verb feature is: sponsor.
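The verb-feature rule can be sketched as follows. The sketch assumes the fragment is already tagged, that verbs carry the tag `v`, and that "nearest" means the smallest token distance to the secondary entity, with a following verb preferred on a tie. The function name and the sample fragment are hypothetical.

```python
def select_verb_feature(tagged, entity_index, particle="的"):
    """Pick the verb feature for one secondary entity in a fragment.

    `tagged` is an assumed list of (word, pos) pairs for the fragment and
    `entity_index` is the position of the secondary entity in that list.
    """
    words = [word for word, _ in tagged]
    if particle not in words:
        return None
    cut = words.index(particle)
    # Only verbs in the segment before the particle are candidates.
    verbs = [i for i, (_, pos) in enumerate(tagged[:cut]) if pos == "v"]
    if not verbs:
        return None
    # Closest verb wins; on a tie, the verb after the entity is preferred.
    best = min(verbs, key=lambda i: (abs(i - entity_index), i < entity_index))
    return words[best]


# Hypothetical fragment: "甲 主办 、 乙 承办 的 会议".
fragment = [("甲", "ni"), ("主办", "v"), ("、", "wp"), ("乙", "ni"),
            ("承办", "v"), ("的", "u"), ("会议", "n")]
print(select_verb_feature(fragment, 0), select_verb_feature(fragment, 3))
```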
Step S06: inputting the first-level and second-level feature extraction results into a support vector machine for relation extraction to obtain an implicit entity relation.
Specifically, the features extracted for the first-level entity pairs in the previous step are input into a support vector machine (SVM) for relation prediction, and the features extracted for the second-level entity pairs are likewise input into the SVM for relation prediction.
Generally, of two entities in a "的" centering structure, one is a primary entity and the other a secondary entity, standing in a modifier-modified relationship. The secondary entity is a local entity that generally has a relation only with the primary entity in the centering structure and not with other entities in the whole sentence. In some special cases, however, more implicit relations can be extracted by reasoning. Based on differences in the dependency syntactic path between two entities, the invention uses differences in modification scope to perform global reasoning on entity relations, thereby extracting more implicit entity relations. The specific steps are as follows:
Dependency syntax analysis is used to analyze the path structure of the "的" centering structure; the dependency path between the secondary entity and the primary entity is mainly of two types:
1) ATT -> (SBV) -> [COO] (where ATT is the attributive, i.e. centering, relation, SBV is the subject-verb relation and COO is the coordinate relation): forms such as "... with ... as group leader" and "... formed/organized by ...".
2) ATT -> (ADV) -> POB -> (COO) (where ADV is the adverbial relation and POB is the preposition-object relation): forms such as "composed of / led by / dispatched by / brought by ..." and "headed by / represented by / led by ...".
For the above 2 classes of path structure, although the secondary entity modifies the primary entity within the centering structure fragment, in the whole sentence the secondary entity also participates in all relations in which the primary entity takes part, so implicit relations between the secondary entity and other primary entities can be inferred. It should be noted that, in actual extraction, many kinds of relation may hold between two entities, and the relation types can be selected according to the needs of the user, for example: parent-child relations, visit relations, location relations, and so forth.
Specifically, referring to FIG. 2, for a primary entity $e_i$ in a centering structure fragment, first obtain every primary entity $e_j$ that has a relation with $e_i$ in the first-level relation extraction, together with the relation type $r_k$.
For each secondary entity $e_{it}$ in the centering structure fragment, pair $e_{it}$ with $e_j$ to form the entity pair $\langle e_{it}, e_j \rangle$; if the relation of the primary entity pair $\langle e_i, e_j \rangle$ is $r_k$, then it can be inferred that the relation of $\langle e_{it}, e_j \rangle$ is also $r_k$.
For example, take the sentence: "The elderly tour group led by tour guide Zhang San visits Lushan and stays at the Lushan Hotel."
Centering structure: the elderly tour group led by tour guide Zhang San
Primary entities: elderly tour group, Lushan, Lushan Hotel
Secondary entity: Zhang San
First-level entity relation extraction: <elderly tour group, Lushan, visit relation>, <elderly tour group, Lushan Hotel, stay relation>.
Second-level entity relation extraction: <Zhang San, elderly tour group, lead relation>.
Implicit relations: <Zhang San, Lushan, visit relation>, <Zhang San, Lushan Hotel, stay relation>.
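The reasoning rule illustrated by this example can be sketched as a few lines of Python. This is only an illustration under stated assumptions: relations are represented as (head, tail, relation type) triples, the function name is hypothetical, and the check that the dependency path belongs to one of the 2 classes above is assumed to have been done beforehand.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, tail entity, relation type)


def infer_implicit_relations(
    primary: str, secondaries: List[str], first_level: List[Triple]
) -> List[Triple]:
    """Propagate every first-level relation of the primary entity e_i to each
    secondary entity e_it of the same centering structure fragment."""
    inferred: List[Triple] = []
    for head, tail, relation in first_level:
        if head != primary:
            continue  # only relations in which e_i takes part are propagated
        for secondary in secondaries:
            inferred.append((secondary, tail, relation))
    return inferred


first_level = [
    ("elderly tour group", "Lushan", "visit"),
    ("elderly tour group", "Lushan Hotel", "stay"),
]
implicit = infer_implicit_relations("elderly tour group", ["Zhang San"], first_level)
for triple in implicit:
    print(triple)
```

The sketch only follows relations in which the primary entity appears as the first element of the pair, matching the pair form <e_i, e_j> used in the description.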
Existing Chinese entity relation extraction usually works on the whole sentence and combines all entities in the sentence pairwise, so many pseudo entity pairs easily enter the relation classifier. In addition, the feature extraction stage does not consider sentence structure: a long sentence contains many details and modifiers, so extracting features directly from the whole sentence easily picks up a large amount of noise and redundant information, and because the features are extracted in the same way for every entity pair they are not discriminative. This tends to reduce the accuracy and performance of entity relation extraction and to increase computing resource and time overhead.
The invention provides a hierarchical Chinese entity relation extraction method oriented to the centering structure. The method mainly addresses Chinese entity relation extraction for long sentences containing centering structures marked by the word "的" (de). According to the sentence pattern characteristics and grammatical structure of the centering structure, each centering structure fragment containing the word is extracted from the sentence as a whole, and entities are graded according to their structural differences in the sentence. Different strategies are adopted to generate entity pairs according to the level of the entity, so that the quantity and quality of the entity pairs input to relation extraction are optimized. In addition, entities of different levels differ greatly in their sensitivity to some features, especially syntactic features, so different feature extraction methods are provided for entities of different levels and a hierarchical relation extraction method is adopted. In particular, during first-level relation extraction, the idea of removing branches and leaves while preserving the trunk is adopted: long sentences are recombined and simplified, the complexity of sentence patterns is reduced, and more effective feature information is captured during feature extraction. Furthermore, the invention provides an implicit relation reasoning rule that performs global reasoning over entities of different levels, so that richer implicit relations are obtained.
Compared with the prior art, the application has the following advantages:
(1) Optimizing the data input to relation extraction
Different strategies are adopted to generate entity pairs according to the level of the entity, which reduces the generation of pseudo entity pairs, optimizes the quantity and quality of the entity pairs input to relation extraction, and helps reduce computing resources and time.
(2) Improving relation extraction performance
Different feature extraction methods are adopted for the characteristics of entities at different levels, and a hierarchical relation extraction method is provided in which relation extraction is confined within each level, thereby improving relation extraction performance. In particular, during first-level relation extraction, the idea of removing branches and leaves while preserving the trunk is adopted: long sentences are recombined and simplified, the complexity of sentence patterns is reduced, attention is focused on the key information in the sentence, and more effective feature information is captured during feature extraction.
(3) Acquiring more implicit entity relations
Based on the different dependency syntactic paths among entities, the differing modification scopes between entities are used to let entities of different levels interact, so that the entity relations extracted at different levels are linked, global reasoning over entity relations is realized, and more implicit entity relations are extracted.
Referring to FIG. 3, which shows a schematic structural diagram of a hierarchical Chinese entity relation extraction system oriented to a centering structure according to a second embodiment of the present invention, the system includes:
the personal data acquisition module 10, configured to acquire a plurality of pieces of personal data from a target platform and preprocess the personal data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;
further, the personal data acquisition module 10 further includes:
the word segmentation execution unit, configured to perform word segmentation using the LTP-Cloud platform, match the segmented file against a person name dictionary and a scenic spot dictionary after segmentation, and recombine any dictionary word that has been split apart;
and to perform part-of-speech tagging on the data file after word segmentation is completed.
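The dictionary matching and recombination step can be sketched as a greedy longest-match merge over the segmenter output. This is a minimal sketch under stated assumptions: the tokens are taken to have been produced already (no LTP-Cloud call is shown), the function name and the `max_span` limit are hypothetical, and the sample words are illustrative dictionary entries.

```python
from typing import List, Set


def recombine_split_words(tokens: List[str], dictionary: Set[str], max_span: int = 4) -> List[str]:
    """Merge runs of adjacent tokens whose concatenation is a dictionary word
    (for example a person name or scenic spot that the segmenter split apart)."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first, down to spans of two tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            if "".join(tokens[i:j]) in dictionary:
                match_end = j
                break
        if match_end is None:
            merged.append(tokens[i])
            i += 1
        else:
            merged.append("".join(tokens[i:match_end]))
            i = match_end
    return merged


# Hypothetical segmenter output in which a name and a place were split.
segmented = ["张", "三", "访问", "庐山", "宾馆"]
lexicon = {"张三", "庐山宾馆"}
print(recombine_split_words(segmented, lexicon))
```

Longest-match-first is one reasonable policy when dictionary entries overlap; the source only states that split dictionary words are recombined and does not specify how conflicts are handled.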
The data set construction module 20 is configured to perform entity recognition on each sentence contained in the preprocessed personal data, so as to obtain the word segments contained in each sentence and the entities corresponding to them one by one, and to screen out sentences having at least two entities to form a data set;
the centering structure recognition module 30 is configured to perform centering structure recognition on each sentence in the data set, so as to take as a centering structure the part of a sentence that has a head-to-tail connection relationship and contains a preset word, and to determine whether an entity exists in the centering structure;
the part-of-speech tagging module 40 is configured to, if an entity exists in the centering structure, perform part-of-speech tagging on the nouns in the centering structure and define each entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result, where the part-of-speech tagging result includes a modified word and a modifying word;
further, the part-of-speech tagging module 40 further includes:
and the entity definition unit is used for defining the entity tagged as the modified word as a primary entity and defining the entity tagged as the modifying word as a secondary entity.
The feature extraction module 50 is configured to replace each centering structure in a sentence with its primary entity to reconstruct a new sentence, perform first-level feature extraction on the new sentence, and perform second-level feature extraction on the fragment of the centering structure that contains the secondary entity;
further, the feature extraction module 50 further includes:
the entity inserting unit, configured to delete, from the fragment of each centering structure containing the preset word in the sentence, all content except the primary entity, and insert the retained primary entity at the position of the centering structure in the sentence to obtain a new sentence;
a first-level feature extraction unit, configured to combine all primary entities in the recombined sentence pairwise to generate entity pairs and extract lexical and syntactic features for each entity pair, where the extracted features include the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination, and the nearest syntactically dependent verb;
a second-level feature extraction unit, configured to, if a verb exists in the part of the fragment preceding the preset word, define that verb as the verb feature of the entity pair;
and, if at least two verbs exist in the part of the fragment preceding the preset word, to select the verb separated from the secondary entity by the fewest words as the verb feature.
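The sentence reconstruction and the pairwise generation of first-level entity pairs described for this module can be sketched as follows. This is an illustrative sketch only: the span representation (start, end, index of the primary entity), the function names, and the English glosses standing in for Chinese tokens are assumptions, and the centering structures are assumed not to overlap.

```python
from itertools import combinations
from typing import List, Tuple

# (start, end, primary_index): the fragment covers tokens[start:end] and
# tokens[primary_index] is its primary entity (start <= primary_index < end).
Structure = Tuple[int, int, int]


def reconstruct_sentence(tokens: List[str], structures: List[Structure]) -> List[str]:
    """Replace every centering structure fragment with its primary entity."""
    rebuilt: List[str] = []
    cursor = 0
    for start, end, primary_index in sorted(structures):
        rebuilt.extend(tokens[cursor:start])   # text before the fragment
        rebuilt.append(tokens[primary_index])  # keep only the primary entity
        cursor = end
    rebuilt.extend(tokens[cursor:])            # text after the last fragment
    return rebuilt


def first_level_pairs(primary_entities: List[str]) -> List[Tuple[str, str]]:
    """Combine all primary entities of the rebuilt sentence two by two."""
    return list(combinations(primary_entities, 2))


# English glosses in Chinese word order; the fragment is tokens[0:5].
sentence = ["tour guide", "Zhang San", "lead", "的", "elderly tour group",
            "visit", "Lushan", ",", "stay at", "Lushan Hotel", "."]
simplified = reconstruct_sentence(sentence, [(0, 5, 4)])
entity_pairs = first_level_pairs(["elderly tour group", "Lushan", "Lushan Hotel"])
print(simplified)
print(entity_pairs)
```

Because the secondary entity is removed before pairing, it never enters the first-level candidate set, which is how the scheme avoids generating pseudo entity pairs such as one between the secondary entity and an unrelated primary entity.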
The relation extraction module 60 is configured to input the results of the first-level feature extraction and the second-level feature extraction into the support vector machine for relation extraction, so as to obtain implicit entity relations.
Further, the relation extraction module 60 further includes:
an entity relation reasoning unit, configured to, for a primary entity $e_i$ in a centering structure fragment, obtain every primary entity $e_j$ that has a relation with $e_i$ in the first-level relation extraction, together with the relation type $r_k$;
and, for each secondary entity $e_{it}$ in the centering structure fragment, to pair the secondary entity $e_{it}$ with the primary entity $e_j$ to form the entity pair $\langle e_{it}, e_j \rangle$; if the relation of the primary entity pair $\langle e_i, e_j \rangle$ is $r_k$, it is inferred that the relation of $\langle e_{it}, e_j \rangle$ is also $r_k$.
The invention also provides a storage medium, wherein one or more programs are stored on the storage medium, and the programs realize the hierarchical Chinese entity relation extraction method facing the centering structure when being executed by a processor.
The invention also provides a computer device, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program stored on the memory so as to realize the hierarchical Chinese entity relation extraction method facing the centering structure.
Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above examples merely represent a few embodiments of the present invention; they are described specifically and in detail but are not to be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of the invention shall be determined by the appended claims.

Claims (10)

1. A hierarchical Chinese entity relation extraction method oriented to a centering structure, applied to a natural language processing platform, characterized by comprising the following steps:
acquiring a plurality of pieces of personal data from a target platform, and preprocessing the personal data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;
performing entity recognition on each sentence contained in the preprocessed personal data to obtain the word segments contained in each sentence and the entities corresponding to them one by one, and screening out sentences having at least two entities to form a data set;
performing centering structure recognition on each sentence in the data set, so as to take as a centering structure the part of a sentence that has a head-to-tail connection relationship and contains a preset word, and determining whether an entity exists in the centering structure;
if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the part-of-speech tagging result comprises a modified word and a modifying word, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result;
replacing the centering structure in a sentence with the primary entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity;
and inputting the results of the first-level feature extraction and the second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations.
2. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 1, wherein the step of, if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, the part-of-speech tagging result comprising a modified word and a modifying word, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result comprises:
defining the entity tagged as the modified word as a primary entity, and defining the entity tagged as the modifying word as a secondary entity.
3. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 1, wherein the step of replacing the centering structure in a sentence with the primary entity to reconstruct a new sentence comprises:
deleting, from the fragment of each centering structure containing the preset word in the sentence, all content except the primary entity, and inserting the retained primary entity at the position of the centering structure in the sentence to obtain a new sentence.
4. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 3, wherein the step of performing first-level feature extraction on the new sentence comprises:
combining all primary entities in the recombined sentence pairwise to generate entity pairs, and extracting lexical and syntactic features for each entity pair, wherein the extracted features comprise the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination, and the nearest syntactically dependent verb.
5. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 4, wherein the step of performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity comprises:
if a verb exists in the part of the fragment preceding the preset word, defining that verb as the verb feature of the entity pair;
if at least two verbs exist in the part of the fragment preceding the preset word, selecting the verb separated from the secondary entity by the fewest words as the verb feature.
6. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 5, wherein the step of inputting the results of the first-level feature extraction and the second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations comprises:
for a primary entity $e_i$ in a centering structure fragment, obtaining every primary entity $e_j$ that has a relation with $e_i$ in the first-level relation extraction, together with the relation type $r_k$;
for each secondary entity $e_{it}$ in the centering structure fragment, pairing the secondary entity $e_{it}$ with the primary entity $e_j$ to form the entity pair $\langle e_{it}, e_j \rangle$; if the relation of the primary entity pair $\langle e_i, e_j \rangle$ is $r_k$, inferring that the relation of $\langle e_{it}, e_j \rangle$ is also $r_k$.
7. The hierarchical Chinese entity relation extraction method oriented to a centering structure according to claim 1, wherein the step of acquiring a plurality of pieces of personal data from a target platform and preprocessing the personal data, the preprocessing comprising word segmentation and part-of-speech tagging, comprises:
performing word segmentation using the LTP-Cloud platform, matching the segmented file against a person name dictionary and a scenic spot dictionary after segmentation, and recombining any dictionary word that has been split apart;
and performing part-of-speech tagging on the data file after word segmentation is completed.
8. A hierarchical Chinese entity relation extraction system oriented to a centering structure, applied to a natural language processing platform, the system comprising:
a personal data acquisition module, configured to acquire a plurality of pieces of personal data from a target platform and preprocess the personal data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;
a data set construction module, configured to perform entity recognition on each sentence contained in the preprocessed personal data so as to obtain the word segments contained in each sentence and the entities corresponding to them one by one, and to screen out sentences having at least two entities to form a data set;
a centering structure recognition module, configured to perform centering structure recognition on each sentence in the data set so as to take as a centering structure the part of a sentence that has a head-to-tail connection relationship and contains a preset word, and to determine whether an entity exists in the centering structure;
a part-of-speech tagging module, configured to, if an entity exists in the centering structure, perform part-of-speech tagging on the nouns in the centering structure and define the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result, wherein the part-of-speech tagging result comprises a modified word and a modifying word;
a feature extraction module, configured to replace the centering structure in a sentence with the primary entity to reconstruct a new sentence, perform first-level feature extraction on the new sentence, and perform second-level feature extraction on the fragment of the centering structure that contains the secondary entity;
and a relation extraction module, configured to input the results of the first-level feature extraction and the second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations.
9. A storage medium storing one or more programs which, when executed by a processor, implement the hierarchical Chinese entity relation extraction method oriented to a centering structure according to any one of claims 1-7.
10. A computer device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to implement the hierarchical Chinese entity relation extraction method oriented to a centering structure according to any one of claims 1-7 when executing the computer program stored in the memory.
CN202410065908.4A 2024-01-17 2024-01-17 Hierarchical Chinese entity relation extraction method and system for centering structure Active CN117609518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410065908.4A CN117609518B (en) 2024-01-17 2024-01-17 Hierarchical Chinese entity relation extraction method and system for centering structure

Publications (2)

Publication Number Publication Date
CN117609518A (en) 2024-02-27
CN117609518B (en) 2024-04-26

Family

ID=89958136

Country Status (1)

Country Link
CN (1) CN117609518B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933027A (en) * 2015-06-12 2015-09-23 华东师范大学 Open Chinese entity relation extraction method using dependency analysis
CN109241538A (en) * 2018-09-26 2019-01-18 上海德拓信息技术股份有限公司 Based on the interdependent Chinese entity relation extraction method of keyword and verb
CN109670024A (en) * 2018-12-17 2019-04-23 北京百度网讯科技有限公司 Logical expression determines method, apparatus, equipment and medium
CN111274394A (en) * 2020-01-16 2020-06-12 重庆邮电大学 Method, device and equipment for extracting entity relationship and storage medium
WO2020191993A1 (en) * 2019-03-22 2020-10-01 北京语自成科技有限公司 Method for syntactic parsing of natural language
CN113255320A (en) * 2021-05-13 2021-08-13 北京熙紫智数科技有限公司 Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
WO2022134779A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Method, apparatus and device for extracting character action related data, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
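甘丽新: "Chinese entity relation extraction based on syntactic and semantic analysis", China Doctoral Dissertations Full-text Database (Information Science and Technology), vol. 2018, no. 01, 15 January 2018 (2018-01-15), pages 138 - 117 *
甘丽新;万常选;刘德喜;钟青;江腾蛟: "Chinese entity relation extraction based on syntactic and semantic features", Journal of Computer Research and Development, no. 02, 15 February 2016 (2016-02-15) *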

Similar Documents

Publication Publication Date Title
RU2686000C1 (en) Retrieval of information objects using a combination of classifiers analyzing local and non-local signs
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
RU2610241C2 (en) Method and system for text synthesis based on information extracted as rdf-graph using templates
Hovy et al. Question Answering in Webclopedia.
Madabushi et al. Integrating question classification and deep learning for improved answer selection
Kushmerick et al. Adaptive information extraction: Core technologies for information agents
RU2639655C1 (en) System for creating documents based on text analysis on natural language
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
US20140156264A1 (en) Open language learning for information extraction
Karkaletsis et al. Ontology based information extraction from text
US11113470B2 (en) Preserving and processing ambiguity in natural language
CN111382571B (en) Information extraction method, system, server and storage medium
Labusch et al. Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT.
Derici et al. Question analysis for a closed domain question answering system
Huang et al. Query expansion based on statistical learning from code changes
Denis New learning models for robust reference resolution
CN117609518B (en) Hierarchical Chinese entity relation extraction method and system for centering structure
Lima et al. Relation extraction from texts with symbolic rules induced by inductive logic programming
Genest et al. Absum: a knowledge-based abstractive summarizer
CN112487154B (en) Intelligent search method based on natural language
Zolotas et al. Type inference in flexible model-driven engineering using classification algorithms
Seneviratne et al. Inductive logic programming in an agent system for ontological relation extraction
Chiruzzo et al. Building a supertagger for Spanish HPSG
CN113821605B (en) Event extraction method
Glass et al. Hierarchical rule generalisation for speaker identification in fiction books

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant