CN117609518A

CN117609518A - Hierarchical Chinese entity relation extraction method and system for centering structure

Info

Publication number: CN117609518A
Application number: CN202410065908.4A
Authority: CN
Inventors: 甘丽新; 涂伟; 陈敏; 曹瑛; 毕文霞; 饶志华; 刘伟凯; 刘斌; 程琳; 陈英玮; 李蔚洪
Original assignee: Jiangxi Science and Technology Normal University
Current assignee: Jiangxi Science and Technology Normal University
Priority date: 2024-01-17
Filing date: 2024-01-17
Publication date: 2024-02-27
Anticipated expiration: 2044-01-17
Also published as: CN117609518B

Abstract

The invention provides a hierarchical Chinese entity relation extraction method and system for centering structures (attributive-head structures). The method comprises: obtaining a plurality of pieces of humanities data from a target platform and preprocessing them; performing entity recognition on each sentence of the preprocessed humanities data; performing centering structure recognition on each sentence in the data set; if an entity exists in a centering structure, performing part-of-speech tagging on the nouns in the centering structure; replacing the centering structure in the sentence with its primary entity to reconstruct a new sentence; performing feature extraction on the new sentence; and inputting the feature extraction result into a support vector machine for relation extraction, thereby obtaining implicit entity relations. The invention addresses the low accuracy and poor results of the prior art when extracting entity relations from complex long sentences.

Description

Hierarchical Chinese entity relation extraction method and system for centering structure

Technical Field

The invention relates to the technical field of natural language processing, in particular to a hierarchical Chinese entity relation extraction method and system for a centering structure.

Background

Entity relation extraction is an important research direction in natural language processing. In the information age, text data is generated and accumulated in large volumes, and extracting useful information from it is an urgent problem. Entity relation extraction helps people understand and use this text data and supports a variety of application scenarios. With the continuous development of artificial intelligence, entity relation extraction has also become one of the basic technologies needed for tasks such as constructing knowledge graphs and developing intelligent question-answering systems. Analyzing and mining the relations among entities gives a better understanding of how things in the world are connected and further advances artificial intelligence. Research on entity relation extraction is therefore significant: it advances natural language processing, supports other fields (such as tourism, medicine, finance and law) and has broad influence across many application scenarios.

In the big data era, a large amount of humanities information about Chinese tourist attractions and news reports has appeared on the Internet. Texts in these fields summarize the relations between people or organizations and scenic spots comprehensively, so they consist mostly of complex long sentences. Long Chinese sentences contain many modifiers, clauses and other complex structures, which make the text harder for a model to understand. Furthermore, a long sentence may contain multiple entities and relations whose positions and order vary, so the model needs more inference capability to identify them correctly. Long sentences therefore pose difficulties for Chinese entity relation extraction and affect the performance and efficiency of relation extraction models, including their performance in specific domains.

However, existing Chinese entity relation extraction methods have not specifically studied relation extraction from long sentences. They typically combine all entities in a sentence pairwise to generate entity pairs, extract features for each pair from the whole sentence, and finally use those features for relation extraction.

As a result, existing relation extraction methods feed many pseudo entity pairs into the relation classifier. In addition, the feature extraction stage ignores the fact that long sentences contain many details and modifiers: lexical, syntactic and semantic features of each entity pair are extracted directly from the whole sentence, so noise is easily extracted, and the same kinds of features are used for every entity pair without distinguishing between them. This lowers the accuracy and performance of entity relation extraction and increases computing resource and time overhead.

Disclosure of Invention

Based on the above, the invention aims to provide a hierarchical Chinese entity relation extraction method and system for a centering structure, so as to solve the problems of lower extraction accuracy and poorer extraction effect in the prior art when extracting entity relations of complex long sentences.

In a first aspect, the present invention provides a hierarchical Chinese entity relation extraction method for centering structures, applied to a natural language processing platform, the method comprising:

acquiring a plurality of pieces of humanities data from a target platform, and preprocessing the humanities data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;

performing entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and screening out sentences containing at least two entities to form a data set;

performing centering structure identification on each sentence in the data set, so that a part of the sentence that has a head-to-tail connection relation and contains a preset word is taken as a centering structure, and judging whether an entity exists in the centering structure;

if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the tagging result distinguishes modifiers from modified words, and defining each entity as a primary entity and/or a secondary entity according to the tagging result;

replacing the centering structure in the sentence with its primary entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity;

and inputting the results of the first-level and second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations.

In summary, the above hierarchical Chinese entity relation extraction method for centering structures mainly addresses Chinese entity relation extraction from long sentences that contain centering structures marked by the word "的". According to the sentence pattern and grammatical structure of the centering structure, each "的" centering structure fragment in a sentence is extracted as a whole, and entities are classified according to their structural role in the sentence. Different strategies are used to generate entity pairs depending on entity level, which optimizes both the number and the quality of the entity pairs fed into relation extraction. In addition, entities of different levels differ greatly in their sensitivity to certain features, especially syntactic features, so different feature extraction methods are provided for entities of different levels and a hierarchical relation extraction method is adopted. In particular, first-level relation extraction follows the idea of removing branches and leaves while keeping the trunk: long sentences are recombined and simplified, which reduces sentence complexity and allows more effective feature information to be captured during feature extraction. Finally, the invention provides an implicit relation reasoning rule that performs global reasoning over entities of different levels, thereby obtaining richer implicit relations.

In a preferred embodiment of the present invention, the step of, if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the tagging result distinguishes modifiers from modified words, and defining each entity as a primary entity and/or a secondary entity according to the tagging result includes:

defining an entity tagged as a modified word as a primary entity, and defining an entity tagged as a modifier as a secondary entity.

In a preferred embodiment of the present invention, the step of replacing the centering structure in the sentence with its primary entity to reconstruct a new sentence includes:

deleting, from the fragment contained in each centering structure with the preset word in the sentence, all content other than the primary entity, and inserting the retained primary entity at the position of the centering structure in the sentence to obtain a new sentence.

In a preferred embodiment of the present invention, the step of performing first-level feature extraction on the new sentence includes:

combining all primary entities in the recombined sentence pairwise to generate entity pairs, and extracting lexical and syntactic features for each entity pair, wherein the extracted features comprise the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination and the nearest syntactically dependent verb.

In a preferred embodiment of the present invention, the step of performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity includes:

if one verb exists in the part of the fragment preceding the preset word, defining that verb as the verb feature of the entity pair;

if at least two verbs exist in the part of the fragment preceding the preset word, selecting the verb separated from the secondary entity by the fewest words as the verb feature.

In a preferred embodiment of the present invention, the step of inputting the results of the first-level and second-level feature extraction into the support vector machine for relation extraction to obtain implicit entity relations includes:

for the primary entity e_i in a centering structure fragment, acquiring every primary entity e_j found to have a relation with e_i in the first-level relation extraction, together with the relation type r_k;

for each secondary entity e_it in the centering structure fragment, forming the entity pair <e_it, e_j> from the secondary entity e_it and the primary entity e_j; if the relation of the entity pair <e_i, e_j> of the primary entity e_i is r_k, inferring that the relation of <e_it, e_j> is also r_k.

In a preferred embodiment of the present invention, the step of acquiring a plurality of pieces of humanities data from a target platform and preprocessing the humanities data, wherein the preprocessing comprises word segmentation and part-of-speech tagging, includes:

performing word segmentation using the LTP-Cloud platform, matching the segmented file against a personal name dictionary and a scenic spot dictionary, and recombining any dictionary word that has been split apart;

and performing part-of-speech tagging on the data file after word segmentation is finished.

In a second aspect, the present invention provides a hierarchical Chinese entity relation extraction system for centering structures, applied to a natural language processing platform, the system comprising:

a humanities data acquisition module, configured to acquire a plurality of pieces of humanities data from a target platform and to preprocess the humanities data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;

a data set construction module, configured to perform entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and to screen out sentences containing at least two entities to form a data set;

a centering structure identification module, configured to perform centering structure identification on each sentence in the data set, so that a part of the sentence that has a head-to-tail connection relation and contains a preset word is taken as a centering structure, and to judge whether an entity exists in the centering structure;

a part-of-speech tagging module, configured to perform, if an entity exists in the centering structure, part-of-speech tagging on the nouns in the centering structure, wherein the tagging result distinguishes modifiers from modified words, and to define each entity as a primary entity and/or a secondary entity according to the tagging result;

a feature extraction module, configured to replace the centering structure in the sentence with its primary entity to reconstruct a new sentence, to perform first-level feature extraction on the new sentence, and to perform second-level feature extraction on the fragment of the centering structure that contains the secondary entity;

and a relation extraction module, configured to input the results of the first-level and second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations.

In a third aspect, the present invention provides a storage medium storing one or more programs which, when executed, implement the hierarchical Chinese entity relation extraction method for centering structures described above.

Another aspect of the invention also provides a computer device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is used for realizing the hierarchical Chinese entity relation extraction method facing the centering structure when executing the computer program stored in the memory.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 is a flowchart of a hierarchical Chinese entity relationship extraction method for centering structures according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of implicit relationship reasoning in a first embodiment of the invention;

FIG. 3 is a schematic structural diagram of a hierarchical Chinese entity relation extraction system for centering structures according to a second embodiment of the present invention.

The invention will be further described in the following detailed description in conjunction with the above-described figures.

Detailed Description

In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Referring to FIG. 1, a flowchart of a hierarchical Chinese entity relation extraction method for centering structures according to the first embodiment of the present invention is shown. The method is applied to a natural language processing platform and includes steps S01 to S06, wherein:

Step S01: acquiring a plurality of pieces of humanities data from a target platform, and preprocessing the humanities data, wherein the preprocessing comprises word segmentation and part-of-speech tagging;

Illustratively, humanities data about Lushan is automatically crawled from the Lushan home website using the Scrapy toolkit in Python. To improve the accuracy of subsequent word segmentation, a dictionary of common personal names is downloaded from the Sogou lexicon. In addition, the scenic spots related to Lushan are downloaded from the Mafengwo travel website to serve as the scenic spot dictionary.

For example, the humanities data includes the following complex sentence: "The International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites, sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the National Cultural Relics Bureau, was grandly held in Lushan."

In addition, data preprocessing includes word segmentation, part-of-speech tagging, dependency syntax analysis and entity recognition. First, word segmentation is performed using the LTP-Cloud platform of Harbin Institute of Technology; the segmented file is then matched against the common personal name dictionary and the scenic spot dictionary, and any dictionary word that has been split apart is recombined. Part-of-speech tagging is then performed on the data file after word segmentation is finished.
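The dictionary matching step described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `merge_dictionary_words`, the `max_span` limit and the sample tokens are assumptions introduced here, and the segmenter output is represented as a plain list of strings instead of a call to LTP-Cloud.

```python
def merge_dictionary_words(tokens, dictionary, max_span=4):
    """Re-join adjacent tokens whose concatenation is a dictionary entry.

    tokens: word segments produced by a segmenter (hypothetical sample input).
    dictionary: set of personal names and scenic spot names.
    max_span: assumed upper bound on how many tokens one entry may span.
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so that longer names win.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged.append(candidate)
                i += span
                break
        else:
            # No dictionary entry starts at this token; keep it unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["庐山", "宾馆", "位于", "庐", "山"]
dictionary = {"庐山宾馆", "庐山"}
print(merge_dictionary_words(tokens, dictionary))
```

Trying longer spans first is a design choice made for this sketch, so that a scenic spot name such as "庐山宾馆" is not broken at the shorter entry "庐山".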

Illustratively, the word segmentation and part-of-speech tagging results of the example sentences are as follows:

(national heritage center of textbook organization of united states/p/national committee of textbook organization of China/n,/wp national institutes of construction/ni,/wp national cultural relics office/n sponsor/v/u world heritage travel sustainable development and management international seminar/n at/p mountain/ns augmentation/a holding/v./wp).

Step S02: performing entity recognition on each sentence contained in the preprocessed humanities data to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and screening out sentences containing at least two entities to form a data set;

It should be noted that real-world complex long sentences vary widely and may concern tourism, business and so on. It is therefore necessary first to determine the domain of the data set being processed, then to determine the entity types that may be involved, and finally to perform entity recognition on the sentences according to those entity types.

For example, the example sentence belongs to the field of tourism news, which mainly involves entity types such as scenic spots, places, persons, organizations, works and activities. The LTP-Cloud platform is used to perform entity recognition on each sentence according to the determined entity types, sentences containing two or more entities form the data set for entity relation extraction, and all entities are initialized as primary entities.

Illustratively, the entity types (3) are: organization, location and activity.

Primary entities (6): UNESCO World Heritage Centre (organization), Chinese National Commission for UNESCO (organization), Ministry of Construction (organization), National Cultural Relics Bureau (organization), International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites (activity) and Lushan (location).
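The screening in step S02 can be sketched as below. The record layout (a dict with `text` and `entities` keys, and a `level` field set to 1 for primary entities) is an assumption made for illustration; entity recognition itself is taken as already done.

```python
def build_dataset(sentences):
    """Keep sentences with at least two entities; mark every entity as primary.

    sentences: list of {"text": str, "entities": [{"name": str, "type": str}]}
    (hypothetical layout, not prescribed by the source).
    """
    dataset = []
    for sent in sentences:
        if len(sent["entities"]) >= 2:
            # All entities start as primary (level 1); step S04 may later
            # change some of them to secondary (level 2).
            entities = [dict(e, level=1) for e in sent["entities"]]
            dataset.append({"text": sent["text"], "entities": entities})
    return dataset


sample = [
    {"text": "s1", "entities": [{"name": "Lushan", "type": "location"}]},
    {"text": "s2", "entities": [{"name": "Lushan", "type": "location"},
                                {"name": "Seminar", "type": "activity"}]},
]
print(build_dataset(sample))
```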

Step S03: performing centering structure identification on each sentence in the data set, so that a part of the sentence that has a head-to-tail connection relation and contains a preset word is taken as a centering structure, and judging whether an entity exists in the centering structure;

In this embodiment, the preset word is "的", and the centering structures identified are mainly those containing the word "的". For each sentence in the data set, it is first judged whether the sentence contains the word "的". If so, the LTP-Cloud platform of Harbin Institute of Technology is used for centering structure identification: dependency syntax analysis is used to find the part with a head-to-tail connection relation as the centering structure, the position in the whole sentence of the fragment contained in each centering structure is recorded, and the fragment contained in each "的" centering structure is stored as a whole.

Illustratively, the centering structure recognition result for the example sentence is: "the International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites, sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the National Cultural Relics Bureau".

Step S04: if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, wherein the tagging result distinguishes modifiers from modified words, and defining each entity as a primary entity and/or a secondary entity according to the tagging result;

In this step, for each extracted fragment contained in a "的" centering structure, it is first judged whether the fragment contains an entity. If not, the fragment is deleted directly. If so, part-of-speech tagging is used to determine which nouns are modifiers and which are modified words, and the modifiers and modified words are extracted. The entities are then counted and classified according to their number: an entity tagged as a modified word is defined as a primary entity, and an entity tagged as a modifier is defined as a secondary entity.

Specifically, if the fragment contains only one entity and that entity is a modified word, it is the central entity and is a primary entity in the whole sentence;

if the fragment contains two or more entities, the primary entity is extracted and each entity serving as a modifier is changed to a secondary entity.

Illustratively, the entity classification result is:

Entities (5): International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites (primary entity), UNESCO World Heritage Centre (secondary entity), Chinese National Commission for UNESCO (secondary entity), Ministry of Construction (secondary entity) and National Cultural Relics Bureau (secondary entity).
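The classification rule of step S04 can be sketched as follows. The role labels `"modified"` and `"modifier"` and the function name are assumptions for illustration; deciding which noun is the modifier is assumed to have been done by the tagging step.

```python
def classify_fragment_entities(fragment_entities):
    """Assign level 1 (primary) or 2 (secondary) to entities in one fragment.

    fragment_entities: list of (entity_name, role) where role is "modified"
    or "modifier" (hypothetical labels). An empty list models a fragment
    without entities, which the method discards.
    """
    levels = {}
    for name, role in fragment_entities:
        # The modified word is the head of the fragment and stays primary;
        # every modifier entity becomes secondary.
        levels[name] = 1 if role == "modified" else 2
    return levels


fragment = [
    ("UNESCO World Heritage Centre", "modifier"),
    ("National Cultural Relics Bureau", "modifier"),
    ("International Seminar", "modified"),
]
print(classify_fragment_entities(fragment))
```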

Step S05: replacing the centering structure in the sentence with its primary entity to reconstruct a new sentence, performing first-level feature extraction on the new sentence, and performing second-level feature extraction on the fragment of the centering structure that contains the secondary entity;

It should be noted that sentence recombination follows the idea of removing branches and leaves while keeping the trunk: for the fragment contained in each "的" centering structure in the sentence, the modifying components are removed so that only the primary entity is retained, and the primary entity is then inserted at the corresponding position in the original sentence to form a new sentence. This simplifies the complex long sentence, reduces noise and redundant information, lowers the complexity of the grammatical structure and facilitates the subsequent first-level feature extraction.

Illustratively, the recombined sentence is: "The International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites was grandly held in Lushan."
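The recombination can be sketched on a token list as below. The span representation `(start, end, primary_entity)` with an exclusive end index is an assumption, and the English tokens are shortened stand-ins for the example sentence.

```python
def recombine(tokens, fragments):
    """Replace each centering structure fragment with its primary entity.

    tokens: segmented sentence.
    fragments: list of (start, end, primary_entity); end is exclusive and
    the spans are assumed not to overlap (hypothetical representation).
    """
    out = []
    pos = 0
    for start, end, primary in sorted(fragments):
        out.extend(tokens[pos:start])  # keep the text before the fragment
        out.append(primary)            # keep only the trunk of the fragment
        pos = end
    out.extend(tokens[pos:])           # keep the text after the last fragment
    return out


tokens = ["UNESCO World Heritage Centre", "、", "National Cultural Relics Bureau",
          "sponsored", "的", "International Seminar", "at", "Lushan", "held"]
print(recombine(tokens, [(0, 6, "International Seminar")]))
```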

Further, in the first-level feature extraction process, all primary entities in the recombined sentence are first combined pairwise to generate entity pairs. For each entity pair, lexical and syntactic features are extracted, specifically: the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relation combination and the nearest syntactically dependent verb.

For example, a first-level entity pair is generated from the primary entities in the recombined sentence, and its lexical and syntactic features are then extracted. The first-level entity pair is: <International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites, Lushan>.
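A minimal sketch of first-level pair generation is given below. It covers only the features that can be computed from token positions (type combination, distance and the context between the two entities); the two dependency-based features would require a parser and are omitted. The feature names and the input layout are assumptions for illustration.

```python
from itertools import combinations


def first_level_pairs(tokens, entities):
    """Pair all primary entities and compute simple positional features.

    tokens: recombined sentence as a token list.
    entities: list of (token_index, name, type) for the primary entities
    (hypothetical layout).
    """
    features = []
    for (i, name_a, type_a), (j, name_b, type_b) in combinations(sorted(entities), 2):
        features.append({
            "pair": (name_a, name_b),
            "type_combination": f"{type_a}-{type_b}",
            "distance": j - i - 1,      # number of tokens between the entities
            "context": tokens[i + 1:j], # tokens between the two entities
        })
    return features


tokens = ["International Seminar", "at", "Lushan", "held"]
entities = [(0, "International Seminar", "activity"), (2, "Lushan", "location")]
print(first_level_pairs(tokens, entities))
```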

The second-level extraction process is mainly performed within the fragments of "的" centering structures that contain secondary entities. Each "的" centering structure fragment in a sentence has only one primary entity but one or more secondary entities. Therefore, each secondary entity in the fragment is paired with the primary entity of the same fragment, and entity pairs formed from a secondary entity and other entities in the whole sentence are not considered for relation extraction, which reduces relation extraction on pseudo entity pairs. The fragment contained in a centering structure is generally short, its syntactic structure is relatively simple, and it usually contains one or more verbs; the type of a second-level entity relation is mainly determined by the verbs in the fragment. The verbs in the fragment are therefore extracted as the key feature for entity relation extraction. The verb feature is extracted as follows:

if one verb exists in the part of the fragment preceding the word "的", that verb is used as the verb feature of the entity pair;

if several verbs exist in the part of the fragment preceding the word "的", the nearest verb following the secondary entity is selected as the verb feature.

Illustratively, the four secondary entities in "the International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites, sponsored by the UNESCO World Heritage Centre, the Chinese National Commission for UNESCO, the Ministry of Construction and the National Cultural Relics Bureau" are each paired with the primary entity to generate second-level entity pairs: <UNESCO World Heritage Centre, International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites>, <Chinese National Commission for UNESCO, International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites>, <Ministry of Construction, International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites>, <National Cultural Relics Bureau, International Seminar on Sustainable Development and Management of Tourism at World Heritage Sites>. The verb feature is: "sponsor".
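The verb feature selection can be sketched as follows. The sketch assumes the fragment is a list of `(word, pos)` pairs with verbs tagged `"v"`, and it measures closeness as the absolute token distance to the secondary entity, which is one possible reading of "nearest"; the function name and the sample fragments are illustrative.

```python
def verb_feature(tagged_fragment, secondary_index, preset_word="的"):
    """Pick the verb feature for a (secondary, primary) entity pair.

    tagged_fragment: list of (word, pos) for one centering structure fragment.
    secondary_index: token index of the secondary entity in the fragment.
    Returns None when the fragment has no preset word or no verb before it.
    """
    words = [w for w, _ in tagged_fragment]
    if preset_word not in words:
        return None
    cut = words.index(preset_word)
    # Collect verbs before the preset word, keyed by distance to the entity.
    verbs = [(abs(i - secondary_index), i, w)
             for i, (w, pos) in enumerate(tagged_fragment[:cut]) if pos == "v"]
    if not verbs:
        return None
    # With a single verb this returns it; with several, the closest one.
    return min(verbs)[2]


one_verb = [("导游", "n"), ("张三", "nh"), ("带", "v"), ("的", "u"), ("老年人旅游团", "n")]
two_verbs = [("A", "ni"), ("组织", "v"), ("并", "c"), ("主办", "v"), ("的", "u"), ("研讨会", "n")]
print(verb_feature(one_verb, secondary_index=1))
print(verb_feature(two_verbs, secondary_index=0))
```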

Step S06: inputting the results of the first-level and second-level feature extraction into a support vector machine for relation extraction to obtain implicit entity relations.

Specifically, the features extracted for the first-level entity pairs in the previous step are input into a support vector machine (SVM) for relation prediction, and the features extracted for the second-level entity pairs are likewise input into the SVM for relation prediction;

Generally, of two entities in a "的" centering structure, one is the primary entity and the other is the secondary entity, and they stand in a modifier-modified relation. The secondary entity is a local entity that generally has a relation only with the primary entity in the centering structure and not with other entities in the whole sentence. In some special cases, however, more implicit relations can be extracted by reasoning. Based on differences in the dependency syntactic path between two entities, the invention uses differences in modification scope between entities to perform global reasoning on entity relations, thereby extracting more implicit entity relations. The specific steps are as follows:

Dependency syntax analysis is used to analyze the path structure of the "的" centering structure. The dependency path between the secondary entity and the primary entity is mainly of two types:

1) ATT-(SBV)-[COO] (where ATT is the attributive (centering) relation, SBV is the subject-verb relation and COO is the coordinate relation): patterns such as "... led by ..." and "... formed/organized by ...".

2) ATT-(ADV)->POB-(COO) (where ADV is the adverbial relation and POB is the preposition-object relation): patterns such as "composed of / led by / dispatched by / brought by ..." and "headed by / represented by / led by ...".

For the above two types of path structure, although the secondary entity modifies the primary entity within the centering structure fragment, in the whole sentence the secondary entity also participates in all relations in which the primary entity takes part, so implicit relations between the secondary entity and other primary entities can be inferred. It should be noted that in actual extraction the relations between two entities are varied and can be selected according to the user's needs, for example parent-child relations, visit relations, location relations and so on.

Specifically, referring to FIG. 2, for a primary entity in a centering structure fragmente _i First, all the first-level entities extracted in a hierarchical relation extraction are obtainede _i Primary entity of occurrence relatione _j Relationship typer _k ；

For each secondary entity in the centering structure segmente _it Will bee _it And (3) withe _j Composing entity pairs<e _it, e _j >If the first level entitye _i Entity pairs of (2)<e _i, e _j >The relation of (2) is thatr _k Then can infer<e _it, e _j >Also isr _k 。

For example, for the example sentence: "The elderly tour group led by tour guide Zhang San visited Lushan and stayed at the Lushan Hotel."

Centering structure: the elderly tour group led by tour guide Zhang San

Primary entities: elderly tour group, Lushan, Lushan Hotel

Secondary entity: Zhang San

First-level entity relation extraction: < elderly tour group, Lushan, visiting relation >, < elderly tour group, Lushan Hotel, lodging relation >.

Second-level entity relation extraction: < Zhang San, elderly tour group, leading relation >.

Implicit relations: < Zhang San, Lushan, visiting relation >, < Zhang San, Lushan Hotel, lodging relation >.
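The inference rule above can be illustrated with a minimal Python sketch. The names `Relation`, `infer_implicit_relations` and `modifiers` are hypothetical and do not come from the specification, and the sketch assumes that the primary entity e_i is stored as the first element of each first-level triple.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    head: str      # first entity of the pair (e_i or e_it)
    tail: str      # second entity of the pair (e_j)
    rel_type: str  # relation type r_k


def infer_implicit_relations(first_level, modifiers):
    """Copy each first-level relation <e_i, e_j, r_k> to the secondary entities of e_i.

    first_level: relations already extracted between primary entities.
    modifiers:   maps a primary entity e_i to the secondary entities e_it
                 found in its centering structure (assumed data layout).
    """
    inferred = []
    for rel in first_level:
        # Assumption: e_i is the head of the stored triple.
        for secondary in modifiers.get(rel.head, []):
            inferred.append(Relation(secondary, rel.tail, rel.rel_type))
    return inferred


# Data taken from the worked example in the text.
first_level = [
    Relation("elderly tour group", "Lushan", "visiting"),
    Relation("elderly tour group", "Lushan Hotel", "lodging"),
]
modifiers = {"elderly tour group": ["Zhang San"]}
implicit = infer_implicit_relations(first_level, modifiers)
```

With this input, `implicit` holds the two triples for Zhang San that the example lists as implicit relations; no classifier call is needed for them.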

Existing Chinese entity relation extraction usually works on the whole sentence and combines all entities in the sentence pairwise, so many pseudo entity pairs easily enter the relation classifier. In addition, the feature extraction stage for entity pairs does not consider the sentence structure, in particular the fact that a long sentence contains many details and modifiers. If features are extracted directly from the whole sentence, a great deal of noise and redundant information is easily extracted, and such features are the same for every entity pair and therefore lack discriminative power. This tends to lower the accuracy of entity relation extraction, degrade extraction performance, and increase computing resource and time overhead.

The invention provides a hierarchical Chinese entity relation extraction method for a centering structure. The method mainly addresses Chinese entity relation extraction from long sentences that contain a centering structure marked by the preset word. According to the sentence pattern characteristics and grammatical structure of the centering structure, the centering structure fragment in the sentence is extracted as a whole, and entities are graded according to their structural differences within the sentence. Different strategies are adopted to generate entity pairs according to the level of the entity, which optimizes the quantity and quality of the entity pairs input to relation extraction. In addition, entities of different levels differ greatly in their sensitivity to certain features, especially syntactic features, so different feature extraction methods are provided for entities of different levels and a hierarchical relation extraction method is adopted. In particular, during first-level relation extraction, the idea of pruning the branches and leaves while keeping the trunk is adopted: long sentences are recombined and simplified, the complexity of the sentence pattern is reduced, and more effective feature information is captured during feature extraction. The invention further provides an implicit relation inference rule that performs global inference over entities of different levels, so that richer implicit relations are obtained.

Compared with the prior art, the application has the following advantages:

(1) Optimizing the data input to relation extraction

According to the level of the entity, different strategies are adopted to generate entity pairs, which reduces the number of pseudo entity pairs generated, optimizes the quantity and quality of the entity pairs input to relation extraction, and helps reduce computing resources and time.

(2) Improving relation extraction performance

Different feature extraction methods are adopted according to the characteristics of entities of different levels, and a hierarchical relation extraction method is provided in which relation extraction is confined within each level, thereby improving relation extraction performance. In particular, during first-level relation extraction, the idea of pruning the branches and leaves while keeping the trunk is adopted: long sentences are recombined and simplified, the complexity of the sentence pattern is reduced, attention is focused on the key information in the sentence, and more effective feature information is captured during feature extraction.

(3) Acquiring more implicit entity relations

According to the different dependency syntax paths between entities, and using the different scopes of the modifying effect between entities, entities of different levels are made to interact, so that the entity relations extracted at different levels are linked, global inference over entity relations is realized, and more implicit entity relations are extracted.
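Advantage (1) can be made concrete with a small Python sketch that compares whole-sentence pairing with level-aware pairing. The function names and the `modifiers` mapping are hypothetical. Pairing each secondary entity only with the primary entity of its own centering structure is an assumption drawn from the worked example, not a rule stated in this section.

```python
from itertools import combinations


def naive_pairs(entities):
    """Whole-sentence baseline: every entity is paired with every other entity."""
    return list(combinations(entities, 2))


def hierarchical_pairs(primary, modifiers):
    """Level-aware pair generation.

    primary:   primary entities of the sentence, in order of appearance.
    modifiers: maps a primary entity to the secondary entities inside its
               centering structure (assumed data layout).
    """
    # First level: primary entities are combined two by two.
    first_level = list(combinations(primary, 2))
    # Second level (assumption): a secondary entity is paired only with the
    # primary entity it modifies.
    second_level = [(sec, head) for head, secs in modifiers.items() for sec in secs]
    return first_level, second_level


primary = ["elderly tour group", "Lushan", "Lushan Hotel"]
modifiers = {"elderly tour group": ["Zhang San"]}

baseline = naive_pairs(primary + ["Zhang San"])
first_level, second_level = hierarchical_pairs(primary, modifiers)
```

For the example sentence the baseline sends six pairs to the classifier, while the level-aware strategy sends four (three first-level, one second-level). The pairs between Zhang San and the other primary entities are left to the implicit relation inference rule.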

Referring to FIG. 3, a schematic structural diagram of a hierarchical Chinese entity relation extraction system for a centering structure according to a second embodiment of the present invention is shown. The system includes:

the human data acquisition module 10 is used for acquiring a plurality of human data from a target platform and preprocessing the human data, wherein the preprocessing comprises word segmentation processing and part-of-speech tagging processing;

further, the human data acquisition module 10 further includes:

the word segmentation execution unit is used for performing word segmentation with the LTP-Cloud platform, matching the segmented file against a personal name dictionary and a scenic spot dictionary after segmentation, and recombining any dictionary word that has been split apart by the segmentation;

The data set construction module 20 is configured to perform entity recognition on each sentence contained in the preprocessed human data, so as to obtain the word segments contained in each sentence and the entity corresponding to each word segment, and to select the sentences containing at least two entities to form a data set;

the centering structure recognition module 30 is configured to perform centering structure recognition on each sentence in the data set, so as to take the part of a sentence that contains a preset word and has a head-to-tail connection relationship as a centering structure, and to determine whether an entity exists in the centering structure;

the part-of-speech tagging module 40 is configured to perform part-of-speech tagging on the nouns in the centering structure if an entity exists in the centering structure, and to define the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result, where the part-of-speech tagging result includes a modifier and a modified word;

further, the part-of-speech tagging module 40 further includes:

the entity definition unit is used for defining the entity tagged as the modified word as a primary entity and defining the entity tagged as the modifier as a secondary entity.

The feature extraction module 50 is configured to replace the centering structure in a sentence with the primary entity to reconstruct a new sentence, perform first-level feature extraction on the new sentence, and perform second-level feature extraction on the segment of the centering structure that contains the secondary entity;

further, the feature extraction module 50 further includes:

the entity inserting unit is used for deleting, from the fragment of each centering structure containing the preset word in the sentence, all content other than the primary entity, and inserting the retained primary entity at the position of the centering structure in the sentence to obtain a new sentence;

a first-level feature extraction unit, configured to combine all primary entities in the new sentence two by two to generate entity pairs, and to extract lexical and syntactic features for each entity pair, where the extraction result includes the entity type combination, the entity context, the inter-entity distance, the dependency syntactic relationship combination, and the nearest syntactically dependent verb;

a second-level feature extraction unit, configured to define a verb as the verb feature of the entity pair if the part of the segment preceding the preset word contains that verb;

if the part of the segment preceding the preset word contains at least two verbs, the verb separated from the secondary entity by the fewest words is selected and defined as the verb feature.

The relationship extraction module 60 is configured to input the results of the first-level feature extraction and the second-level feature extraction into the support vector machine for relationship extraction, so as to obtain an implicit entity relationship.

Further, the relation extraction module 60 further includes:

an entity relationship reasoning unit, configured to, for a primary entity e_i in a centering structure segment, obtain from the results of the first-level relation extraction every primary entity e_j that has a relation with the primary entity e_i, together with the relation type r_k;

For each secondary entity e_it in the centering structure segment, the secondary entity e_it and the primary entity e_j are combined into the entity pair <e_it, e_j>; if the relation of the entity pair <e_i, e_j> of the primary entity e_i is r_k, it is inferred that the relation of <e_it, e_j> is also r_k.
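Three of the units described above (the word segmentation execution unit, the entity inserting unit and the second-level feature extraction unit) can be sketched on pre-segmented token lists as follows. This is only an illustration under stated assumptions: all function names are hypothetical, the preset word is assumed to be "的", the part-of-speech tag "v" is assumed to mark verbs, and the Chinese tokens are an illustrative segmentation rather than data from the specification. No LTP call is made.

```python
def merge_dictionary_words(tokens, dictionary):
    """Re-join adjacent tokens whose concatenation is a dictionary entry (longest match)."""
    max_len = max((len(w) for w in dictionary), default=0)
    merged, i = [], 0
    while i < len(tokens):
        match_end, joined = None, tokens[i]
        for j in range(i + 1, len(tokens)):
            joined += tokens[j]
            if len(joined) > max_len:
                break
            if joined in dictionary:
                match_end = j  # remember the longest match so far
        if match_end is None:
            merged.append(tokens[i])
            i += 1
        else:
            merged.append("".join(tokens[i:match_end + 1]))
            i = match_end + 1
    return merged


def rebuild_sentence(tokens, span, primary_index):
    """Replace the centering structure tokens[span[0]:span[1]] with its primary entity."""
    start, end = span
    return tokens[:start] + [tokens[primary_index]] + tokens[end:]


def nearest_verb_feature(tokens, pos_tags, span, secondary_index, preset_word="的"):
    """Return the verb before the preset word that is closest to the secondary entity."""
    start, end = span
    preset_index = tokens.index(preset_word, start, end)  # ValueError if absent
    verbs = [k for k in range(start, preset_index) if pos_tags[k] == "v"]
    if not verbs:
        return None
    return tokens[min(verbs, key=lambda k: abs(k - secondary_index))]


# Hypothetical tokens: "guide / Zhang San / lead / de / elderly tour group / visit / le / Lushan"
tokens = ["导游", "张三", "带领", "的", "老年人旅游团", "参观", "了", "庐山"]
pos_tags = ["n", "nh", "v", "u", "n", "v", "u", "ns"]
span = (0, 5)  # the centering structure covers tokens 0..4

remerged = merge_dictionary_words(["庐山", "宾馆"], {"庐山宾馆"})
new_sentence = rebuild_sentence(tokens, span, primary_index=4)
verb_feature = nearest_verb_feature(tokens, pos_tags, span, secondary_index=1)
```

In this sketch the new sentence keeps only the primary entity in place of the centering structure, so first-level features are computed on a shorter sentence, while the verb feature for the secondary entity is taken from inside the removed fragment.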

The invention also provides a storage medium on which one or more programs are stored, and the programs, when executed by a processor, implement the above hierarchical Chinese entity relation extraction method for a centering structure.

The invention also provides a computer device comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory so as to implement the above hierarchical Chinese entity relation extraction method for a centering structure.

Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain or store the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above examples represent only a few embodiments of the present invention; they are described in some detail, but are not to be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the invention. Accordingly, the scope of protection of the invention shall be determined by the appended claims.

Claims

1. A hierarchical Chinese entity relation extraction method oriented to a centering structure, applied to a natural language processing platform, characterized by comprising the following steps:

2. The method for extracting hierarchical Chinese entity relationship oriented to a centering structure according to claim 1, wherein the step of, if an entity exists in the centering structure, performing part-of-speech tagging on the nouns in the centering structure, where the part-of-speech tagging result includes a modifier and a modified word, and defining the entity as a primary entity and/or a secondary entity according to the part-of-speech tagging result comprises:

3. The method of claim 1, wherein the step of replacing the centering structure in the sentence with the first-level entity to reconstruct a new sentence comprises:

4. The hierarchical Chinese entity relationship extraction method for a centering structure as claimed in claim 3, wherein said step of performing first-level feature extraction on said new sentence comprises:

5. The method for extracting hierarchical Chinese entity relationship for a centering structure according to claim 4, wherein said step of performing second-level feature extraction on the segment of said centering structure containing said secondary entity comprises:

6. The method for extracting hierarchical Chinese entity relationship for centering structure according to claim 5, wherein said step of inputting the results of the first-level feature extraction and the second-level feature extraction into a support vector machine for relationship extraction to obtain the implicit entity relationship comprises:

7. The method for extracting hierarchical Chinese entity relationship oriented to a centering structure according to claim 1, wherein the step of obtaining a plurality of pieces of human data from a target platform and preprocessing the human data, where the preprocessing includes word segmentation processing and part-of-speech tagging processing, comprises:

8. A hierarchical Chinese entity relationship extraction system for a centering structure, applied to a natural language processing platform, the system comprising:

9. A storage medium storing one or more programs which, when executed by a processor, implement the hierarchical Chinese entity relationship extraction method for a centering structure according to any one of claims 1 to 7.

10. A computer device comprising a memory and a processor, wherein:

the memory is used for storing a computer program;

the processor is configured to implement the hierarchical Chinese entity relationship extraction method for a centering structure according to any one of claims 1 to 7 when executing the computer program stored in the memory.