CN114357270A

CN114357270A - Method for extracting and pre-labeling entity relationship

Info

Publication number: CN114357270A
Application number: CN202111274804.7A
Authority: CN
Inventors: 胡志强; 马政; 李志鹏; 石珺
Original assignee: Shenzhen Wanglian Anrui Network Technology Co ltd
Current assignee: Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-04-15

Abstract

The invention discloses a method for extracting and pre-labeling entity relationships, and relates to the technical field of knowledge graphs. Entity keywords are used as head entities to crawl the structured data in the information frame (infobox) on the right side of the related wiki page; each triple is formed from the head entity and the tail entity corresponding to each field in the information frame; the corresponding sentences in the original wiki text are matched according to the head and tail entities of each triple, and if the matching succeeds the next step is performed, while if it fails the triple is skipped and processing of the next entity keyword begins. The field corresponding to the relation of each triple is then matched against the relation keywords: if the matching succeeds, the extraction and pre-labeling of the triple and the determination of its entity types and relation are completed; if the matching fails, the head entity or the tail entity is still pre-labeled, and when the relation between the entities cannot be judged from the relation keywords, the judgment of the relation is handed over to manual work. The invention solves the problems that manual labeling is time-consuming, labor-intensive and costly.

Description

Method for extracting and pre-labeling entity relationship

Technical Field

The invention belongs to the technical field of knowledge graphs, and particularly relates to a method for extracting and pre-labeling entity relationships.

Background

In addition to the advancement of the algorithm itself, data is also a very important factor for deep learning based entity relationship extraction. For data-driven algorithms, the quality of the labeled data often directly determines the effect of the algorithm. At present, the way of acquiring the labeling data is mainly manual labeling, and the way is time-consuming, labor-consuming and low in efficiency.

The prior art relies mainly on manual labeling, whose main steps are to judge which relation exists in a given sentence and then to label the head entity and the tail entity respectively. For example, take the sentence "Liang XX is a prominent modern architectural designer, and his father is Liang XX, a modern thinker and one of the leaders of the XX reform." First, it is judged that a "parent-child relationship" exists in the sentence; then it is judged that the head entity is "Liang XX" and the tail entity is "Liang XX", that is, (Liang XX, parent-child, Liang XX). There is also an "occupational relationship", with the head entity "Liang XX" and the tail entity "architectural designer", i.e., (Liang XX, occupational, architectural designer). A sentence may contain multiple relationships, and each relationship of interest has to be labeled one by one. Manual labeling is therefore very slow, and its cost is quite high.

Through the above analysis, the problems and defects of the prior art are as follows: manual labeling is inefficient and costly.

The difficulty in solving the above problems and defects is: a large amount of high-quality training corpora need to be subjected to manual intervention at first, and time and labor are wasted.

The significance of solving the problems and the defects is as follows: a large amount of pre-labeled training data with quality assurance can be obtained, and labeling work can be completed through manual intervention, so that labor cost is greatly saved.

Disclosure of Invention

In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide a method for extracting and pre-labeling entity relationships. The technical scheme is as follows:

according to a first aspect of the disclosed embodiments of the present invention, there is provided a method for extracting and pre-labeling entity relationships, including:

acquiring the entity keywords and relation keywords to be extracted, and using the entity keywords as head entities to crawl the structured data in the information frame on the right side of the related wiki page;

and automatically and accurately extracting the required triples from the structured data in the information frame on the right side of the Wikipedia page, and matching the sentences corresponding to the triples.

The method specifically comprises the following steps:

step one, acquiring entity keywords and relationship keywords;

step two, using the entity keywords as head entities to crawl the structured data in the information frame on the right side of the related wiki page;

step three, forming each triple from the head entity and the tail entity corresponding to each field in the information frame, the field being used to judge the relation class;

step four, matching the corresponding sentences in the original wiki text according to the head and tail entities of each triple; if the matching succeeds, proceeding to the next step, and if it fails, returning to step two;

step five, matching the field corresponding to the relation of each triple against the relation keywords; if the matching succeeds, judging the relation of the triple and the type of the tail entity according to the matched relation class, thereby completing the extraction and pre-labeling of the head and tail entities of the triple and the determination of the entity types and the relation; if the matching fails, all entities are retained and the head entity or the tail entity is pre-labeled, and when the relation between the entities cannot be judged from the relation keywords, the judgment of the relation is handed over to manual work.
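The five steps can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the `PAGES` dictionary is a hypothetical in-memory stand-in for crawled wiki pages (no network access is performed), the keyword tables are invented for the example, and the sketch assumes that a triple without a supporting sentence is simply skipped before the loop moves on.

```python
import re

# Hypothetical stand-in for crawled wiki pages: infobox fields plus body text.
PAGES = {
    "Li XX": {
        "infobox": {"spouse": "Zhao XX", "political party": "XX Party", "height": "1.75 m"},
        "text": "Li XX was one of the founders of XX Party. "
                "Li XX married Zhao XX. Li XX stood 1.75 m tall.",
    },
    "Zhong XX": {
        "infobox": {"nationality": "China"},
        "text": "Zhong XX is a physician.",
    },
}

# Step one: entity keywords (with their types) and classified relation
# keywords (infobox field -> type of the tail entity). Both are illustrative.
ENTITY_KEYWORDS = {"Li XX": "person", "Zhong XX": "person"}
RELATION_KEYWORDS = {"spouse": "person", "political party": "political party",
                     "nationality": "country"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation (naive)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pre_label(entity_keywords, relation_keywords, pages):
    records = []
    for head, head_type in entity_keywords.items():        # step two
        page = pages.get(head)
        if page is None:
            continue
        sentences = split_sentences(page["text"])
        for field, tail in page["infobox"].items():        # step three
            # Step four: first sentence that mentions both entities.
            sentence = next((s for s in sentences if head in s and tail in s), None)
            if sentence is None:
                continue  # no supporting sentence: skip this triple
            # Step five: a known field fixes relation and tail type;
            # otherwise the relation is left for manual judgment.
            matched = field in relation_keywords
            records.append({
                "sentence": sentence,
                "head": head, "head_type": head_type,
                "tail": tail,
                "tail_type": relation_keywords[field] if matched else None,
                "relation": field if matched else None,
                "needs_manual_relation": not matched,
            })
    return records


records = pre_label(ENTITY_KEYWORDS, RELATION_KEYWORDS, PAGES)
for r in records:
    print(r["head"], r["relation"], r["tail"], "|", r["sentence"])
```

In this toy data the "height" field is not a relation keyword, so its entities are kept but the relation is flagged for a human, and the "Zhong XX" page yields nothing because no sentence mentions both the head and the tail.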

In an embodiment of the present invention, in the first step, the entity class keyword is a target object to be crawled; the relation key words are fields in the right information frame of the Wikipedia, and the fields are collected and classified as the basis for judging the relation and the type of the tail entity.

In an embodiment of the present invention, in the second step, the obtained entity-class keywords are used as each head entity, structured data in the information frame on the right side of the relevant wiki page is crawled, traversal is performed in the relation-class keywords, a tail entity corresponding to the head entity is obtained, and the type of the tail entity is determined according to the relation-class keywords.

In an embodiment of the present invention, after matching corresponding sentences in the wiki original text according to the head and tail entities of the triples, the sentences are saved as a training set.
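A minimal sketch of this sentence-matching step is given below. It assumes plain substring matching and a naive punctuation-based sentence splitter; neither is specified in the text, and the sample sentence and the helper names `split_sentences` and `build_training_set` are hypothetical.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Split after Chinese or Western sentence-ending punctuation (naive)."""
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def build_training_set(text, triples):
    """Map each sentence to the triples whose head and tail both occur in it."""
    training = defaultdict(list)
    for sentence in split_sentences(text):
        for head, relation, tail in triples:
            # Exact substring match: pronouns such as "He" are not resolved.
            if head in sentence and tail in sentence:
                training[sentence].append((head, relation, tail))
    return dict(training)


text = ("Li XX, courtesy name XX, was a native of Hebei XX and one of the "
        "main founders of the XX Party. He later taught at XX University.")
triples = [
    ("Li XX", "political party", "XX Party"),
    ("Li XX", "native place", "Hebei XX"),
    ("Li XX", "alma mater", "XX University"),
]
training_set = build_training_set(text, triples)
for sentence, matched in training_set.items():
    print(sentence, matched)
```

One sentence can support several triples, so the sketch groups triples by sentence. The third triple finds no sentence because the second sentence refers to the head entity only by pronoun.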

In an embodiment of the present invention, in the fifth step, after matching each field corresponding to the relationship of each triplet with the relationship-type keyword, the index, the type, and the relationship of the head-tail entity corresponding to each triplet are stored into a corresponding format through a program, as shown in fig. 5, so as to complete the extraction and the pre-labeling.
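The stored format itself appears only in fig. 5 and is not reproduced in the text, so the record layout below is purely an assumption made for illustration: character offsets serve as the "index", and the field names `entities` and `relations` are hypothetical.

```python
import json


def make_record(sentence, head, head_type, tail, tail_type, relation):
    """Build one pre-label record with character offsets for both entities."""
    entities = []
    for name, etype in ((head, head_type), (tail, tail_type)):
        start = sentence.find(name)  # first occurrence only
        if start == -1:
            raise ValueError(f"{name!r} does not occur in the sentence")
        entities.append({"text": name, "type": etype,
                         "start": start, "end": start + len(name)})
    return {
        "sentence": sentence,
        "entities": entities,
        # head/tail refer to positions in the entities list above
        "relations": [{"head": 0, "tail": 1, "type": relation}],
    }


record = make_record("Li XX married Zhao XX.", "Li XX", "person",
                     "Zhao XX", "person", "spouse")
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Storing offsets rather than only the entity strings lets an annotation tool highlight the exact span a reviewer needs to confirm.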

In an embodiment of the present invention, if there is no sentence corresponding to the complete triple in the whole wiki text, the crawling and matching of the next entity data are started, and the process is repeated until all the entity class keywords are processed.

According to a second aspect of the disclosed embodiments of the present invention, there is provided an extraction and pre-labeling system for entity relationships, comprising:

the keyword acquisition module is used for acquiring entity keywords and relation keywords;

the structured data acquisition module is used for taking the entity key words as head entities and capturing structured data in the information frame on the right side of the related wiki webpage;

the triple composition module is used for forming each triple by the head entity and the tail entity corresponding to each field (used for judging the relation class) in the information frame;

the triple head and tail entity matching module is used for matching the corresponding sentences in the original wiki text according to the head and tail entities of each triple, matching each field against the relation keywords if the matching succeeds, and returning to execute the structured data acquisition module if the matching fails;

and the field and relation keyword matching module is used for matching the field corresponding to the relation of each triple against the relation keywords. If the matching succeeds, the relation of the triple is judged according to the matched relation class, completing the extraction and pre-labeling of the triple and the determination of its entity types and relation; if the matching fails, all entities are retained and the head entity or the tail entity is pre-labeled, and when the relation between the entities cannot be judged from the relation keywords, the judgment of the relation is handed over to manual work.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

the method starts from the entity keywords, aims at the structured data in the right information frame of the Wikipedia and realizes the automatic and accurate extraction of entity relationship triples by means of the relationship keywords; and sentences corresponding to the triples are matched, so that the problems of time and labor waste and high cost of manual labeling are solved.

The method was tested on 2240 crawled wiki texts. The program identified 112708 entities corresponding to 26403 sentences, among which 2407 relations (2407 triples) corresponding to 2407 sentences were identified. The run took 1 hour on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (12 CPUs).

Meanwhile, the quoted price for manual labeling is 0.32 yuan per relation and 0.3 yuan per entity, i.e. 2 × 0.3 + 0.32 = 0.92 yuan per triple. The 2407 triples therefore save 2214.44 yuan, and the 112708 entities save 33812.4 yuan.
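The cost figures can be checked with a few lines of arithmetic. The prices and counts are taken from the text; `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

PRICE_PER_RELATION = Decimal("0.32")  # quoted price per labeled relation
PRICE_PER_ENTITY = Decimal("0.3")     # quoted price per labeled entity

# One triple consists of two entities plus one relation.
price_per_triple = 2 * PRICE_PER_ENTITY + PRICE_PER_RELATION
triple_saving = 2407 * price_per_triple
entity_saving = 112708 * PRICE_PER_ENTITY

print(price_per_triple)  # 0.92
print(triple_saving)     # 2214.44
print(entity_saving)     # 33812.4
```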

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as disclosed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 is a flowchart of an entity relationship extraction and pre-labeling method according to an embodiment of the present invention.

Fig. 2 is a diagram of an entity class keyword interface according to an embodiment of the present invention.

Fig. 3 is a diagram of a relationship class keyword interface provided in an embodiment of the present invention.

Fig. 4 is an interface diagram of the format of the types of the head and tail entities corresponding to each triplet provided in the embodiment of the present invention.

Fig. 5 is a format interface diagram of the index corresponding to each triplet saved by the program according to the embodiment of the present invention.

Fig. 6 is a schematic diagram of an entity relationship extraction and pre-labeling system according to an embodiment of the present invention.

In the figures: 1. keyword acquisition module; 2. structured data acquisition module; 3. triple composition module; 4. triple head and tail entity matching module; 5. field and relation keyword matching module.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

As shown in FIG. 1, the innovation of the method for extracting and pre-labeling entity relationships provided by the embodiment of the present disclosure is that it uses the structured data in the information frame on the right side of a wiki page to address two problems: triples for a knowledge graph in a specific industry are difficult to extract, and labeling data costs time and money. If the triples and their corresponding sentences can be extracted conveniently, an algorithm can be used for pre-labeling. The method comprises the following steps:

step one, acquiring entity keywords and relation keywords;

step two, using the entity keywords as head entities to crawl the structured data in the information frame on the right side of the related wiki page;

step three, forming each triple from the head entity and the tail entity corresponding to each field (used for judging the relation class) in the information frame;

step four, matching the corresponding sentences in the original wiki text according to the head and tail entities of each triple; if the matching succeeds, proceeding to the next step, and if it fails, returning to step two;

and step five, matching the field corresponding to the relation of each triple against the relation keywords. If the matching succeeds, the relation of the triple is judged according to the matched relation class, completing the extraction and pre-labeling of the triple and the determination of its entity types and relation; if the matching fails, all entities are retained and the head entity or the tail entity is pre-labeled, and when the relation between the entities cannot be judged from the relation keywords, the judgment of the relation is handed over to manual work.
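The two outcomes of step five can be illustrated with a small routing sketch. The record fields, the `status` values and the keyword table are assumptions chosen for the example; the text only states that unmatched relations are handed to manual judgment.

```python
def route_record(record, relation_keywords):
    """Fill in the relation when the field is a known keyword, else flag it."""
    field = record["field"]
    if field in relation_keywords:
        record["relation"] = field
        record["tail_type"] = relation_keywords[field]
        record["status"] = "pre-labeled"
    else:
        # Entities stay labeled; only the relation awaits a human decision.
        record["relation"] = None
        record["tail_type"] = None
        record["status"] = "needs manual relation"
    return record


relation_keywords = {"spouse": "person", "nationality": "country"}
records = [
    {"head": "Li XX", "tail": "Zhao XX", "field": "spouse"},
    {"head": "Li XX", "tail": "1.75 m", "field": "height"},
]
# Copy each record so the input list is left unchanged.
routed = [route_record(dict(r), relation_keywords) for r in records]
manual_queue = [r for r in routed if r["status"] == "needs manual relation"]
print(len(routed) - len(manual_queue), "pre-labeled;", len(manual_queue), "for manual review")
```

Keeping the unmatched records, rather than discarding them, is what lets the annotator start from labeled entities and decide only the relation.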

In a preferred embodiment of the present invention, in the first step, the entity class keyword is a target object to be crawled; the relation key words are fields in the right information frame of the Wikipedia, and the fields are collected and classified as the basis for judging the relation and the type of the tail entity.

In a preferred embodiment of the present invention, in the second step, the obtained entity-type keyword is used as a head entity, and a relevant wiki page is crawled to obtain structured data in an information frame on the right side of the wiki page; each triplet is composed of the head entity and the tail entity corresponding to each field (used for judging the relation class) in the information frame.

In a preferred embodiment of the present invention, after matching corresponding sentences in the wiki original text according to the head and tail entities of the triples, the sentences are saved as a training set.

In an embodiment of the present invention, after matching each field corresponding to the relationship of the above triples with the relationship type keyword in the step five, the index, type, and relationship of the head and tail entities corresponding to each triplet are stored into a corresponding format by a program, as shown in fig. 5, so as to complete the extraction and pre-labeling.

If the whole wiki text contains no sentence that fully corresponds to the triple, the triple is skipped, crawling and matching of the next entity's data begins, and the cycle repeats until all entity keywords have been processed.

The technical solution of the present invention is further described below with reference to specific examples.

Example:

the method for extracting and pre-labeling the entity relationship provided by the embodiment of the invention specifically comprises the following steps:

First, the entity class keywords (Fig. 2) and the relation class keywords (Fig. 3) that need to be crawled are compiled. The entity class keywords are the target objects to be crawled; the relation class keywords are the fields in the information frame on the right side of Wikipedia pages, which are collected and classified to serve as the basis for judging relations and forming triples.

Second, the related wiki pages are crawled according to the entity class keywords in Fig. 2. Take the person class as an example: the keyword "Li XX" is used as the head entity to crawl the data in the information frame on the right side of the related Wikipedia page, and the type of the head entity is person. Traversing the relation keywords shows that the "spouse" field appearing in the information frame belongs to the person-class keywords, so the entity corresponding to that field (Zhao XX), which serves as the tail entity, is also of type person. That is, (person, Li XX, spouse, Zhao XX, person).

In the same way, the following can be extracted: a "political party relationship" (person, Li XX, political party, XX Party, political party); an "educational relationship" (person, Li XX, alma mater, XX University, school); a "nationality relationship" (person, Zhong XX, nationality, China, country); and so on.
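The typed tuples above can be produced by a simple lookup from infobox field to tail entity type. The sketch below assumes an illustrative grouping of relation keywords (the real lists are those of Fig. 3, which the text does not reproduce) and a hypothetical infobox dictionary.

```python
# Relation keywords grouped by the entity type of the field's value.
# This grouping is illustrative only.
RELATION_KEYWORDS = {
    "person": ["spouse", "father", "children"],
    "political party": ["political party"],
    "school": ["alma mater"],
    "country": ["nationality"],
}

# Invert the grouping once so each field can be looked up directly.
FIELD_TO_TYPE = {field: etype
                 for etype, fields in RELATION_KEYWORDS.items()
                 for field in fields}


def typed_tuples(head, head_type, infobox):
    """Yield (head type, head, relation, tail, tail type) for recognised fields."""
    for field, tail in infobox.items():
        tail_type = FIELD_TO_TYPE.get(field)
        if tail_type is not None:
            yield (head_type, head, field, tail, tail_type)


infobox = {"spouse": "Zhao XX", "political party": "XX Party",
           "alma mater": "XX University", "signature": "(image)"}
tuples = list(typed_tuples("Li XX", "person", infobox))
for t in tuples:
    print(t)
```

Fields that are not relation keywords, such as "signature" here, produce no typed tuple in this sketch.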

Third, according to the obtained triple information, sentences matching the triples are searched for in the Wikipedia text, and these sentences are saved as a training set. For example, the triples (Li XX, political party, XX Party) and (Li XX, native place, Hebei XX) correspond to the sentence:

"Li XX (X month X day, XX year - X month X day, XX year), courtesy name XX, a native of Hebei XX, was one of the main founders of the XX Party."

The index, type and relation of the head and tail entities corresponding to each triple are stored in the corresponding format by a program (Fig. 4 (partial interface), Fig. 5 (partial interface)), completing the extraction and pre-labeling. It should be noted that the "relationships" field of the pre-labeled output does not necessarily contain the relations of all labeled entities. For example, the labeled entities may have the relation ("said: 0", political party, "said: 3"), namely (Li XX, political party, XX Party), while the "relationships" field has no "political party" relation class. This is because the algorithm labels as many entities as possible: if the algorithm can judge the relation between the entities, the relation is recorded in the "relationships" field; if it cannot, the relation is left to be judged manually in the next step. This reduces the manual workload as much as possible, so that manual effort can focus on judging the relations between entities.

If the whole wiki text contains no sentence that fully corresponds to the triple, the triple is skipped, and wiki crawling and matching for the next entity class keyword begins.

By repeating this cycle, entity relation triples can be extracted quickly, automatically and accurately, which realizes the pre-labeling algorithm.

As shown in fig. 6, the entity relationship extraction and pre-labeling system provided in the embodiment of the present disclosure includes:

the keyword acquisition module 1 is used for acquiring entity keywords and relationship keywords;

the structured data acquisition module 2 is used for capturing the structured data in the right information frame of the relevant wiki webpage according to the entity key words as head entities;

a triple composition module 3, configured to compose each triple from a head entity and a tail entity corresponding to each field (used for determining a relationship class) in the information frame;

the triple head and tail entity matching module 4 is used for matching the corresponding sentences in the original wiki text according to the head and tail entities of each triple, matching each field against the relation keywords if the matching succeeds, and returning to execute the structured data acquisition module if the matching fails;

and the field and relation keyword matching module 5 is used for matching the field corresponding to the relation of each triple against the relation keywords. If the matching succeeds, the relation of the triple is judged according to the matched relation class, completing the extraction and pre-labeling of the triple and the determination of its entity types and relation; if the matching fails, all entities are retained and the head entity or the tail entity is pre-labeled, and when the relation between the entities cannot be judged from the relation keywords, the judgment of the relation is handed over to manual work.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure should be limited only by the attached claims.

Claims

1. A method for extracting and pre-labeling entity relationships, characterized in that the method comprises:

automatically extracting the required triples from the structured data in the information frame on the right side of the Wikipedia page, and matching the sentences corresponding to the triples.

2. The method for extracting and pre-labeling entity relationships according to claim 1, wherein the method for extracting and pre-labeling entity relationships specifically comprises the following steps:

step one, acquiring entity keywords and relationship keywords;

3. The method for extracting and pre-labeling entity relationships according to claim 2, wherein in the first step, the entity class keyword is a target object to be crawled; the relation key words are fields in the right information frame of the Wikipedia, and the fields are collected and classified as the basis for judging the relation and the type of the tail entity.

4. The method for extracting and pre-labeling entity relationships according to claim 2, wherein in the second step, the obtained entity-type keywords are used as the head entities, structured data in the information frame on the right side of the relevant wiki page is crawled, traversal is performed in the relation-type keywords, the tail entities corresponding to the head entities are obtained, and the type of the tail entities is judged according to the relation-type keywords.

5. The method for extracting and pre-labeling entity relationships according to claim 2, wherein in the fourth step, after the corresponding sentences in the original wiki text are matched according to the head and tail entities of the triples, the sentences are saved as a training set.

6. The method for extracting and pre-labeling entity relationships according to claim 2, wherein in the fifth step, after matching the fields corresponding to the relationship of each triplet with the relationship keywords, the index, type and relationship of the head and tail entities corresponding to each triplet are stored into a corresponding format by a program, thereby completing the extraction and pre-labeling.

7. The method for extracting and pre-labeling entity relationships according to claim 6, wherein if there is no sentence corresponding to the complete triplet in the whole wiki text, the process skips, starts to crawl and match the next entity data, and repeats this cycle until all the entity-class keywords are processed.

8. An entity relationship extraction and pre-labeling system for implementing the entity relationship extraction and pre-labeling method of any one of claims 1 to 7, wherein the entity relationship extraction and pre-labeling system comprises:

the structured data acquisition module is used for capturing the structured data in the right information frame of the relevant wiki webpage according to the entity key words as head entities;

the triple composition module is used for forming each triple by a head entity and a tail entity corresponding to each field in the information frame;

each field and relation key word matching module is used for matching each field corresponding to the relation of each triple with the relation key word; if the matching is successful, judging the relation of the triples according to the matched relation class, and finishing the marking of the head and tail entities of the triples and the determination of the relation; if the matching is unsuccessful, all entities are reserved, the head entity or the tail entity is labeled, and when the relation between the entities cannot be judged by the relation key words, the judgment of the relation is handed to manual work.

9. A program storage medium for receiving user input, the stored computer program causing an electronic device to perform the method of extracting and pre-annotating entity relationships of any one of claims 1 to 7.

10. An information data processing terminal, characterized in that the information data processing terminal comprises a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the method for extracting and pre-labeling entity relationships according to any one of claims 1 to 7.