CN113553854A - Entity relation joint extraction method and device

Info

Publication number: CN113553854A
Application number: CN202111096807.6A
Authority: CN
Inventors: 经小川; 刘萱; 杜婉茹; 王潇茵; 李瑞群
Original assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Current assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2021-10-26
Anticipated expiration: 2041-09-18
Also published as: CN113553854B

Abstract

A joint extraction method and a joint extraction device for entity relationships are disclosed. The joint extraction method comprises the following steps: acquiring text data; acquiring a first feature sequence of the text data based on a preset model, wherein the first feature sequence comprises a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector comprises a plurality of first feature elements; mapping each first feature vector into a mutually exclusive binary cross tag based on the first feature sequence, and combining all the mutually exclusive binary cross tags into a mutually exclusive binary cross tag set; and performing joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross tag set. The joint extraction method not only reduces error propagation in relation extraction, but can also effectively address the problem of overlapping entity relationships.

Description

Entity relation joint extraction method and device

Technical Field

The present disclosure relates generally to the field of natural language processing, and more particularly, to a method and an apparatus for extracting entity relationships jointly.

Background

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. One of its basic research topics is Information Extraction (IE), the process of extracting various types of information, such as entities, relations, and events, from natural language text and forming structured data. The most basic task of information extraction is named entity recognition, and its core is the extraction of entity relations, namely Relation Extraction (RE).

An entity relationship is typically formalized as a relationship triple T consisting of two entities E1 and E2 and a relationship Rs between them: T = <E1, Rs, E2>, e.g., <Beijing, capital, China>. Relation extraction aims to extract entities of specific types, and the relations between entity pairs, from unstructured natural language text, and serves as a basis and data source for downstream tasks such as knowledge graph construction.

In early studies, relation extraction usually employed a pipelined approach: entity recognition was first performed using a named entity recognition module, and relation classification was then performed on each entity pair using a relation classification module. However, whether a pipeline method or a similar segmented model is adopted, errors generated in an earlier stage cannot be corrected in a subsequent stage; that is, error propagation is widespread. To solve this problem, recent research performs joint learning of entity recognition and relation classification, extracting and exploiting the association information between entities and relations with a joint extraction method.

The overlapping entity relationship problem refers to a situation in which multiple triples in one sentence share one or two entities. It is mainly divided into two categories: Entity Pair Overlap (EPO) and Single Entity Overlap (SEO). For example, the sentence "I was born in Beijing, the capital of China" contains three relationship triples: <Beijing, capital, China>, <I, place of birth, Beijing>, and <I, place of birth, China>, wherein <Beijing, capital, China> and <I, place of birth, Beijing> share the entity "Beijing", and <I, place of birth, Beijing> and <I, place of birth, China> share the entity "I". However, existing joint extraction methods cannot effectively solve the problem of overlapping entity relationships, and cannot extract all of the relationship triples that share entities.
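
The two overlap categories can be illustrated with a short Python sketch. It assumes the usual reading of the terms, namely that two triples exhibit EPO when they share both entities and SEO when they share exactly one; the function name `overlap_type` and the tuple layout are illustrative and are not part of the disclosure.

```python
from itertools import combinations

# Triples are (subject, relation, object) tuples taken from the example sentence.
triples = [
    ("Beijing", "capital", "China"),
    ("I", "place of birth", "Beijing"),
    ("I", "place of birth", "China"),
]


def overlap_type(t1, t2):
    """Classify how two triples overlap (assumed definitions, see above)."""
    shared = {t1[0], t1[2]} & {t2[0], t2[2]}
    if len(shared) == 2:
        return "EPO"  # both entities are shared
    if len(shared) == 1:
        return "SEO"  # exactly one entity is shared
    return None       # no shared entity


for a, b in combinations(triples, 2):
    print(a, b, overlap_type(a, b))
```

Under these assumed definitions, every pair of triples in the example sentence is a case of SEO; an EPO case would arise if two different relations held between the same two entities.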

Disclosure of Invention

The present disclosure provides a joint extraction method and a joint extraction device for entity relationships, so as to solve the problem of overlapping entity relationships while reducing propagation errors.

In one general aspect, there is provided a joint extraction method of entity relationships, the joint extraction method including: acquiring text data; acquiring a first feature sequence of the text data based on a preset model, wherein the first feature sequence comprises a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector comprises a plurality of first feature elements; mapping each first feature vector into a mutually exclusive binary cross tag based on the first feature sequence, and combining all the mutually exclusive binary cross tags into a mutually exclusive binary cross tag set; and performing joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross tag set.

Optionally, each first feature vector comprises a second number of first feature elements determined based on a first number of predefined predicates in the preset model.

Optionally, the step of mapping each first feature vector to a mutually exclusive binary cross tag based on the first feature sequence comprises: for any one first feature vector, comparing each first feature element of the first feature vector with a first preset threshold; when the first feature element is greater than the first preset threshold, reassigning the first feature element to 1; when the first feature element is less than or equal to the first preset threshold, reassigning the first feature element to 0; and mapping the first feature vector into a mutually exclusive binary cross tag based on the reassigned first feature elements.

Optionally, the step of jointly extracting the entity relationships of the text data based on the mutually exclusive binary cross tag set includes: determining position information of the value 1 in each mutually exclusive binary cross tag; one-dimensionalizing each mutually exclusive binary cross tag based on the position information of the value 1; determining a second feature sequence based on the second feature elements obtained by one-dimensionalizing each mutually exclusive binary cross tag; and performing joint extraction on the entity relationships of the text data based on the second feature sequence.

Optionally, the step of jointly extracting the entity relationships of the text data based on the second feature sequence includes: comparing the second feature element with a second preset threshold; determining that the second feature element is an invalid character element based on the second feature element being less than the second preset threshold; determining that the second feature element is an entity middle character element based on the second feature element being equal to the second preset threshold; determining that the second feature element is an entity head-tail character element based on the second feature element being greater than the second preset threshold; and performing joint extraction on the entity relationships of the text data based on the entity head-tail character elements.

Optionally, the entity head-tail character elements include an entity first character element and an entity tail character element, wherein the entity first character element includes a subject entity first character element and an object entity first character element, and the entity tail character element includes a subject entity tail character element and an object entity tail character element.

Optionally, the step of jointly extracting the entity relationships of the text data based on the entity head-tail character elements includes: determining that the entity head-tail character element is the subject entity first character element based on the entity head-tail character element being not greater than a first sum of the second preset threshold and the first number; determining that the entity head-tail character element is the object entity first character element based on the entity head-tail character element being greater than the first sum and not greater than a second sum of the first sum and the first number; determining that the entity head-tail character element is the subject entity tail character element based on the entity head-tail character element being greater than the second sum and not greater than a third sum of the second sum and the first number; and determining that the entity head-tail character element is the object entity tail character element based on the entity head-tail character element being greater than the third sum and not greater than a fourth sum of the third sum and the first number.

Optionally, the step of jointly extracting the entity relationships of the text data based on the entity head-tail character elements further includes: matching an adjacent entity first character element and entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the entities of the text data; and determining that the entities corresponding to adjacent entity first character elements are an entity pair with the same predicate relation based on the difference between the adjacent entity first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

Optionally, the step of jointly extracting the entity relationships of the text data based on the entity head-tail character elements further includes: matching an adjacent subject entity first character element and subject entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the subject entities of the text data; matching an adjacent object entity first character element and object entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the object entities of the text data; and determining that the entity corresponding to a subject entity first character element and the entity corresponding to an adjacent object entity first character element are an entity pair with the same predicate relation based on the difference between the two first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

In another general aspect, there is provided a joint extraction apparatus for entity relationships, the joint extraction apparatus comprising: a data unit configured to acquire text data; an encoding unit configured to obtain a first feature sequence of the text data based on a preset model, wherein the first feature sequence comprises a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector comprises a plurality of first feature elements; a mapping unit configured to map each first feature vector into a mutually exclusive binary cross tag based on the first feature sequence, and combine all the mutually exclusive binary cross tags into a mutually exclusive binary cross tag set; and an extraction unit configured to perform joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross tag set.

Optionally, the mapping unit is configured to: for any one first feature vector, compare each first feature element of the first feature vector with a first preset threshold; when the first feature element is greater than the first preset threshold, reassign the first feature element to 1; when the first feature element is less than or equal to the first preset threshold, reassign the first feature element to 0; and map the first feature vector into a mutually exclusive binary cross tag based on the reassigned first feature elements.

Optionally, the extraction unit is configured to: determine position information of the value 1 in each mutually exclusive binary cross tag; one-dimensionalize each mutually exclusive binary cross tag based on the position information of the value 1; determine a second feature sequence based on the second feature elements obtained by one-dimensionalizing each mutually exclusive binary cross tag; and perform joint extraction on the entity relationships of the text data based on the second feature sequence.

Optionally, the extraction unit is configured to: compare the second feature element with a second preset threshold; determine that the second feature element is an invalid character element based on the second feature element being less than the second preset threshold; determine that the second feature element is an entity middle character element based on the second feature element being equal to the second preset threshold; determine that the second feature element is an entity head-tail character element based on the second feature element being greater than the second preset threshold; and perform joint extraction on the entity relationships of the text data based on the entity head-tail character elements.

Optionally, the extraction unit is configured to: determine that the entity head-tail character element is the subject entity first character element based on the entity head-tail character element being not greater than a first sum of the second preset threshold and the first number; determine that the entity head-tail character element is the object entity first character element based on the entity head-tail character element being greater than the first sum and not greater than a second sum of the first sum and the first number; determine that the entity head-tail character element is the subject entity tail character element based on the entity head-tail character element being greater than the second sum and not greater than a third sum of the second sum and the first number; and determine that the entity head-tail character element is the object entity tail character element based on the entity head-tail character element being greater than the third sum and not greater than a fourth sum of the third sum and the first number.

Optionally, the extraction unit is configured to: match an adjacent entity first character element and entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the entities of the text data; and determine that the entities corresponding to adjacent entity first character elements are an entity pair with the same predicate relation based on the difference between the adjacent entity first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

Optionally, the extraction unit is configured to: match an adjacent subject entity first character element and subject entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the subject entities of the text data; match an adjacent object entity first character element and object entity tail character element with each other based on the difference between them being 2 times the first number, thereby extracting the object entities of the text data; and determine that the entity corresponding to a subject entity first character element and the entity corresponding to an adjacent object entity first character element are an entity pair with the same predicate relation based on the difference between the two first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

In another general aspect, there is provided a computer readable storage medium storing a computer program, which when executed by a processor implements the method for joint extraction of entity relationships as described above.

In another general aspect, there is provided a computing device, comprising: a processor; and a memory storing a computer program which, when executed by the processor, implements the method of joint extraction of entity relationships as described above.

According to the joint extraction method and the joint extraction device for entity relationships of the embodiments of the present disclosure, entity pairs and their relationships can be jointly extracted from unstructured text, which reduces error propagation in relation extraction. In addition, each character in the natural language text to be processed is converted into a mutually exclusive binary cross tag, so that the complex overlapping entity extraction problem is converted into a tag prediction problem; the problem of overlapping entity relationships can thus be effectively solved, and the performance of relation extraction is improved.

Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.

Drawings

The above and other objects and features of the embodiments of the present disclosure will become more apparent from the following description when taken in conjunction with the accompanying drawings showing the embodiments, in which:

Fig. 1 is a flowchart illustrating a joint extraction method of entity relationships according to an embodiment of the present disclosure.

Fig. 2 is a flowchart illustrating step S103 in fig. 1 according to an embodiment of the present disclosure.

Fig. 3 is a flowchart illustrating step S104 in fig. 1 according to an embodiment of the present disclosure.

Fig. 4 is a flowchart illustrating step S304 in fig. 3 according to an embodiment of the present disclosure.

Fig. 5 is a flowchart illustrating step S405 in fig. 4 according to an embodiment of the present disclosure.

Fig. 6 is a diagram illustrating mutually exclusive binary cross tags according to an embodiment of the present disclosure.

Fig. 7 is a block diagram illustrating a joint extraction device for entity relationships according to an embodiment of the present disclosure.

Fig. 8 is a block diagram illustrating a computing device according to an embodiment of the present disclosure.

Detailed Description

The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, devices, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatus, and/or systems described herein will be apparent to those skilled in the art after reviewing the disclosure of the present application. For example, the order of operations described herein is merely an example, and is not limited to those set forth herein, but may be changed as will become apparent after understanding the disclosure of the present application, except to the extent that operations must occur in a particular order. Moreover, descriptions of features known in the art may be omitted for clarity and conciseness.

The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways to implement the methods, devices, and/or systems described herein, which will be apparent after understanding the disclosure of the present application.

As used herein, the term "and/or" includes any one of the associated listed items and any combination of any two or more.

Although terms such as "first", "second", and "third" may be used herein to describe various elements, components, regions, layers or sections, these elements, components, regions, layers or sections should not be limited by these terms. Rather, these terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section referred to in the examples described herein could also be referred to as a second element, component, region, layer or section without departing from the teachings of the examples.

In the specification, when an element (such as a layer, region or substrate) is described as being "on," "connected to" or "coupled to" another element, it can be directly on, connected to or coupled to the other element or one or more other elements may be present therebetween. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there may be no intervening elements present.

The terminology used herein is for the purpose of describing various examples only and is not intended to be limiting of the disclosure. The singular is also intended to include the plural unless the context clearly indicates otherwise. The terms "comprises," "comprising," and "having" specify the presence of stated features, quantities, operations, elements, components, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and/or combinations thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs after understanding the present disclosure. Unless explicitly defined as such herein, terms (such as those defined in general dictionaries) should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and should not be interpreted in an idealized or overly formal sense.

Further, in the description of the examples, detailed descriptions of well-known related structures or functions are omitted when they would obscure the present disclosure.

According to the joint extraction method and the joint extraction device for entity relationships of the embodiments of the present disclosure, entity pairs and their relationships can be jointly extracted from unstructured text. By converting each character in the natural language text to be processed into a mutually exclusive binary cross tag, the complex overlapping entity extraction problem is converted into a tag prediction problem, so that error propagation in relation extraction is reduced, the problem of overlapping entity relationships can be effectively solved, and relation extraction performance is greatly improved.

A joint extraction method and a joint extraction apparatus of entity relationships according to an embodiment of the present disclosure are described in detail below with reference to fig. 1 to 8.

FIG. 1 is a flow chart of a method of joint extraction of entity relationships according to an embodiment of the present disclosure. The method for jointly extracting entity relationships according to the embodiment of the present disclosure can be implemented in a computing device with sufficient computing power.

Referring to fig. 1, in step S101, text data may be acquired. Here, the text data may be unstructured natural language text.

Next, in step S102, a first feature sequence of the text data may be obtained based on a preset model. Here, the preset model may be a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. The first feature sequence may be output by the BERT model when the text data is input into it. The first feature sequence includes a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector includes a plurality of first feature elements. Further, each first feature vector includes a second number of first feature elements, the second number being determined based on the first number N of predefined predicates in the preset model. Specifically, the number of first feature vectors corresponding to each character can be determined by the number of predicate relationships that the BERT model predicts for that character; in addition, the second number of first feature elements included in each first feature vector may be 4N + 2, or may be set by those skilled in the art according to the actual situation.

More specifically, the parameters of the pre-trained BERT model can be adjusted by the first number of the predefined predicates in the fine tuning stage, so that each first feature vector in the first feature sequence output by the BERT model comprises the second number of first feature elements.

Next, in step S103, each first feature vector may be mapped into a mutually exclusive binary cross tag based on the first feature sequence, and all the mutually exclusive binary cross tags may be combined into a mutually exclusive binary cross tag set. Step S103 in fig. 1 according to an embodiment of the present disclosure is described below with reference to fig. 2.

Referring to fig. 2, in step S201, each first feature element of any one first feature vector may be compared with a first preset threshold. Here, the first preset threshold may be 1, or may be set by those skilled in the art according to actual situations.

Next, in step S202, when the first feature element is greater than the first preset threshold, the first feature element is reassigned to 1.

In step S203, when the first feature element is smaller than or equal to the first preset threshold, the first feature element is reassigned to 0.

In step S204, the first feature vector is mapped to a mutually exclusive binary cross tag based on the reassigned first feature element. Further, all mutually exclusive binary cross-tags may be combined into a mutually exclusive binary cross-tag set.

Specifically, the mutually exclusive binary cross tag may be a second feature vector consisting of 0s and 1s, where the dimension of the second feature vector may be the second number (i.e., 4N + 2 as described above). Further, after each first feature vector is mapped to a mutually exclusive binary cross tag, each character in the text data may correspond to at least one mutually exclusive binary cross tag.

For example, a first feature vector in the first feature sequence output by the BERT model may contain one first feature element equal to 8.65, with all remaining first feature elements equal to 0.1. With the first preset threshold set to 1, each first feature element of the first feature vector is compared with the first preset threshold: the first feature element 8.65 is reassigned to 1, and the remaining first feature elements 0.1 are reassigned to 0. The result is the second feature vector composed of 0s and 1s that corresponds to the first feature vector, i.e., the mutually exclusive binary cross tag.
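
The thresholding of steps S201 to S204 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `to_cross_tag`, the choice N = 2, and the position of the element 8.65 are all assumptions made for the example.

```python
def to_cross_tag(first_feature_vector, first_threshold=1.0):
    """Reassign each element to 1 if it exceeds the threshold, else to 0."""
    return [1 if x > first_threshold else 0 for x in first_feature_vector]


N = 2                         # hypothetical number of predefined predicates
vector = [0.1] * (4 * N + 2)  # 4N + 2 first feature elements
vector[3] = 8.65              # the position of the large element is arbitrary here

tag = to_cross_tag(vector)
print(tag)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Note that an element exactly equal to the threshold is mapped to 0, matching the "smaller than or equal to" condition of step S203.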

Referring back to fig. 1, in step S104, the entity relationship of the text data may be jointly extracted based on the mutually exclusive binary cross tag set. Step S104 in fig. 1 according to an embodiment of the present disclosure is described below with reference to fig. 3.

Referring to fig. 3, in step S301, position information of the value 1 in each mutually exclusive binary cross tag may be determined. Here, the position information of the value 1 indicates at which position of the mutually exclusive binary cross tag the value 1 is located, counted from zero. For example, when the value 1 is located at the first position of the mutually exclusive binary cross tag, the position information of the value 1 may be 0; when the value 1 is located at the second position, the position information may be 1; and when the value 1 is located at the (N + 1)-th position, the position information may be N.

Next, in step S302, each mutually exclusive binary cross tag may be one-dimensionalized based on the position information of the value 1. Here, as described above, the dimension of the mutually exclusive binary cross tag may be the second number; by one-dimensionalizing each mutually exclusive binary cross tag, matching of the tags can be performed more easily.

In step S303, a second feature sequence may be determined based on the second feature elements obtained by one-dimensionalizing each mutually exclusive binary cross tag. Here, the second feature element may be the position information of the value 1 as described above. Further, after the second feature elements are obtained by one-dimensionalizing each mutually exclusive binary cross tag, each character in the text data may correspond to at least one second feature element.
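
Steps S301 to S303 amount to replacing each tag by the zero-based index of its single 1. The sketch below illustrates this under the assumption that every tag contains exactly one 1; the disclosure does not state how other cases are handled, so the sketch simply rejects them. The names `position_of_one` and `second_feature_sequence` are hypothetical.

```python
def position_of_one(cross_tag):
    """Return the zero-based position of the single 1 in a cross tag."""
    if cross_tag.count(1) != 1:
        # Assumption: a valid tag holds exactly one 1; other cases are
        # not described in the text and are rejected here.
        raise ValueError("expected exactly one element equal to 1")
    return cross_tag.index(1)


def second_feature_sequence(tags_per_character):
    """Map the cross tags of every character to second feature elements."""
    return [[position_of_one(tag) for tag in tags] for tags in tags_per_character]


# Two characters with N = 1 (tags of dimension 4N + 2 = 6);
# the second character carries two tags.
tags_per_character = [
    [[1, 0, 0, 0, 0, 0]],
    [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]],
]
sequence = second_feature_sequence(tags_per_character)
print(sequence)  # [[0], [2, 3]]
```

Keeping a list of elements per character reflects the statement that each character may correspond to more than one second feature element.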

In step S304, the entity relationship of the text data may be jointly extracted based on the second feature sequence. Step S304 in fig. 3 according to an embodiment of the present disclosure is described below with reference to fig. 4.

Referring to fig. 4, in step S401, the second feature element may be compared with a second preset threshold. Here, the second preset threshold may be 1, or may be set by those skilled in the art according to actual situations.

Next, in step S402, it may be determined that the second feature element is an invalid character element based on that the second feature element is smaller than a second preset threshold. Here, the invalid character elements may correspond to characters in the text data that are not related to the relational triples to be extracted.

In step S403, it may be determined that the second feature element is an entity middle character element based on the second feature element being equal to a second preset threshold.

In step S404, it may be determined that the second feature element is an entity head-tail character element based on the second feature element being greater than the second preset threshold. Here, the entity head-tail character elements may include an entity first character element and an entity tail character element. Further, the entity first character element may include a subject entity first character element and an object entity first character element, and the entity tail character element may include a subject entity tail character element and an object entity tail character element.
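
The three-way split of steps S402 to S404 can be written as a small Python function. This is only a sketch with the second preset threshold set to 1, as suggested above; the function name `coarse_type` and the returned labels are illustrative.

```python
def coarse_type(second_feature_element, second_threshold=1):
    """Classify a second feature element relative to the second threshold."""
    if second_feature_element < second_threshold:
        return "invalid"           # character unrelated to any triple
    if second_feature_element == second_threshold:
        return "entity_middle"     # character inside an entity
    return "entity_head_tail"      # first or tail character of an entity


print([coarse_type(e) for e in [0, 1, 2, 7]])
# ['invalid', 'entity_middle', 'entity_head_tail', 'entity_head_tail']
```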

Next, in step S405, entity relationships of the text data may be jointly extracted based on the entity head and tail character elements. Step S405 in fig. 4 according to an embodiment of the present disclosure is described below with reference to fig. 5.

Referring to fig. 5, in step S501, it may be determined that the entity head-tail character element is a subject entity first character element based on the entity head-tail character element being not greater than a first sum of the second preset threshold and the first number. Specifically, the first sum may be obtained by summing the second preset threshold and the first number. Further, when the second preset threshold is 1, the first sum may be N + 1.

Next, in step S502, it may be determined that the entity head-tail character element is an object entity first character element based on the entity head-tail character element being greater than the first sum and not greater than a second sum of the first sum and the first number. Specifically, the second sum may be obtained by summing the first sum and the first number. Further, the second sum may be 2N + 1.

In step S503, it may be determined that the entity head-tail character element is a subject entity tail character element based on the entity head-tail character element being greater than the second sum and not greater than a third sum of the second sum and the first number. Specifically, the third sum may be obtained by summing the second sum and the first number. Further, the third sum may be 3N + 1.

In step S504, it may be determined that the entity head-tail character element is an object entity tail character element based on the entity head-tail character element being greater than the third sum and not greater than a fourth sum of the third sum and the first number. Specifically, the fourth sum may be obtained by summing the third sum and the first number. Further, the fourth sum may be 4N + 1.
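As a non-limiting illustration, the four range checks of steps S501 to S504 may be sketched as follows. The function name `head_tail_role` and the role labels are hypothetical; the second preset threshold is assumed to be 1, and N = 49 is taken from the examples of this description.

```python
def head_tail_role(value: int, n: int, threshold: int = 1) -> str:
    """Role of an entity head-tail character element (steps S501 to S504)."""
    first_sum = threshold + n      # N + 1 when the threshold is 1
    second_sum = first_sum + n     # 2N + 1
    third_sum = second_sum + n     # 3N + 1
    fourth_sum = third_sum + n     # 4N + 1
    if not threshold < value <= fourth_sum:
        raise ValueError(f"{value} is not an entity head-tail character element")
    if value <= first_sum:
        return "subject_first"
    if value <= second_sum:
        return "object_first"
    if value <= third_sum:
        return "subject_tail"
    return "object_tail"


N = 49  # first number of predefined predicates, as in the examples
roles = {v: head_tail_role(v, N) for v in (14, 63, 112, 161)}
```

With N = 49, the boundaries fall at 50, 99, 148 and 197, so the element 14 is a subject entity first character element and the element 161 is an object entity tail character element.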

Next, in step S505, adjacent entity first character elements and entity tail character elements may be matched with each other based on the difference between the adjacent entity first character element and entity tail character element being 2 times the first number, thereby extracting the entities of the text data. Here, "adjacent" with a difference of 2 times the first number carries two meanings: (1) the difference between an entity first character element and an entity tail character element is 2 times the first number; (2) the distance in the second feature sequence between that entity first character element and that entity tail character element is smaller than the distance between that entity first character element and any other entity tail character element whose difference from it is also 2 times the first number. Further, the entity first character elements and the entity tail character elements may be matched with each other by recursive matching.

Specifically, adjacent subject entity first character elements and subject entity tail character elements may be matched with each other based on the difference between the adjacent subject entity first character element and subject entity tail character element being 2 times the first number, thereby extracting the subject entities of the text data.

Meanwhile, adjacent object entity first character elements and object entity tail character elements may be matched with each other based on the difference between the adjacent object entity first character element and object entity tail character element being 2 times the first number, thereby extracting the object entities of the text data.
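As a non-limiting illustration, the matching of step S505 may be sketched as follows. The function name `match_entities` is hypothetical, and a simple forward scan to the nearest tail element is used here in place of the recursive matching mentioned above. The returned indices are positions of elements in the second feature sequence, which coincide with character positions only when every character has exactly one second feature element.

```python
def match_entities(sequence: list[int], n: int) -> list[tuple[int, int, int]]:
    """Pair each first character element with its nearest matching tail.

    Returns (first_index, tail_index, first_value) triples, where the tail
    value equals the first value plus 2N.
    """
    spans = []
    for i, value in enumerate(sequence):
        if 1 < value <= 2 * n + 1:          # subject or object first element
            target = value + 2 * n          # matching tail element
            for j in range(i + 1, len(sequence)):
                if sequence[j] == target:   # nearest following tail
                    spans.append((i, j, value))
                    break
    return spans


# Illustrative input only: the first nine elements of Example one, N = 49.
spans = match_entities([0, 14, 1, 112, 0, 0, 63, 1, 161], 49)
```

Here the span from index 1 to index 3 is a subject entity (14 and 112), and the span from index 6 to index 8 is an object entity (63 and 161).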

Next, in step S506, it may be determined that the entities corresponding to adjacent entity first character elements are an entity pair having the same predicate relationship based on the difference between the adjacent entity first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data. Here, the predicate relationship may be determined by looking up the predefined predicates. Further, "adjacent" with a difference of the first number carries two meanings: (1) the difference between one entity first character element and another entity first character element is the first number; (2) the distance in the second feature sequence between the one entity first character element and the other entity first character element is smaller than the distance between the one entity first character element and any further entity first character element whose difference from it is also the first number.

Specifically, it may be determined that the entity corresponding to a subject entity first character element and the entity corresponding to an adjacent object entity first character element are an entity pair having the same predicate relationship based on the difference between the adjacent subject entity first character element and object entity first character element being the first number, so that the entity relationships of the text data are jointly extracted.
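As a non-limiting illustration, the pairing of step S506 may be sketched as follows. The function name `pair_first_elements` is hypothetical, the second preset threshold is assumed to be 1, and "adjacent" is interpreted as the smallest absolute distance between element positions in the second feature sequence. The input is the second feature sequence of Example one described later, with N = 49.

```python
def pair_first_elements(sequence: list[int], n: int) -> list[tuple[int, int, int]]:
    """Pair each subject first element with its nearest object first element.

    Returns (subject_index, object_index, subject_value) triples, where the
    object value equals the subject value plus N.
    """
    pairs = []
    for i, value in enumerate(sequence):
        if 1 < value <= n + 1:  # subject entity first character element
            candidates = [j for j, w in enumerate(sequence) if w == value + n]
            if candidates:
                nearest = min(candidates, key=lambda j: abs(j - i))
                pairs.append((i, nearest, value))
    return pairs


# Second feature sequence of Example one (35 elements).
sequence = (
    [0, 14, 1, 112, 0, 0, 63, 1, 161]
    + [0] * 12
    + [14, 1, 1, 112, 0, 0, 63, 1, 161]
    + [0] * 5
)
pairs = pair_first_elements(sequence, 49)
```

Each subject entity first character element 14 is paired with the nearer of the two object entity first character elements 63, so the two sentences of the text data yield two separate entity pairs.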

Mutually exclusive binary cross-tags according to embodiments of the present disclosure are described in detail below with reference to FIG. 6. FIG. 6 is a diagram illustrating mutually exclusive binary cross-tags, according to an embodiment of the present disclosure.

Referring to fig. 6, each character of the subject entity "a certain title" and the object entity "Liu somebody" corresponds to a mutually exclusive binary cross-tag consisting of 0s and 1s.

Specifically, in a pre-trained BERT model, each character in the text data may be assigned at least one sequence tag, i.e., a BIEO (Begin, Intermediate, End, Other) tag, which indicates the position of the character in the corresponding entity.

For the mutually exclusive binary cross-tag with a dimension of 4N + 2 as shown in fig. 6, counting from right to left, the first bit may represent the O tag information of the BIEO tag, the second bit may represent the I tag information, the third to (2N + 2)-th bits may represent the B tag information, and the (2N + 3)-th to (4N + 2)-th bits may represent the E tag information.

Further, in the mutually exclusive binary cross-tag, when the value 1 is located at the first bit from right to left, the character corresponding to the mutually exclusive binary cross-tag may be an invalid character; when the value 1 is located at the second bit from right to left, the character may be an entity middle character; and when the value 1 is located at any of the third to (4N + 2)-th bits from right to left, the character may be an entity head-tail character.
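As a non-limiting illustration, the one-dimensionalization of a single mutually exclusive binary cross-tag may be sketched as follows. The function name `one_dimensionalize` is hypothetical. The second feature element is assumed to be the zero-based offset of the value 1 counted from the right, which is inferred from the examples of this description, where the O tag yields 0 and the I tag yields 1.

```python
def one_dimensionalize(tag: list[int]) -> int:
    """Map a tag holding exactly one 1 to the offset of that 1 from the right."""
    if sorted(tag) != [0] * (len(tag) - 1) + [1]:
        raise ValueError("a mutually exclusive tag holds exactly one 1")
    return len(tag) - 1 - tag.index(1)


N = 49
DIM = 4 * N + 2                    # 198 bits per tag

o_tag = [0] * (DIM - 1) + [1]      # value 1 at the first bit from the right
i_tag = [0] * (DIM - 2) + [1, 0]   # value 1 at the second bit from the right
b_tag = [0] * DIM
b_tag[DIM - 1 - 14] = 1            # value 1 at the fifteenth bit from the right

elements = [one_dimensionalize(t) for t in (o_tag, i_tag, b_tag)]
```

Under this assumption, the B tag bits (third to (2N + 2)-th) map to the elements 2 to 2N + 1, and the E tag bits map to the elements 2N + 2 to 4N + 1, which agrees with the ranges of steps S501 to S504.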

Referring back to fig. 3, step S304 as described above will be described in detail by the following example.

Example one:

According to the joint extraction method of entity relationships in the embodiment of the present disclosure, based on the text data "<A certain title> is a public-security-themed TV series directed by Liu somebody. The historical TV series <A certain clear sky> is directed by Li somebody.", a second feature sequence [0, 14, 1, 112, 0, 0, 63, 1, 161, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 1, 1, 112, 0, 0, 63, 1, 161, 0, 0, 0, 0, 0] is obtained. Here, the first number of predefined predicates is N = 49, the second feature element 0 represents an invalid character element, and the second feature element 1 represents an entity middle character element.

Next, in order from left to right, it may be determined that the second feature element 14 is a subject entity first character element, the second feature element 112 is a subject entity tail character element, the second feature element 63 is an object entity first character element, and the second feature element 161 is an object entity tail character element. Since the difference between the second feature element 14 and the second feature element 112 is 2N, matching the adjacent second feature element 14 and second feature element 112 yields the subject entities "a certain title" and "a certain clear sky". Similarly, the object entities "Liu somebody" and "Li somebody" can be obtained. Further, since the difference between the second feature element 14 and the second feature element 63 is N, matching the adjacent second feature element 14 and second feature element 63 determines that "a certain title" and "Liu somebody", as well as "a certain clear sky" and "Li somebody", are entity pairs having the same predicate relationship.

Here, the second feature element 14 corresponds to one of the predefined predicates, and a lookup shows that the second feature element 14 corresponds to "director" in the predefined predicates.

Therefore, according to the joint extraction method of entity relationships in the embodiment of the present disclosure, based on the text data "<A certain title> is a public-security-themed TV series directed by Liu somebody. The historical TV series <A certain clear sky> is directed by Li somebody.", the relation triples <a certain title, director, Liu somebody> and <a certain clear sky, director, Li somebody> are finally obtained.

Example two:

According to the joint extraction method of entity relationships in the embodiment of the present disclosure, based on the text data "<A certain line> is an eastern fantasy novel created by a certain autumn wind, a writer signed with a certain Chinese web under a certain literature", a second feature sequence [0, 25, 44, 1, 123, 142, 0, 0, 0, 0, 0, 0, 0, 74, 1, 1, 172, 0, 0, 0, 0, 0, 0, 93, 1, 1, 191, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] is obtained. Here, the first number of predefined predicates is N = 49, the second feature element 0 represents an invalid character element, and the second feature element 1 represents an entity middle character element.

Next, in order from left to right, it may be determined that the second feature element 25 and the second feature element 44 are subject entity first character elements, the second feature element 123 and the second feature element 142 are subject entity tail character elements, the second feature element 74 and the second feature element 93 are object entity first character elements, and the second feature element 172 and the second feature element 191 are object entity tail character elements. Since the difference between the second feature element 25 and the second feature element 123 is 2N, and the difference between the second feature element 44 and the second feature element 142 is also 2N, matching the adjacent second feature element 25 and second feature element 123, and the adjacent second feature element 44 and second feature element 142, yields the subject entity "a certain line". Similarly, the object entities "a certain Chinese web" and "a certain autumn wind" can be obtained. Further, since the difference between the second feature element 25 and the second feature element 74 is N, matching the adjacent second feature element 25 and second feature element 74 determines that "a certain line" and "a certain Chinese web" are an entity pair having the same predicate relationship. Similarly, the entity pair "a certain line" and "a certain autumn wind" may be determined.

Here, the second feature element 25 and the second feature element 44 correspond to two of the predefined predicates. A lookup shows that the second feature element 25 corresponds to "linked website" in the predefined predicates, and the second feature element 44 corresponds to "author" in the predefined predicates.

Therefore, according to the joint extraction method of entity relationships in the embodiment of the present disclosure, based on the text data "<A certain line> is an eastern fantasy novel created by a certain autumn wind, a writer signed with a certain Chinese web under a certain literature", the relation triples <a certain line, linked website, a certain Chinese web> and <a certain line, author, a certain autumn wind> are finally obtained, and the problem of overlapping entity relationships is effectively solved.
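As a non-limiting illustration of why the overlapping relationships of Example two remain separable, each entity head-tail character element may be split into a role and a predicate slot. The names `decode` and `group_by_slot`, the role labels, and the zero-based slot numbering are hypothetical; the second preset threshold is assumed to be 1, and the disclosure itself describes the ranges and differences rather than this division.

```python
ROLES = ("subject_first", "object_first", "subject_tail", "object_tail")


def decode(value: int, n: int) -> tuple[str, int]:
    """Split a head-tail element into (role, zero-based predicate slot)."""
    if not 1 < value <= 4 * n + 1:
        raise ValueError(f"{value} is not an entity head-tail character element")
    role_index, slot = divmod(value - 2, n)
    return ROLES[role_index], slot


def group_by_slot(sequence: list[int], n: int) -> dict[int, dict[str, int]]:
    """Collect the head-tail elements of a sequence per predicate slot."""
    groups: dict[int, dict[str, int]] = {}
    for value in sequence:
        if value > 1:
            role, slot = decode(value, n)
            groups.setdefault(slot, {})[role] = value
    return groups


# Second feature sequence of Example two (38 elements), N = 49.
sequence = (
    [0, 25, 44, 1, 123, 142] + [0] * 7
    + [74, 1, 1, 172] + [0] * 6
    + [93, 1, 1, 191] + [0] * 11
)
groups = group_by_slot(sequence, 49)
```

The elements 25, 74, 123 and 172 fall into one slot and the elements 44, 93, 142 and 191 into another, so the shared subject entity carries one first character element and one tail character element per predicate.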

According to the joint extraction method of entity relationships in the embodiments of the present disclosure, entity pairs and their relationships can be jointly extracted from unstructured text. By converting each character in the natural language text to be processed into a mutually exclusive binary cross-tag, the complicated overlapping entity extraction problem is converted into a tag prediction problem, so that the propagation error of relation extraction is reduced, the overlapping entity relationship problem can be effectively solved, and the performance of relation extraction is greatly improved.

FIG. 7 is a block diagram illustrating a joint extraction device of entity relationships according to an embodiment of the present disclosure. The joint extraction device of entity relationships according to embodiments of the present disclosure may be implemented in a computing device with sufficient computing power.

Referring to fig. 7, a joint extraction apparatus 700 of entity relationships according to an embodiment of the present disclosure may include a data unit 710, an encoding unit 720, a mapping unit 730, and an extraction unit 740.

The data unit 710 may acquire text data.

The encoding unit 720 may obtain a first feature sequence of the text data based on a preset model. Here, the first feature sequence includes a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector includes a plurality of first feature elements.

Optionally, each first feature vector comprises a second number of first feature elements determined based on the first number N of predefined predicates in the preset model.

The mapping unit 730 may map each first feature vector to a mutually exclusive binary cross-tag based on the first feature sequence, and combine all the mutually exclusive binary cross-tags into a mutually exclusive binary cross-tag set.

The mapping unit 730 may further compare each first feature element of any one first feature vector with a first preset threshold; when the first feature element is greater than the first preset threshold, reassign the first feature element to 1; when the first feature element is smaller than or equal to the first preset threshold, reassign the first feature element to 0; and map the first feature vector to a mutually exclusive binary cross-tag based on the reassigned first feature elements.
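As a non-limiting illustration, the reassignment performed by the mapping unit 730 may be sketched as follows. The function name `binarize` is hypothetical, and the first preset threshold of 0.5 is an assumption made only for this sketch; the vector shown is far shorter than an actual first feature vector.

```python
def binarize(vector: list[float], threshold: float = 0.5) -> list[int]:
    """Reassign each first feature element to 1 or 0 against the threshold."""
    # Strictly greater than the threshold gives 1; otherwise 0.
    return [1 if element > threshold else 0 for element in vector]


# Illustrative four-element vector; 0.5 is not greater than the threshold.
tag = binarize([0.1, 0.9, 0.5, 0.2])
```

An element equal to the first preset threshold is reassigned to 0, in line with the "smaller than or equal to" condition above.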

The extraction unit 740 may perform joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross-tag set.

The extraction unit 740 may determine the position information of the value 1 in each mutually exclusive binary cross-tag; one-dimensionalize each mutually exclusive binary cross-tag based on the position information of the value 1; determine a second feature sequence based on the second feature elements obtained by one-dimensionalizing each mutually exclusive binary cross-tag; and perform joint extraction on the entity relationships of the text data based on the second feature sequence.

The extraction unit 740 may compare the second feature element with a second preset threshold; determine that the second feature element is an invalid character element based on the second feature element being smaller than the second preset threshold; determine that the second feature element is an entity middle character element based on the second feature element being equal to the second preset threshold; determine that the second feature element is an entity head-tail character element based on the second feature element being greater than the second preset threshold; and perform joint extraction on the entity relationships of the text data based on the entity head-tail character elements.

Optionally, the entity head-tail character elements may include an entity first character element and an entity tail character element. Further, the entity first character element may include a subject entity first character element and an object entity first character element, and the entity tail character element may include a subject entity tail character element and an object entity tail character element.

The extraction unit 740 may determine that the entity head-tail character element is a subject entity first character element based on the entity head-tail character element being not greater than a first sum of the second preset threshold and the first number; determine that the entity head-tail character element is an object entity first character element based on the entity head-tail character element being greater than the first sum and not greater than a second sum of the first sum and the first number; determine that the entity head-tail character element is a subject entity tail character element based on the entity head-tail character element being greater than the second sum and not greater than a third sum of the second sum and the first number; and determine that the entity head-tail character element is an object entity tail character element based on the entity head-tail character element being greater than the third sum and not greater than a fourth sum of the third sum and the first number.

The extraction unit 740 may match adjacent entity first character elements and entity tail character elements with each other based on the difference between the adjacent entity first character element and entity tail character element being 2 times the first number, thereby extracting the entities of the text data; and determine that the entities corresponding to adjacent entity first character elements are an entity pair having the same predicate relationship based on the difference between the adjacent entity first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

The extraction unit 740 may further match adjacent subject entity first character elements and subject entity tail character elements with each other based on the difference between the adjacent subject entity first character element and subject entity tail character element being 2 times the first number, thereby extracting the subject entities of the text data; match adjacent object entity first character elements and object entity tail character elements with each other based on the difference between the adjacent object entity first character element and object entity tail character element being 2 times the first number, thereby extracting the object entities of the text data; and determine that the entity corresponding to a subject entity first character element and the entity corresponding to an adjacent object entity first character element are an entity pair having the same predicate relationship based on the difference between the adjacent subject entity first character element and object entity first character element being the first number, so as to perform joint extraction on the entity relationships of the text data.

Referring to fig. 8, a computing device 800 according to an embodiment of the disclosure may include a processor 810 and a memory 820. Processor 810 may include, but is not limited to, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a microcomputer, a Field Programmable Gate Array (FPGA), a system on a chip (SoC), a microprocessor, an Application Specific Integrated Circuit (ASIC), and the like. The memory 820 stores a computer program to be executed by the processor 810. The memory 820 includes high-speed random access memory and/or a non-volatile computer-readable storage medium. The joint extraction method of entity relationships as described above may be implemented when the processor 810 executes a computer program stored in the memory 820.

The joint extraction method of entity relationships according to embodiments of the present disclosure may be written as a computer program and stored on a computer-readable storage medium. The computer program, when executed by a processor, may implement the joint extraction method of entity relationships as described above. Examples of computer-readable storage media include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or compact disc memory, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an extreme digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store and provide a computer program and any associated data, data files, and data structures to a processor or computer in a non-transitory manner such that the processor or computer can execute the computer program. In one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

Although a few embodiments of the present disclosure have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the disclosure, the scope of which is defined in the claims and their equivalents.

Claims

1. A joint extraction method of entity relationships is characterized in that the joint extraction method comprises the following steps:

acquiring text data;

acquiring a first feature sequence of the text data based on a preset model, wherein the first feature sequence comprises a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector comprises a plurality of first feature elements;

mapping each first feature vector into a mutually exclusive binary cross-tag based on the first feature sequence, and combining all the mutually exclusive binary cross-tags into a mutually exclusive binary cross-tag set;

and performing joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross-tag set.

2. The joint extraction method of claim 1, wherein each first feature vector includes a second number of first feature elements determined based on a first number of predefined predicates in the preset model.

3. The joint extraction method of claim 2, wherein the step of mapping each first feature vector into a mutually exclusive binary cross-tag based on the first feature sequence comprises:

for any one first feature vector, comparing each first feature element of the first feature vector with a first preset threshold;

when the first feature element is greater than the first preset threshold, reassigning the first feature element to 1;

when the first feature element is smaller than or equal to the first preset threshold, reassigning the first feature element to 0;

and mapping the first feature vector into a mutually exclusive binary cross-tag based on the reassigned first feature elements.

4. The joint extraction method of claim 3, wherein the step of performing joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross-tag set comprises:

determining the position information of the value 1 in each mutually exclusive binary cross-tag;

one-dimensionalizing each mutually exclusive binary cross-tag based on the position information of the value 1;

determining a second feature sequence based on the second feature elements obtained by one-dimensionalizing each mutually exclusive binary cross-tag;

and performing joint extraction on the entity relationships of the text data based on the second feature sequence.

5. The joint extraction method according to claim 4, wherein the step of performing joint extraction on the entity relationship of the text data based on the second feature sequence comprises:

comparing the second feature element with a second preset threshold;

determining that the second feature element is an invalid character element based on the second feature element being smaller than the second preset threshold;

determining that the second feature element is an entity middle character element based on the second feature element being equal to the second preset threshold;

determining that the second feature element is an entity head-tail character element based on the second feature element being greater than the second preset threshold;

and performing joint extraction on the entity relationships of the text data based on the entity head-tail character elements.

6. The joint extraction method of claim 5, wherein the entity head-tail character elements include an entity first character element and an entity tail character element, wherein,

the entity first character element includes a subject entity first character element and an object entity first character element,

the entity tail character elements include a subject entity tail character element and an object entity tail character element.

7. The joint extraction method according to claim 6, wherein the step of performing joint extraction on the entity relationship of the text data based on the entity head-tail character elements comprises:

determining that the entity head-tail character element is the subject entity first character element based on the entity head-tail character element being not greater than a first sum of the second preset threshold and the first number;

determining that the entity head-tail character element is the object entity first character element based on the entity head-tail character element being greater than the first sum and not greater than a second sum of the first sum and the first number;

determining that the entity head-tail character element is the subject entity tail character element based on the entity head-tail character element being greater than the second sum and not greater than a third sum of the second sum and the first number;

determining that the entity head-tail character element is the object entity tail character element based on the entity head-tail character element being greater than the third sum and not greater than a fourth sum of the third sum and the first number.

8. The joint extraction method according to claim 7, wherein the step of performing joint extraction on the entity relationship of the text data based on the entity head and tail character elements further comprises:

matching adjacent entity first character elements and entity tail character elements with each other based on the difference between the adjacent entity first character element and entity tail character element being 2 times the first number, thereby extracting the entities of the text data;

and determining that the entities corresponding to adjacent entity first character elements are an entity pair having the same predicate relationship based on the difference between the adjacent entity first character elements being the first number, so as to perform joint extraction on the entity relationships of the text data.

9. The joint extraction method according to claim 7, wherein the step of performing joint extraction on the entity relationship of the text data based on the entity head and tail character elements further comprises:

matching adjacent subject entity first character elements and subject entity tail character elements with each other based on the difference between the adjacent subject entity first character element and subject entity tail character element being 2 times the first number, thereby extracting the subject entities of the text data;

matching adjacent object entity first character elements and object entity tail character elements with each other based on the difference between the adjacent object entity first character element and object entity tail character element being 2 times the first number, thereby extracting the object entities of the text data;

and determining that the entity corresponding to a subject entity first character element and the entity corresponding to an adjacent object entity first character element are an entity pair having the same predicate relationship based on the difference between the adjacent subject entity first character element and object entity first character element being the first number, so as to perform joint extraction on the entity relationships of the text data.

10. A joint extraction device of entity relationships, comprising:

a data unit configured to acquire text data;

an encoding unit configured to obtain a first feature sequence of the text data based on a preset model, wherein the first feature sequence comprises a plurality of first feature vectors, each character of the text data corresponds to at least one first feature vector, and each first feature vector comprises a plurality of first feature elements;

a mapping unit configured to map each first feature vector into a mutually exclusive binary cross-tag based on the first feature sequence, and combine all the mutually exclusive binary cross-tags into a mutually exclusive binary cross-tag set;

and an extraction unit configured to perform joint extraction on the entity relationships of the text data based on the mutually exclusive binary cross-tag set.

11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for joint extraction of entity relationships according to any one of claims 1 to 9.

12. A computing device, the computing device comprising:

a processor; and

a memory storing a computer program which, when executed by the processor, implements the joint extraction method of entity relationships of any one of claims 1 to 9.