CN114564563A - End-to-end entity relationship joint extraction method and system based on relationship decomposition - Google Patents

End-to-end entity relationship joint extraction method and system based on relationship decomposition

Info

Publication number
CN114564563A
Authority
CN
China
Prior art keywords
entity
relationship
sentence
relation
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210166252.6A
Other languages
Chinese (zh)
Inventor
张璇
高宸
杜鲲鹏
农琼
马秋颖
袁子豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202210166252.6A priority Critical patent/CN114564563A/en
Publication of CN114564563A publication Critical patent/CN114564563A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses an end-to-end entity relationship joint extraction method based on relationship decomposition, characterized by comprising the following steps. Data preprocessing: the entity and relation triples annotated in the training set are converted into vector form according to the dictionary of the BERT model. Model training: relation classification is performed on the text vectors output by the BERT model, and the resulting relation features are fused with sentence features for head- and tail-entity recognition. Result decoding: the entity tags recognized under the different relation categories are decoded and combined with the relations to obtain the entity-relation triples present in the sentence. By modeling sentence features separately under different relations, the method effectively handles the extraction of overlapping triples within a sentence, improves the performance of joint entity-relation extraction, and has good practicability.

Description

End-to-end entity relationship joint extraction method and system based on relationship decomposition
Technical Field
The invention relates to deep learning and natural language processing technologies, in particular to an end-to-end entity relationship joint extraction method and system based on relationship decomposition.
Background
Triple extraction, also called entity relation extraction, is an important component of information extraction: it acquires structured knowledge in the form (head entity, relation, tail entity) from unstructured text. It is one of the key tasks in constructing knowledge graphs and an important basis for other natural language processing tasks, such as machine translation, text summarization, and recommendation systems.
Early extraction methods mostly performed entity relation extraction in a pipeline fashion, treating the task as two independent subtasks: named entity recognition and relation classification. This approach is flexible and simplifies the processing flow, but it also has drawbacks, including error accumulation, entity redundancy, and lack of interaction between the subtasks.
To overcome the shortcomings of the pipeline approach, joint entity-relation extraction uses a single model to extract entities and relations simultaneously. Most early joint methods were feature-based models that require complex preprocessing and rely on feature-extraction tools; the process is not only cumbersome but also prone to introducing additional errors.
To reduce manual feature engineering, neural networks were introduced for end-to-end joint entity-relation extraction, in two families: joint decoding methods and parameter sharing methods. Joint decoding methods adopt a new tagging strategy that labels entities and relations uniformly, turning the original joint learning problem over the two subtasks of named entity recognition and relation classification into a sequence labeling problem. Parameter sharing methods perform joint learning by sharing the encoding-layer parameters of a joint model, so that the two subtasks can depend on each other. End-to-end joint extraction exploits the interaction between entities and relations, extracting entities and classifying the relations of entity pairs at the same time, and largely resolves the problems brought by the pipeline approach. However, conventional joint extraction schemes only consider extracting a single triple per sentence. In practice, as shown in fig. 4, a sentence often contains multiple triples, and these triples may share overlapping entities and relations.
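The overlap cases mentioned here can be sketched with the taxonomy commonly used in the literature: entity-pair overlap (EPO, two triples share the same head/tail pair), single-entity overlap (SEO, two triples share one entity), and normal sentences. The helper below is illustrative and not from the patent:

```python
def overlap_type(triples):
    """Classify the triples found in one sentence as 'EPO' (entity-pair
    overlap), 'SEO' (single-entity overlap) or 'Normal' -- the overlap
    cases a one-triple-per-sentence extractor cannot handle.
    Illustrative helper; the taxonomy is standard, the code is not
    from the patent."""
    pairs = [(h, t) for h, _, t in triples]
    if len(set(pairs)) < len(pairs):
        return "EPO"          # two relations share the same entity pair
    entities = [e for h, _, t in triples for e in (h, t)]
    if len(set(entities)) < len(entities):
        return "SEO"          # two triples share one entity
    return "Normal"
```

A sentence such as "Obama was president of and born in the USA" would yield two triples over the same pair and classify as EPO.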
Disclosure of Invention
The invention aims to address the existing problems by providing an end-to-end entity relationship joint extraction method and system based on relationship decomposition: sentence features are extracted separately under each relation, an attention mechanism is combined with a BERT pre-training model so that the information of the whole input sentence is fully used, the extraction of overlapping triples is handled, and the performance of joint entity-relation extraction is improved.
The technical scheme adopted by the invention is as follows:
The end-to-end entity relationship joint extraction method based on relationship decomposition according to the invention comprises the following steps: data preprocessing, namely converting the sentences from which entity relations are to be extracted into vector form according to the input format required by BERT and using the vectors as input to the BERT model; meanwhile, converting the triple labels into vector form; and separately marking the relations, head entities, and tail entities in the sentences;
model training: combining the text vector output by the BERT model with a sentence vector generated by an attention mechanism to obtain final vector representation of a sentence, and carrying out relation classification through a sigmoid function to identify the relation in the sentence; fusing the obtained relation characteristics with sentence characteristics for head and tail entity recognition;
and (4) decoding the result: and decoding the entity tags identified under different relation categories, and combining the entity tags with the relations to obtain entity relation triples existing in the sentences.
Preferably, each tag in the data pre-processing comprises: the relation type contained in the sentence and the position of the entity in the sentence under the corresponding relation type; and generating two groups of sentence marking sequences according to each relationship type, wherein the two groups of sentence marking sequences respectively represent the positions of the head entity and the tail entity in the triple.
Preferably, each relationship type among the predefined relationship types is represented by the two labels 0 and 1: if the relation exists in the sentence, the entry at the corresponding relation's index is marked 1, otherwise 0. The position of an entity under the corresponding relation type is represented by 0, 1, or 2 in two separate labeling sequences for head and tail entities, where 0 indicates that the word at the current position is not part of an entity, 1 indicates that the word is the start position of an entity, and 2 indicates that the word is the end position of an entity.
Preferably, the specific process of model training includes:
s21: inputting the text vector representation obtained in the data preprocessing stage into a BERT model, coding by adopting the BERT model based on a transform structure, and learning the context information of each word in a sentence;
s22: carrying out global average pooling on word vectors output by the BERT to obtain sentence-level vector representation; introducing an attention mechanism to learn word expressions having key effects on the sentence classifiers, and merging the word expressions and sentence-level vector expressions obtained after global average pooling to obtain final vector expressions of the sentences;
s23: according to the final vector representation of the sentence, carrying out multi-relation classification through a sigmoid function, and identifying the relation contained in the sentence;
s24: after the relation types contained in the sentences are obtained, one relation is randomly selected, and vector representation is obtained according to the relation embedding; combining the specific relationship vector characteristics with the sentence vector representation based on words output by the BERT model, and identifying entities under specific relationships;
s25: and constructing sentence vector representation under a specific relation for the relation identified in each sentence, and carrying out entity identification on the sentence vector.
Preferably, the method further comprises: for all training samples, maximizing the sample likelihood function, training the model through a back-propagation algorithm, and updating the parameters of the model.
Preferably, in the training process, one relation in a sentence and its corresponding triple are randomly selected for training; all triples in the training set are included in training by extending the number of training epochs.
Preferably, the relationship vector features are combined with the word-based sentence vector representation output by the BERT model, and a conditional layer normalization method is adopted: conditional level normalization is performed on the word-based sentence vector representation with the relationship vector features as conditions.
Preferably, the relationship embedding is randomly initialized according to predefined relationship categories.
Preferably, in the result decoding stage, the following three matching methods are classified according to the number of the identified head entity and tail entity: if the number of the head entities is 1, matching the head entities with all tail entities; if the number of tail entities is 1, matching the tail entities with all head entities; and if the number of the head entity and the number of the tail entity are both more than 1, matching the head entity and the tail entity according to a nearby matching principle.
The invention relates to an end-to-end entity relation joint extraction system based on relation decomposition, which comprises:
a data preprocessing module: carrying out word segmentation on the entity and relation triples labeled in the training set according to a dictionary in the BERT model to convert the entities and relation triples into a vector form;
a model training module: inputting a vector corresponding to each word in a sentence in a training set into a BERT model, wherein a text vector output by the BERT model is input into a neural network model jointly extracted based on an end-to-end entity relationship of relationship decomposition, and is trained through a back propagation algorithm to obtain a label prediction model;
a result decoding module: inputting sentences needing entity relation extraction into a trained label prediction model, predicting the relation type of the sentences and labels corresponding to each word in the sentences under the corresponding relation; and obtaining entity relationship triples existing in the sentences according to the relationship type tags and the entity tags in the corresponding relationship.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. the method solves the problem that the traditional entity relation joint extraction scheme is difficult to solve the problem of overlapping triples in sentences.
2. The invention respectively extracts the sentence characteristics under different relations, introduces a BERT pre-training model by combining an attention mechanism and fully utilizes the information of the whole input sentence.
3. The method and the device improve the performance of entity relation joint extraction and have good practicability.
Drawings
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
fig. 1 is a flowchart of an end-to-end entity relationship joint extraction method based on relationship decomposition according to the present invention.
FIG. 2 is a diagram of an embodiment of a neural network model structure based on end-to-end entity relationship joint extraction of relationship decomposition.
FIG. 3 is a flowchart of a specific labeling process in the embodiment.
FIG. 4 is a diagram illustrating the overlapping of triples in a sentence being extracted.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
Any feature disclosed in this specification (including any accompanying claims, abstract) may be replaced by alternative features serving equivalent or similar purposes, unless expressly stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
As shown in FIG. 1, the present invention discloses a method for extracting end-to-end entity relationship combination based on relationship decomposition, which comprises the following steps:
data preprocessing, namely converting sentences of the entity relationship to be extracted into a vector form according to a format required by BERT, and taking the vector form as the input of a BERT model; meanwhile, the triple label is converted into a vector form; respectively marking out the relation, the head entity and the tail entity in the sentence;
model training: combining the text vector output by the BERT model with a sentence vector generated by an attention mechanism to obtain final vector representation of a sentence, and carrying out relationship classification through a sigmoid function to identify the relationship in the sentence; fusing the obtained relation characteristics with sentence characteristics for head and tail entity recognition;
and (4) decoding the result: and decoding the entity tags identified under different relation categories, and combining the entity tags with the relations to obtain entity relation triples existing in the sentences.
The end-to-end entity relationship joint extraction method based on relationship decomposition according to the invention can extract entity-relation triples from any natural-language text; it is not limited to a specific text type and applies to data with common characteristics such as news, microblogs, and encyclopedia articles.
The concrete realization of the end-to-end entity relationship joint extraction method based on relationship decomposition comprises: a data preprocessing stage; a model training stage for the neural network model of end-to-end joint extraction based on relationship decomposition, as shown in fig. 2; and a result decoding stage, in which the predicted relation categories and entity label sequences are matched to obtain the relation-entity triples.
S1: and in the data preprocessing stage, input data are NYT and WebNLG respectively. NYT is a large data set constructed based on the news corpus of New York Times, which contains 66194 sentences and 24 types of relationship categories, wherein 56195 sentences are used as training sets, 4999 sentences are used as verification sets, and the rest 5000 sentences are used as test sets. The data of WebNLG originated from articles in Wikipedia, forming a standard dataset from annotators manual annotations, containing 6222 sentences in total and 246 types of relationship categories.
S11: and converting into a tag sequence according to the triple information given in the labeled corpus. Firstly, the initial sentence is re-divided according to the input requirement of BERT, and the words outside the built-in dictionary of the BERT are split, so as to obtain a new sentence sequence. The new sentence sequence is then vectorized, which is divided into two phases: respectively, a relationship label and an entity label. The relation marking refers to marking all relations existing in the sentence according to predefined relation types and relations contained in the triples. The entity label is divided into a head entity and a tail entity, and two entity label sequences are constructed for each sentence. The words in the sentence are represented by types of 0, 1 and 2 according to the specific positions of the words in the sentence, wherein: 0 indicates that the current word is other words, 1 indicates that the current word belongs to the beginning of the entity, and 2 indicates that the current word belongs to the end of the entity. The specific labeling process is shown in fig. 3.
S2: the model training stage comprises the following specific steps:
s21: and (3) inputting the text vector representation obtained in the data preprocessing stage into a BERT pre-training model, wherein the length of input sentences is unified to max _ len, the sentences with the length smaller than the max _ len are supplemented by filling characters, and the sentences larger than the max _ len are truncated. And (4) coding and learning the text vector through a BERT model to obtain the context information of each word in the sentence, and outputting the vector representation of the sentence. The calculation formula is as follows:
xt=Wordpiece(wi)t∈[1,n],i∈[1,m]
ht=BERT(xt)t∈[1,n]
wherein
Figure BDA0003511837890000051
And dωThe dimension representing the hidden state of BERT.
Then, H ═ H is used1,h2,…,hn]To represent sentence features based on the context level of the word.
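The length-unification step described in S21 can be sketched as below (names are illustrative; pad_id 0 corresponds to BERT's [PAD] token):

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Unify every input sequence to length max_len: shorter
    sequences are padded with a filling token, longer ones are
    truncated, as described for S21."""
    clipped = token_ids[:max_len]          # truncate if too long
    return clipped + [pad_id] * (max_len - len(clipped))  # pad if too short
```

Both branches return exactly max_len ids, so a batch of sentences can be stacked into one fixed-shape tensor.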
S22: global average pooling, root, of word vectors output by BERTA sentence-vector representation is derived from the word-based vector representation. In addition, the attention mechanism learning is introduced to express words having key action on the sentence classifier, and the words and the sentence vector expression obtained after global average pooling are combined to obtain the final vector expression of the sentence. For the k-th sentence input, the effective vector output by the BERT model is expressed by htSentence level vector S obtained after global average poolinghWith sentence vector representation S based on attention mechanismaCombining to form final sentence level vector representation S as input, and calculating vector representation of relation category labels; calculating according to the obtained vector representation of the relation category label to obtain the probability that the relation in each sentence corresponds to each category label;
the calculation formula is as follows:
Sh=GlobalAveragePooling(H)
M=tanh(H)
α=softmax(ωTM)
Sa=HαT
S=concat[Sh,Sa]
wherein
Figure BDA0003511837890000061
dωDimension representing BERT hidden state, dSRepresenting the dimensions of the sentence vector feature representation, ω being the trained parameter vector, ωTIs a transpose. The size of omega, alpha and T is dω,dα,T。
S23: and performing multi-classification task on the sentence through a sigmoid function according to the final vector representation of the sentence, thereby identifying the predefined relation category existing in the sentence. The calculation formula is as follows:
vj=σ(W1S+b1)
wherein
Figure BDA0003511837890000062
k represents the total number of relationship classes, and σ represents the sigmoid activation function. The function returns a value in the range of 0 to 1, which can be used as a predicate fingerA threshold value for whether a relationship exists is determined. According to the formula, all relationship types contained in the triples in the current sentence can be obtained.
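Steps S22-S23 can be sketched in numpy as follows. H holds the BERT token vectors as columns (d_ω × n); the parameters follow the formulas above but are random stand-ins here, not trained weights:

```python
import numpy as np

def classify_relations(H, omega, W1, b1, threshold=0.5):
    """Sentence representation (S22) and multi-label relation
    classification (S23): average pooling plus attention pooling,
    concatenated and passed through a sigmoid layer.  A sketch with
    untrained random parameters, not the patent's implementation."""
    S_h = H.mean(axis=1)                       # global average pooling
    M = np.tanh(H)
    scores = omega @ M                         # (n,) attention logits
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                # softmax weights over tokens
    S_a = H @ alpha                            # attention-weighted sentence vector
    S = np.concatenate([S_h, S_a])             # final sentence representation
    v = 1.0 / (1.0 + np.exp(-(W1 @ S + b1)))   # sigmoid score per relation class
    return v > threshold, v

rng = np.random.default_rng(0)
d, n, k = 8, 5, 3                              # hidden dim, tokens, relation classes
present, probs = classify_relations(rng.normal(size=(d, n)),
                                    rng.normal(size=d),
                                    rng.normal(size=(k, 2 * d)),
                                    np.zeros(k))
```

The thresholded boolean vector `present` plays the role of the predicted relation set; in training the raw scores `probs` feed the loss instead.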
S24: through the previous relation classification module, the relation classes contained in the sentences are obtained. Then, an entity identification module is carried out, and the specific steps are as follows:
s241: performing relation embedding, and generating vector representation Rel [ Rel ] of all relation classes1,Rel2,…,Relk,]And then acquiring corresponding relation vector representation according to the identified relation category.
S242: the sentence and the particular relationship vector representation are combined to generate a relationship-based sentence vector representation. Finally, entity identification is performed under a specific relationship. It should be noted that during the training process, the sentences are combined with one relationship at a time by randomly extracting the sentences. In the prediction process, sentences are copied according to the relation number in the sentences, and each sentence is combined with different relations. Therefore, the training round needs to be delayed to ensure that all relationships are selected for training. The calculation formula is as follows:
μ = mean(h_t),  σ = std(h_t)
γ_j = W_γ Rel_j + γ,  β_j = W_β Rel_j + β
h̃_t = γ_j ⊙ (h_t − μ) / σ + β_j
p_i^head = softmax(W_head h̃_i + b_head)
p_i^tail = softmax(W_tail h̃_i + b_tail)

where k denotes the number of relation categories, Rel_j denotes the relation vector combined with the current sentence, W_head, W_tail ∈ R^{tag_num×d_ω}, and tag_num denotes the number of tag categories, comprising the three tags 0, 1, and 2. p_i^head and p_i^tail denote the probabilities that the i-th token is predicted as a head-entity tag and a tail-entity tag, respectively, under the condition of relation Rel_j.
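The fusion mechanism the method names, conditional layer normalization, lets the relation embedding generate the gain and bias of an ordinary layer norm applied to each token vector. A minimal numpy sketch (weight names are illustrative; with zero weights it reduces to plain layer normalization):

```python
import numpy as np

def conditional_layer_norm(h, rel, W_gamma, W_beta, eps=1e-6):
    """Normalize token vector h, then scale and shift it with a gain
    and bias generated from the relation embedding rel -- a sketch of
    conditional layer normalization, not the patent's exact code."""
    gamma = 1.0 + W_gamma @ rel      # condition-dependent gain, initialised near 1
    beta = W_beta @ rel              # condition-dependent bias
    mu, sigma = h.mean(), h.std()
    return gamma * (h - mu) / (sigma + eps) + beta

h = np.array([0.0, 1.0, 2.0, 3.0])   # one token vector (d_w = 4)
rel = np.array([1.0, -1.0])          # one relation embedding
out = conditional_layer_norm(h, rel,
                             W_gamma=np.zeros((4, 2)),
                             W_beta=np.zeros((4, 2)))
```

Because γ and β depend on Rel_j, the same sentence encoding produces different entity-tagging features for each relation, which is what allows overlapping triples to be separated.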
S25: and for all training samples, training the model by maximizing the maximum likelihood function of the samples, and updating parameters in the model. Because the model comprises two parts of relation extraction and entity identification, the training loss of the model also comprises two parts, namely: a relational classification loss function and an entity identification loss function. Wherein, the entity identification loss part comprises a head entity loss and a tail entity loss. Loss of training of the model
Figure BDA00035118378900000710
(to minimize) the sum of the relationship label and the entity label negative log probability, defined as the prediction distribution, is given by the formula:
Figure BDA00035118378900000711
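The stated objective, summing the negative log-probabilities of the relation labels and of the per-token head/tail tags, can be sketched as below (argument names are illustrative):

```python
import math

def joint_loss(rel_probs, rel_gold, head_probs, head_gold, tail_probs, tail_gold):
    """Sum of negative log-probabilities: a multi-label (sigmoid)
    relation part plus per-token softmax tag parts for head and tail
    entities.  A sketch of the stated training objective."""
    loss = 0.0
    for p, y in zip(rel_probs, rel_gold):           # relation part
        loss -= math.log(p if y == 1 else 1.0 - p)
    for token_dists, gold_tags in ((head_probs, head_gold),
                                   (tail_probs, tail_gold)):
        for dist, y in zip(token_dists, gold_tags): # tag part, per token
            loss -= math.log(dist[y])
    return loss

loss = joint_loss([0.9, 0.1], [1, 0],               # 2 relation classes
                  [[0.8, 0.1, 0.1]], [0],           # 1 token, head tags
                  [[0.1, 0.8, 0.1]], [1])           # 1 token, tail tags
```

Each factor enters as −log p, so confident correct predictions contribute little and confident wrong ones dominate the gradient.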
s3: and a result decoding stage:
s31: inputting sentences needing entity relationship extraction into the combined extraction model, and identifying the relationship contained in the sentences;
s32: obtaining different relation characteristic vector representations according to the relation quantity and relation embedding predicted in the S31; copying an input sentence, combining different relation characteristic vectors, and identifying entities under different relation categories;
s33: the entity identification is divided into two parts of head entity identification and tail entity identification.
The recognition process defines two rules: 1. the head entity and tail entity of the same triple cannot contain each other; 2. the length of a head or tail entity cannot be zero or exceed 5. Head and tail entities are then matched according to their numbers to obtain the triples, realizing the extraction of overlapping triples.
The scheme shows that the method introduces a BERT pre-training model and an attention mechanism to code sentences aiming at the problem that overlapping triples are difficult to process in entity relationship joint extraction, provides an end-to-end entity relationship joint extraction model based on relationship decomposition, can effectively improve the prediction performance of the overlapping triples, and has good practicability.
The end-to-end entity relationship joint extraction method based on relationship decomposition according to the invention is compared with seven prior techniques in Table 1 below.
TABLE 1 comparison of the accuracy, recall and F1 index of the extraction method of the present invention with seven prior art techniques
As seen from Table 1, the method of the present invention achieves the best joint entity-relation extraction performance on both datasets.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.

Claims (10)

1. An end-to-end entity relationship joint extraction method based on relationship decomposition is characterized by comprising the following steps:
data preprocessing, namely converting the sentences from which entity relations are to be extracted into vector form according to the format required by BERT and using the vectors as input to the BERT model; meanwhile, converting the triple labels into vector form; and separately marking the relations, head entities, and tail entities in the sentences;
model training: combining the text vector output by the BERT model with a sentence vector generated by an attention mechanism to obtain final vector representation of a sentence, and carrying out relationship classification through a sigmoid function to identify the relationship in the sentence; fusing the obtained relation characteristics with sentence characteristics for head and tail entity recognition;
and (4) decoding the result: and decoding the entity tags identified under different relation categories, and combining the entity tags with the relations to obtain entity relation triples existing in the sentences.
2. The method for extracting end-to-end entity relationship joint based on relationship decomposition as claimed in claim 1, wherein each label in data preprocessing comprises: the relation type contained in the sentence and the position of the entity in the sentence under the corresponding relation type; and generating two groups of sentence marking sequences according to each relationship type, wherein the two groups of sentence marking sequences respectively represent the positions of the head entity and the tail entity in the triple.
3. The end-to-end entity relationship joint extraction method based on relationship decomposition as claimed in claim 2, wherein each relationship type among the predefined relationship types is represented by the two labels 0 and 1: if the relation exists in the sentence, the entry at the corresponding relation's index is marked 1, otherwise 0; and the position of an entity under the corresponding relation type is represented by 0, 1, or 2 in two separate labeling sequences for head and tail entities, wherein 0 indicates that the word at the current position is not part of an entity, 1 indicates that the word is the start position of an entity, and 2 indicates that the word is the end position of an entity.
4. The method for extracting end-to-end entity relationship jointly based on relationship decomposition as claimed in claim 1, wherein the specific process of model training includes:
s21: inputting the text vector representation obtained in the data preprocessing stage into a BERT model, coding by adopting the BERT model based on a transformer structure, and learning the context information of each word in the sentence;
s22: carrying out global average pooling on word vectors output by the BERT to obtain sentence-level vector representation; introducing an attention mechanism to learn word expressions having key effects on the sentence classifiers, and merging the word expressions and sentence-level vector expressions obtained after global average pooling to obtain final vector expressions of the sentences;
s23: according to the final vector representation of the sentence, carrying out multi-relation classification through a sigmoid function, and identifying the relation contained in the sentence;
s24: after the relation types contained in the sentences are obtained, one relation is randomly selected, and vector representation is obtained according to the relation embedding; combining the specific relationship vector characteristics with the sentence vector representation based on words output by the BERT model, and identifying entities under specific relationships;
s25: and constructing sentence vector representation under a specific relation for the relation identified in each sentence, and carrying out entity identification on the sentence vector.
5. The end-to-end entity relationship joint extraction method based on relationship decomposition as claimed in claim 4, further comprising: for all training samples, maximizing the likelihood function of the samples, training the model through the back propagation algorithm, and updating the parameters in the model.
6. The end-to-end entity relationship joint extraction method based on relationship decomposition as claimed in claim 4, wherein in the training process, one relationship in a sentence and its corresponding triple are randomly selected for training; by increasing the number of training epochs, all triples in the training set are eventually included in the training.
7. The method of claim 4, wherein the relationship vector features are combined with the word-based sentence vector representation output by the BERT model using conditional layer normalization: the word-based sentence vector representation is layer-normalized with the relationship vector features as the condition.
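Conditional layer normalization as described in claim 7 can be sketched as follows; the weight shapes and the way the gain and bias are generated from the relation embedding are assumptions for illustration, not the patent's exact parameterization:

```python
import numpy as np

def conditional_layer_norm(h, rel_emb, W_gamma, W_beta, eps=1e-5):
    # Normalize each token vector over its hidden dimension, then apply a
    # gain and bias generated from the relation embedding (the condition).
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    normed = (h - mu) / np.sqrt(var + eps)
    gamma = rel_emb @ W_gamma   # relation-specific gain, shape (hidden,)
    beta = rel_emb @ W_beta     # relation-specific bias, shape (hidden,)
    return gamma * normed + beta
```

Conditioning the normalization parameters on the relation lets a single shared entity tagger behave differently for each relation type without duplicating the encoder.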
8. The end-to-end entity relationship joint extraction method based on relationship decomposition as claimed in claim 4, wherein the relationship embedding is initialized randomly according to the predefined relationship categories.
9. The end-to-end entity relationship joint extraction method based on relationship decomposition as claimed in claim 1, wherein at the result decoding stage, one of the following three matching methods is selected according to the numbers of identified head entities and tail entities: if there is exactly one head entity, it is matched with all tail entities; if there is exactly one tail entity, it is matched with all head entities; and if there is more than one head entity and more than one tail entity, the head and tail entities are matched according to the nearest-match principle.
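The three-way matching rule of claim 9 can be sketched as follows, representing each entity by its start-token index; the nearest-match tie-breaking is an assumption, since the patent does not spell out its exact rule:

```python
from typing import List, Tuple

def match_entities(heads: List[int], tails: List[int]) -> List[Tuple[int, int]]:
    # Case 1: a single head entity is paired with every tail entity.
    if len(heads) == 1:
        return [(heads[0], t) for t in tails]
    # Case 2: a single tail entity is paired with every head entity.
    if len(tails) == 1:
        return [(h, tails[0]) for h in heads]
    # Case 3: multiple heads and tails -> pair each head with its
    # nearest tail by token position (the "nearby matching principle").
    return [(h, min(tails, key=lambda t: abs(t - h))) for h in heads]
```

For example, `match_entities([2], [5, 9])` returns `[(2, 5), (2, 9)]`, while `match_entities([1, 8], [3, 10])` returns `[(1, 3), (8, 10)]`.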
10. An end-to-end entity relationship joint extraction system based on relationship decomposition, comprising:
a data preprocessing module, which performs word segmentation on the sentences and the labeled entity-relationship triples in the training set according to the dictionary of the BERT model and converts them into vector form;
a model training module, which inputs the vector corresponding to each word of a sentence in the training set into the BERT model, feeds the text vectors output by the BERT model into the neural network model for end-to-end entity relationship joint extraction based on relationship decomposition, and trains it through the back propagation algorithm to obtain a label prediction model;
a result decoding module, which inputs sentences requiring entity relationship extraction into the trained label prediction model, predicts the relationship types of the sentences and the label of each word in the sentence under the corresponding relationship, and obtains the entity relationship triples in the sentences from the relationship type labels and the entity labels under the corresponding relationships.
CN202210166252.6A 2022-02-21 2022-02-21 End-to-end entity relationship joint extraction method and system based on relationship decomposition Pending CN114564563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210166252.6A CN114564563A (en) 2022-02-21 2022-02-21 End-to-end entity relationship joint extraction method and system based on relationship decomposition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210166252.6A CN114564563A (en) 2022-02-21 2022-02-21 End-to-end entity relationship joint extraction method and system based on relationship decomposition

Publications (1)

Publication Number Publication Date
CN114564563A true CN114564563A (en) 2022-05-31

Family

ID=81713412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210166252.6A Pending CN114564563A (en) 2022-02-21 2022-02-21 End-to-end entity relationship joint extraction method and system based on relationship decomposition

Country Status (1)

Country Link
CN (1) CN114564563A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168599A (en) * 2022-06-20 2022-10-11 北京百度网讯科技有限公司 Multi-triple extraction method, device, equipment, medium and product
CN115168599B (en) * 2022-06-20 2023-06-20 北京百度网讯科技有限公司 Multi-triplet extraction method, device, equipment, medium and product
CN114841151A (en) * 2022-07-04 2022-08-02 武汉纺织大学 Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN115130466A (en) * 2022-09-02 2022-09-30 杭州火石数智科技有限公司 Classification and entity recognition combined extraction method, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN112115238A (en) Question-answering method and system based on BERT and knowledge base
CN112069811A (en) Electronic text event extraction method with enhanced multi-task interaction
CN114896388A (en) Hierarchical multi-label text classification method based on mixed attention
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111027595A (en) Double-stage semantic word vector generation method
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN114691864A (en) Text classification model training method and device and text classification method and device
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN114925205A (en) GCN-GRU text classification method based on comparative learning
CN114612921A (en) Form recognition method and device, electronic equipment and computer readable medium
CN114048314A (en) Natural language steganalysis method
CN111259106A (en) Relation extraction method combining neural network and feature calculation
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN115827871A (en) Internet enterprise classification method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination