CN115510855A - Entity relation joint extraction method of multi-relation word pair label space - Google Patents

Entity relation joint extraction method of multi-relation word pair label space

Info

Publication number
CN115510855A
CN115510855A (application CN202211171497.4A)
Authority
CN
China
Prior art keywords
entity
word
head
tail
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211171497.4A
Other languages
Chinese (zh)
Inventor
王立松
孙明杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202211171497.4A priority Critical patent/CN115510855A/en
Publication of CN115510855A publication Critical patent/CN115510855A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a joint entity-relation extraction method for a multi-relation word-pair label space. An input layer receives English training samples, or samples in the prediction stage; the Tokenize layer tokenizes the sample sentences received by the input layer according to a word list and, after BERT encoding, obtains Token semantic representation vectors and a dictionary recording the starting position of each word in the Token sequence; the Max Pooling layer performs maximum pooling on the Token semantic representation vectors based on the dictionary to obtain a semantic vector representation of each word in the sentence; and the joint extraction layer enumerates all word pairs in the sentence, marks labels for the word pairs in all predefined relation spaces, and finally performs joint extraction according to the label characteristics. The invention further improves the effectiveness and efficiency of joint entity-relation extraction under complex relations and provides a better guarantee for the lower layers of natural language processing.

Description

Entity relation joint extraction method of multi-relation word pair label space
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a method for jointly extracting entity relations of a multi-relation word pair label space.
Background
Entity relationship joint extraction is a basic task of natural language processing, and the existing joint extraction methods have certain limitations. Entity relationship joint extraction aims to extract, from a sentence of unstructured text, all correct triples consisting of a head entity, a relation and a tail entity. In real scenarios, however, the context of a sentence is very complex, and different types of triple overlap are involved.
As shown in fig. 1, four triple-overlap cases are given: EPO (entity pair overlap), SEO (single entity overlap), SOO (head entity and tail entity overlap), and Normal (no overlap). Most current models show shortcomings when extracting entity relations in such complex contexts, and even the more sophisticated models do not handle such triple overlap well. In general, the existing entity relationship extraction models have the following defects:
at present, most models that perform best on the joint entity-relation extraction task can only deal with simple, non-overlapping triples, and their frameworks cannot be applied to overlapping triples in complex contexts.
For overlapping triples, a number of entity relationship extraction studies have emerged. They fall mainly into two categories: the pipeline mode and the joint extraction mode. The pipeline mode completes the subtasks in sequence: first, named entity recognition is performed on the sentence, and then the relations between candidate entities are classified pairwise. The pipeline mode has low coupling and low relevance between tasks, which results in poor information interaction between tasks as well as problems of error propagation and exposure bias.
The joint extraction mode can be further subdivided into multi-task learning and single-step single-module extraction. Multi-task learning divides the joint extraction task into several related sub-modules that work cooperatively. Related methods exist, such as predicting the relation first and then identifying the head and tail entities for that specific relation, or recognizing the head entity first, then predicting the relation, and finally recognizing the tail entity. Compared with the pipeline mode, multi-task learning is an improvement, since some information interaction occurs between modules: for example, the entity recognition module has relation information and the relation prediction module has entity information. However, this information interaction is not completely shared between modules, and, as in the pipeline mode, error propagation and exposure bias still exist between modules owing to their low coupling.
The single-step single-module form is currently regarded as the best model framework. It performs semantic encoding in a single module and extracts all triples in a sentence in one step. Entity and relation information is completely shared within the same module, which avoids the influence of error propagation and exposure bias between modules. However, research on single-step single-module models is still scarce, and the existing ones suffer from low training and inference efficiency and relatively incomplete sharing of entity information.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, a joint entity-relation extraction method for a multi-relation word-pair label space.
In order to achieve this technical purpose, the technical scheme adopted by the invention is as follows:
the entity-relation joint extraction method for the multi-relation word-pair label space is implemented on the basis of an entity-relation joint extraction model comprising an input layer, a Tokenize layer, a Max Pooling layer and a joint extraction layer, and comprises the following steps:
step 1, the input layer receives English training samples, or samples in the prediction stage;
step 2, the Tokenize layer tokenizes the sample sentences received by the input layer according to a word list, and after BERT encoding, Token semantic representation vectors and a dictionary recording the starting positions of the words in the Token sequence are obtained;
step 3, the Max Pooling layer, based on the dictionary, performs maximum pooling on the Token semantic representation vectors to obtain a semantic vector representation of each word in the sentence;
step 4, the joint extraction layer enumerates all word pairs in the sentence, scores labels for the word pairs in all predefined relation spaces, and finally performs joint extraction according to the label characteristics.
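As an illustration of these four steps, a minimal PyTorch sketch of the model is given below. It is a sketch under stated assumptions, not the reference implementation of the invention: the class name, the concatenation of head and tail representations, and the single-linear-layer MLPs are illustrative choices.

```python
# Minimal sketch of the four-layer model (illustrative assumptions, not the
# patent's reference implementation).
import torch
import torch.nn as nn
from transformers import BertModel

class JointExtractionModel(nn.Module):
    def __init__(self, num_relations, num_tags=8, entity_dim=50,
                 bert_name="bert-base-cased", dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # semantic encoding (step 2)
        d = self.bert.config.hidden_size                   # d = 768 for BERT-Base
        self.mlp_head = nn.Linear(d, entity_dim)           # head-entity projection
        self.mlp_tail = nn.Linear(d, entity_dim)           # tail-entity projection
        self.dropout = nn.Dropout(dropout)
        # relation-specific projection onto the 8 label scores
        self.rel_proj = nn.Linear(2 * entity_dim, num_relations * num_tags)
        self.num_relations, self.num_tags = num_relations, num_tags

    def forward(self, token_ids, word_spans):
        # token_ids: (1, N); word_spans: [(start, end)] token span per word (Index)
        enc = self.bert(token_ids).last_hidden_state       # (1, N, d)
        # Max Pooling layer: one semantic vector per word (step 3)
        emb = torch.stack([enc[:, s:e].max(dim=1).values
                           for s, e in word_spans], dim=1)  # (1, X, d)
        h, t = self.mlp_head(emb), self.mlp_tail(emb)       # (1, X, d_e)
        X = h.size(1)
        # Joint extraction layer: enumerate all word pairs by broadcasting (step 4)
        pairs = torch.cat([h.unsqueeze(2).expand(-1, X, X, -1),
                           t.unsqueeze(1).expand(-1, X, X, -1)], dim=-1)
        logits = self.rel_proj(self.dropout(torch.relu(pairs)))
        return logits.view(-1, X, X, self.num_relations, self.num_tags)
```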
In order to optimize the technical scheme, the specific measures adopted further comprise:
in the step 2, the Tokenize layer uses the Tokenizer in the PyTorch Keras Bert package to Token the sample sentences received by the input layer according to the vocabulary.
In step 2 above, for a sentence W = {w_1, w_2, ..., w_X}, where w_i denotes the i-th word of the sentence, tokenization followed by BERT encoding yields the Token semantic representation vectors
W_enc = {t_1, t_2, ..., t_N} ∈ ℝ^(N×d),
where N denotes the number of tokens, t_i denotes the i-th token, W_enc denotes the semantic vectors of all tokens in the sentence after encoding, and d is the dimension of the semantic vectors; a dictionary Index recording the starting position of each word in the token sequence is also obtained.
Step 3 above fuses the Token semantic representation vectors into word vector representations by a Max Pooling operation:
Index = [(1, n_1)_1, (n_1 + 1, n_2)_2, ..., (n_(X-1) + 1, N)_X],
Emb_i = Maxpooling(W_enc[Index_i]),
wherein Index is the dictionary, obtained at the Tokenize layer, recording the span of each word in the token sequence; [·] denotes the slicing operation over the sequence; Emb_i denotes the resulting vector representation of the i-th word.
The label strategy adopted in step 4 is specifically as follows:
for an input sentence sample W = {w_1, w_2, ..., w_X} and a set of predefined relations R = {r_1, r_2, ..., r_Q}, a label matrix TM of size Q × X × X is generated, where X denotes the length of the sentence, r_i denotes the i-th relation in the relation set, and Q is the total number of relations;
each of the Q dimensions of the matrix TM corresponds to one relation in R, and each cell holds a label, generated by the model, with a specific meaning;
the rows and columns of the matrix represent the head entity and the tail entity respectively;
decoding extracts all predicted triples from the matrix at once according to the specific label meanings.
The label strategy sets eight labels according to the length characteristics of the entities and the alignment mode of the head and tail entities: SS, SMH, SMT, MSH, MST, MMH, MMT, A;
wherein SS indicates that the head entity and the tail entity each consist of a single word;
SMH indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the head word of the tail entity;
SMT indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the tail word of the tail entity;
MSH indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the head word of the head entity with the tail entity;
MST indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the tail word of the head entity with the tail entity;
MMH indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the head word of the tail entity;
MMT indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the tail word of the tail entity, or the tail word of the head entity with the tail word of the tail entity;
A indicates an empty label.
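For illustration, the eight labels above can be written down as a small constant table (a hypothetical encoding; the integer ids are an assumption, not fixed by the invention):

```python
# The eight word-pair labels; comments paraphrase the definitions above.
# The integer ids are an illustrative assumption.
TAGS = {
    "SS":  0,  # single-word head entity, single-word tail entity
    "SMH": 1,  # single-word head, multi-word tail; aligned with tail's head word
    "SMT": 2,  # single-word head, multi-word tail; aligned with tail's tail word
    "MSH": 3,  # multi-word head, single-word tail; head's head word aligned
    "MST": 4,  # multi-word head, single-word tail; head's tail word aligned
    "MMH": 5,  # multi-word head and tail; head words aligned
    "MMT": 6,  # multi-word head and tail; tail words (or head/tail words) aligned
    "A":   7,  # empty label: no alignment at this cell
}
```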
The joint extraction layer in step 4 enumerates all word pairs (Emb_i, Emb_j) under all predefined relations and assigns each a high-confidence label, on the basis of which decoding is realized.
The joint extraction layer in step 4 above applies two low-dimensional multi-layer perceptrons (MLPs) to map the high-dimensional word semantic vectors to low-dimensional entity representation vectors:
h_i = MLP_head(Emb_i),
t_j = MLP_tail(Emb_j),
wherein MLP denotes a multi-layer perceptron, mapping ℝ^d to ℝ^(d_e);
d_e is the dimension of the entity representations;
head and tail denote the head entity and the tail entity respectively.
The joint extraction layer in step 4 scores each word pair under all predefined relations through a single calculation:
P(y(h_i, r_q, t_j)) = Softmax(W_r · Drop(ReLU([h_i ; t_j]))),
wherein y(h_i, r_q, t_j) is the label annotated in the training set;
[h_i ; t_j] denotes the combination of the head and tail entity representations;
ReLU denotes the activation function;
Drop denotes a dropout strategy;
W_r ∈ ℝ^(8×2d_e) is a trainable relational projection parameter matrix;
and 8 is the number of classification labels.
The invention has the following beneficial effects:
the invention combines the characteristics of the entity and the mapping from the entity to the multi-relation space, establishes the label strategy of the interaction between the entity and the relation and the calculation method of the next modeling of the multi-relation, finally further improves the effect and efficiency of the entity relation joint extraction of the model under the complex relation and provides better guarantee for the bottom layer of the natural language processing.
Drawings
FIG. 1 illustrates an overlapping situation of triples in a conventional entity relationship joint extraction method;
FIG. 2 is a schematic diagram of a method for extracting entity relationships of a multi-relationship word pair label space jointly according to the present invention;
FIG. 3 is a case analysis of label generation results.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
As shown in fig. 2, the entity-relation joint extraction method for the multi-relation word-pair label space of the present invention is implemented on the basis of an entity-relation joint extraction model comprising an input layer, a Tokenize layer, a Max Pooling layer and a joint extraction layer, and the method comprises:
step 1, the input layer receives English training sample sentences, or sample sentences in the prediction stage;
step 2, the Tokenize layer tokenizes the sample sentences received by the input layer according to a word list, and after BERT encoding, Token semantic representation vectors and a dictionary recording the starting position of each word in the Token sequence are obtained;
the Tokenize layer uses tokenizers in a PyTorch Keras Bert packet, and the function of the Tokenizer is to perform Token transformation on sample sentences received by the input layer according to a vocabulary. There is also a dictionary that records the starting position of each word in the Token sequence after the sentence has been Tokenize. So that the subsequent Bert pre-training model can carry out the coding of semantic context and the formation of word semantic representation. For experimental contrast fairness, the Bert pre-training model used herein is also at the semantic coding layer.
Step 3, the Max Pooling layer, based on the dictionary, performs maximum pooling on the Token semantic representation vectors to obtain a semantic vector representation of each word in the sentence.
This layer plays an important role in improving the training and inference speed of the whole model and in enhancing information interaction within the joint extraction module.
Step 4, the joint extraction layer enumerates all word pairs in the sentence, scores labels for the word pairs in all predefined relation spaces, and finally performs joint extraction according to the label characteristics.
The joint extraction layer is the most critical step of the task; the related techniques are as follows.
Label strategy and decoding process:
For an input sentence sample W = {w_1, w_2, ..., w_X} and a set of predefined relations R = {r_1, r_2, ..., r_Q}, the entity-relation joint extraction model generates a label matrix TM of size Q × X × X, where X denotes the length of the sentence, r_i denotes the i-th relation in the relation set, and Q is the total number of relations.
Each of the Q dimensions of the matrix TM corresponds to one relation in R, and each cell holds a label, generated by the model, with a specific meaning.
The rows and columns of the matrix represent the head entity and the tail entity respectively.
Decoding extracts all predicted triples from the matrix at once according to the specific label meanings.
The invention sets eight labels according to the length characteristics of the entities and the alignment mode of the head and tail entities: SS, SMH, SMT, MSH, MST, MMH, MMT, A.
S and M in a label indicate that the corresponding entity consists of a single word or of multiple words, respectively.
SS indicates that the head entity and the tail entity each consist of a single word;
SMH indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the head word of the tail entity;
SMT indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the tail word of the tail entity;
MSH indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the head word of the head entity with the tail entity;
MST indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the tail word of the head entity with the tail entity;
MMH indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the head word of the tail entity;
MMT indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the tail word of the tail entity, or the tail word of the head entity with the tail word of the tail entity;
A indicates an empty label.
With this label system, the characteristics of the entities themselves can be fully exploited and decoding is facilitated.
During decoding, only the alignment positions of legal word pairs need to be found in each of the Q dimensions of the matrix TM. A detailed decoding example is shown in fig. 2.
In the dimension of the "Capital" relation, the word pair (Shijiazhuang, Hebei) is found to carry the label SMH, so the head entity is "Shijiazhuang".
The head word of the tail entity is "Hebei"; searching onward along the current row terminates on finding the label SMT, so the tail entity is "Hebei Province".
The triple (Shijiazhuang, Capital, Hebei Province) is extracted.
In the dimension of the "Contains" relation, the label MSH is found, and the tail entity is "Shijiazhuang".
The head word of the head entity is "Hebei"; searching down the current column finds the label MST, so the head entity is "Hebei Province".
The triple (Hebei Province, Contains, Shijiazhuang) is extracted.
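The decoding procedure walked through above can be sketched as follows (a simplified, illustrative implementation; it assumes tag strings as input and that MMT marks the cell aligning the tail words of both entities):

```python
# Sketch of decoding: scan each relation's X-by-X label matrix and recover
# triples according to the label meanings. Simplified for illustration.
def decode(tag_matrix, words, relations):
    # tag_matrix[q][i][j]: label for head word i, tail word j, relation q
    triples, X = [], len(words)
    for q, rel in enumerate(relations):
        tm = tag_matrix[q]
        for i in range(X):
            for j in range(X):
                tag = tm[i][j]
                if tag == "SS":          # both entities are single words
                    triples.append((words[i], rel, words[j]))
                elif tag == "SMH":       # search along the row for SMT
                    for j2 in range(j, X):
                        if tm[i][j2] == "SMT":
                            triples.append((words[i], rel,
                                            " ".join(words[j:j2 + 1])))
                            break
                elif tag == "MSH":       # search down the column for MST
                    for i2 in range(i, X):
                        if tm[i2][j] == "MST":
                            triples.append((" ".join(words[i:i2 + 1]),
                                            rel, words[j]))
                            break
                elif tag == "MMH":       # search the sub-matrix for MMT
                    hit = next(((i2, j2) for i2 in range(i, X)
                                for j2 in range(j, X)
                                if tm[i2][j2] == "MMT"), None)
                    if hit:
                        i2, j2 = hit
                        triples.append((" ".join(words[i:i2 + 1]), rel,
                                        " ".join(words[j:j2 + 1])))
    return triples

# The worked example above:
words = ["Shijiazhuang", "is", "the", "capital", "of", "Hebei", "Province"]
tm = [[["A"] * 7 for _ in range(7)] for _ in range(2)]
tm[0][0][5], tm[0][0][6] = "SMH", "SMT"   # "Capital" relation
tm[1][5][0], tm[1][6][0] = "MSH", "MST"   # "Contains" relation
print(decode(tm, words, ["Capital", "Contains"]))
# [('Shijiazhuang', 'Capital', 'Hebei Province'),
#  ('Hebei Province', 'Contains', 'Shijiazhuang')]
```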
The multi-relation space modeling method comprises the following steps:
In step 2, for a sentence W = {w_1, w_2, ..., w_X}, where w_i denotes the i-th word of the sentence, tokenization followed by BERT encoding yields the Token semantic representation vectors
W_enc = {t_1, t_2, ..., t_N} ∈ ℝ^(N×d),
where N denotes the number of tokens, t_i denotes the i-th token, W_enc denotes the semantic vectors of all tokens in the sentence after encoding, and d is the dimension of the semantic vectors; a dictionary Index recording the starting position of each word in the token sequence is also obtained.
In step 3, the Token semantic representation vectors are fused into word vector representations by a Max Pooling operation, as shown in formula (1):
Index = [(1, n_1)_1, (n_1 + 1, n_2)_2, ..., (n_(X-1) + 1, N)_X],
Emb_i = Maxpooling(W_enc[Index_i]), (1)
wherein Index is the dictionary, obtained at the Tokenize layer, recording the span of each word in the token sequence; [·] denotes the slicing operation over the sequence; Emb_i denotes the resulting vector representation of the i-th word.
For an input sentence, the word vector representation is obtained through the above processing.
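A toy illustration of formula (1) (random numbers, for exposition only):

```python
# Fuse token vectors into word vectors by max pooling over each word's span.
import torch

W_enc = torch.randn(6, 768)               # N = 6 tokens, d = 768 (BERT output)
Index = [(0, 3), (3, 4), (4, 5), (5, 6)]  # token span of each of X = 4 words
Emb = torch.stack([W_enc[s:e].max(dim=0).values for s, e in Index])
print(Emb.shape)                          # torch.Size([4, 768]): one vector per word
```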
In the joint extraction module of the joint extraction layer, all word pairs (Emb_i, Emb_j) are enumerated under all predefined relations and each is assigned a high-confidence label.
Inspired by dependency parsing and knowledge graph representation, the invention combines the ideas of biaffine attention and HolE to achieve the intended goal, as shown in equations (2) and (3):
h_i = MLP_head(Emb_i),
t_j = MLP_tail(Emb_j), (2)
where MLP denotes a multi-layer perceptron, mapping ℝ^d to ℝ^(d_e);
d_e is the dimension of the entity representations;
head and tail denote the head entity and the tail entity respectively.
Applying two low-dimensional multi-layer perceptrons (MLPs) to map the high-dimensional word semantic vectors to low-dimensional entity representation vectors has two advantages:
first, the low-dimensional MLP mapping can remove interfering information from the high-dimensional word semantic vectors;
second, such low-dimensional entity representation vectors speed up subsequent computations.
P(y(h_i, r_q, t_j)) = Softmax(W_r · Drop(ReLU([h_i ; t_j]))), (3)
wherein ReLU denotes the activation function;
Drop denotes a dropout strategy;
[h_i ; t_j] denotes the combination of the head and tail entity representations;
W_r ∈ ℝ^(8×2d_e) is a trainable relational projection parameter matrix;
and 8 is the number of classification labels.
The model will score each word pair under all predefined relationships through one calculation.
The objective function optimized during training is given by equation (4):
L = −(1 / (Q · X²)) Σ_{q=1..Q} Σ_{i=1..X} Σ_{j=1..X} log P(y(h_i, r_q, t_j)), (4)
wherein y(h_i, r_q, t_j) is the label annotated in the training set.
Equation (4) is the loss function; during training, the model parameters are optimized by gradient descent on this loss, so that the model tags word pairs more accurately.
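Under the reconstruction above, equation (4) amounts to a cross-entropy loss over the eight label classes at every (relation, head word, tail word) cell. A sketch follows; the tensor shapes are assumptions consistent with the model sketch given earlier:

```python
# Cross-entropy over the 8 labels for every (relation, head, tail) cell.
import torch
import torch.nn.functional as F

def joint_loss(logits, gold_tags):
    # logits: (B, X, X, Q, 8) label scores; gold_tags: (B, X, X, Q) label ids
    return F.cross_entropy(logits.reshape(-1, 8), gold_tags.reshape(-1))

B, X, Q = 2, 10, 24                       # toy batch: 10 words, 24 relations
logits = torch.randn(B, X, X, Q, 8, requires_grad=True)
gold = torch.randint(0, 8, (B, X, X, Q))
joint_loss(logits, gold).backward()       # gradient descent optimizes parameters
```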
The implementation verification is as follows:
the performance of the entity relationship joint extraction model is evaluated on two reference data sets NYT and WebNLG.
According to the marking strategy of the reference data set, the two versions can be divided into two versions, and the versions are distinguished by NYT, webNLG and WebNLG.
Wherein, the two versions, NYT and WebNLG, mark all the components of the entity, and NYT and WebNLG mark only the last word of the entity.
In addition, the invention divides the test set into several different subsets according to different overlapping modes of the triples. Detailed statistics of the data set are shown in table 1.
Table 1 data set statistics
For fair comparison, model performance is evaluated with the Precision (Prec.), Recall (Rec.) and F1-score indices, as in conventional methods.
The model is implemented in PyTorch, and is trained and deployed on a server with a Tesla V100-PCIe GPU with 32 GB of video memory;
the pre-trained model is the BERT-Base English model; Adam is used as the optimizer of the training parameters, with the learning rate set to 0.00001;
as in previous work, the maximum token length of a sentence is limited to 100;
the output dimension d of BERT is 768;
the mapping dimension of the MLPs is set to 50.
To prevent overfitting, the dropout ratio is set to 0.1.
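These settings map directly onto a standard PyTorch setup (a sketch; the model object stands in for the joint extraction model):

```python
# Reported hyperparameters as a PyTorch configuration sketch.
import torch
import torch.nn as nn

model = nn.Linear(768, 8)                 # stand-in for the joint extraction model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam, lr = 0.00001
MAX_TOKEN_LEN = 100                       # maximum token length of a sentence
BERT_DIM = 768                            # output dimension d of BERT
MLP_DIM = 50                              # mapping dimension of the MLPs
DROPOUT = 0.1                             # dropout ratio against overfitting
```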
The experiment compares against several baseline models; the overall experimental results are shown in Table 2, where '-' denotes unavailable data.
Table 2 experimental results (%)
The experimental results show that the entity-relation joint extraction model achieves good F1 performance on the benchmark data sets, and its advantage is evident on both the Prec. and Rec. indices.
Because NYT (NYT*) has more training samples and fewer relations than WebNLG (WebNLG*), the performance improvement there is less pronounced, which also shows that the size of the training set is very important to the performance of the model.
At the same time, the performance of the model is verified under different triple-overlap conditions, as shown in Table 3.
Table 3. Experimental results (%)
The experimental results show that the performance of the entity-relation joint extraction model improves on almost all triple-overlap types. In NYT and WebNLG, overlapping triples account for more than a quarter of the total; in such complex contexts, the key to the model's superior performance lies in the multi-relation space modeling method and the label strategy of the invention.
Word pairs, rather than token pairs, are used to enrich the contextual information of entities, which is important for the model to generate correct labels. A case analysis is given in fig. 3.
The words "Ampara", "Hospital", "Sri" and "Lanka" in sentence 1 form the token sequence 'Am', '##par', '##a', 'Hospital', 'Sri', 'Lanka' after passing through the Tokenize layer.
If modeling were performed on token pairs, the model would label the token pairs ('Am', 'Sri'), ('Am', 'Lanka') and ('Hospital', 'Lanka') with MMH, MMT and MMT respectively.
Upon decoding, the triple (Ampara Hospital, Country, Sri Lanka) is extracted.
Sentence 2 is a deliberate modification of sentence 1; after the Tokenize layer, its words "Am", "Hospital", "Sri" and "Lanka" map to the token sequence 'Am', 'Hospital', 'Sri', 'Lanka'.
The model would still label the token pairs ('Am', 'Sri'), ('Am', 'Lanka') and ('Hospital', 'Lanka') with MMH, MMT and MMT, which causes erroneous triples to be extracted. This shows that tokens alone do not capture the contextual semantic information completely.
Table 4 compares the efficiency of the entity-relation joint extraction model of the invention (Ours) with the other baseline models.
TABLE 4
The above are only preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples; all technical solutions falling within the idea of the present invention belong to its scope. It should be noted that those skilled in the art may make modifications and refinements without departing from the principle of the present invention, and these should also be regarded as falling within the scope of the present invention.

Claims (9)

1. An entity-relation joint extraction method for a multi-relation word-pair label space, implemented on the basis of an entity-relation joint extraction model comprising an input layer, a Tokenize layer, a Max Pooling layer and a joint extraction layer, characterized in that the method comprises the following steps:
step 1, the input layer receives English training samples, or samples in the prediction stage;
step 2, the Tokenize layer tokenizes the sample sentences received by the input layer according to a word list, and after BERT encoding, Token semantic representation vectors and a dictionary recording the starting positions of the words in the Token sequence are obtained;
step 3, the Max Pooling layer, based on the dictionary, performs maximum pooling on the Token semantic representation vectors to obtain a semantic vector representation of each word in the sentence;
step 4, based on the processing in step 3, the joint extraction layer enumerates all word pairs in the sentence, scores labels for the word pairs in all predefined relation spaces, and finally performs joint extraction according to the label characteristics.
2. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 1, characterized in that the Tokenize layer in step 2 uses the Tokenizer in the PyTorch Keras Bert package to tokenize the sample sentences received by the input layer according to the vocabulary.
3. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 1, characterized in that in step 2, for a sentence W = {w_1, w_2, ..., w_X}, where w_i denotes the i-th word of the sentence, tokenization followed by BERT encoding yields the Token semantic representation vectors
W_enc = {t_1, t_2, ..., t_N} ∈ ℝ^(N×d),
where N denotes the number of tokens, t_i denotes the i-th token, W_enc denotes the semantic vectors of all tokens in the sentence after encoding, and d is the dimension of the semantic vectors; a dictionary Index recording the starting position of each word in the token sequence is also obtained.
4. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 1, characterized in that step 3 fuses the Token semantic representation vectors into word vector representations by a Max Pooling operation:
Index = [(1, n_1)_1, (n_1 + 1, n_2)_2, ..., (n_(X-1) + 1, N)_X],
Emb_i = Maxpooling(W_enc[Index_i]),
wherein Index is the dictionary, obtained at the Tokenize layer, recording the span of each word in the token sequence; [·] denotes the slicing operation over the sequence; Emb_i denotes the resulting vector representation of the i-th word.
5. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 1, characterized in that the label strategy adopted in step 4 is specifically as follows:
for an input sentence sample W = {w_1, w_2, ..., w_X} and a set of predefined relations R = {r_1, r_2, ..., r_Q}, a label matrix TM of size Q × X × X is generated, where X denotes the length of the sentence, r_i denotes the i-th relation in the relation set, and Q is the total number of relations;
each of the Q dimensions of the matrix TM corresponds to one relation in R, and each cell holds a label, generated by the model, with a specific meaning;
the rows and columns of the matrix represent the head entity and the tail entity respectively;
decoding extracts all predicted triples from the matrix at once according to the specific label meanings.
6. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 5, characterized in that the label strategy sets eight labels according to the length characteristics of the entities and the alignment mode of the head and tail entities:
SS,SMH,SMT,MSH,MST,MMH,MMT,A;
wherein SS indicates that the head entity and the tail entity each consist of a single word;
SMH indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the head word of the tail entity;
SMT indicates that the head entity consists of a single word and the tail entity of multiple words, the current alignment being the head entity with the tail word of the tail entity;
MSH indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the head word of the head entity with the tail entity;
MST indicates that the head entity consists of multiple words and the tail entity of a single word, the current alignment being the tail word of the head entity with the tail entity;
MMH indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the head word of the tail entity;
MMT indicates that the head entity and the tail entity both consist of multiple words, the current alignment being the head word of the head entity with the tail word of the tail entity, or the tail word of the head entity with the tail word of the tail entity;
A indicates an empty label.
7. The method according to claim 1, characterized in that the joint extraction layer in step 4 enumerates all word pairs (Emb_i, Emb_j) under all predefined relations and assigns each a high-confidence label to realize decoding.
8. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 7, characterized in that the joint extraction layer in step 4 applies two low-dimensional multi-layer perceptrons (MLPs) to map the high-dimensional word semantic vectors to low-dimensional entity representation vectors:
h_i = MLP_head(Emb_i),
t_j = MLP_tail(Emb_j),
wherein MLP denotes a multi-layer perceptron, mapping ℝ^d to ℝ^(d_e);
d_e is the dimension of the entity representations;
head and tail denote the head entity and the tail entity respectively.
9. The entity-relation joint extraction method for a multi-relation word-pair label space according to claim 8, characterized in that the joint extraction layer in step 4, based on the low-dimensional entity representation vectors, scores labels for each word pair under all predefined relations through a single calculation, the scoring formula being:
P(y(h_i, r_q, t_j)) = Softmax(W_r · Drop(ReLU([h_i ; t_j]))),
wherein y(h_i, r_q, t_j) is the label annotated in the training set;
[h_i ; t_j] denotes the combination of the head and tail entity representations;
ReLU denotes the activation function;
Drop denotes a dropout strategy;
W_r ∈ ℝ^(8×2d_e) is a trainable relational projection parameter matrix;
and 8 is the number of classification labels.
CN202211171497.4A 2022-09-26 2022-09-26 Entity relation joint extraction method of multi-relation word pair label space Pending CN115510855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171497.4A CN115510855A (en) 2022-09-26 2022-09-26 Entity relation joint extraction method of multi-relation word pair label space

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211171497.4A CN115510855A (en) 2022-09-26 2022-09-26 Entity relation joint extraction method of multi-relation word pair label space

Publications (1)

Publication Number Publication Date
CN115510855A true CN115510855A (en) 2022-12-23

Family

ID=84505977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171497.4A Pending CN115510855A (en) 2022-09-26 2022-09-26 Entity relation joint extraction method of multi-relation word pair label space

Country Status (1)

Country Link
CN (1) CN115510855A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12026466B1 (en) * 2023-03-13 2024-07-02 Ailife Diagnostics, Inc. Distant supervision for data entity relation extraction



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination