CN106970981B - Method for constructing relation extraction model based on transfer matrix - Google Patents

Method for constructing relation extraction model based on transfer matrix

Info

Publication number
CN106970981B
CN106970981B (granted patent of application CN201710193366.9A)
Authority
CN
China
Prior art keywords
relationship
training
extraction model
sentences
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710193366.9A
Other languages
Chinese (zh)
Other versions
CN106970981A (en)
Inventor
罗炳峰 (Bingfeng Luo)
冯岩松 (Yansong Feng)
贾爱霞 (Aixia Jia)
赵东岩 (Dongyan Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201710193366.9A
Publication of CN106970981A
Application granted
Publication of CN106970981B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/36 — Creation of semantic tools, e.g. ontology or thesauri
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a relation extraction model based on a transfer matrix. The method comprises the following steps: 1) select a basic relation extraction model M, whose input is a sentence, or a group of sentences describing the same pair of subject and object, and whose output is a distribution p_i over the relations described by the input; as an intermediate result, M generates a vector representation s_i of the input sentence or group of sentences; 2) from s_i, construct a transition matrix T_i; 3) multiply the relation distribution p_i output by M by the transition matrix T_i and normalize, obtaining the distribution o_i of relations the input sentence or group of sentences may be labeled as; 4) train the basic relation extraction model M with o_i fitting the noisy label as the target, until a preset termination condition is reached, yielding the relation extraction model. The model is thereby shielded from the influence of noise, so a better relation extraction effect can be obtained.

Description

Method for constructing relation extraction model based on transfer matrix
Technical Field
The invention relates to a method that uses a transfer matrix to strengthen a relation extractor's resistance to noisy data during training, thereby improving relation extraction performance; it belongs to the field of information extraction.
Background
With the development of information technology and the Internet, ever more textual information is available. How to automatically construct a knowledge base from large amounts of text has therefore become a very important problem, as it allows computers to better exploit the information these texts contain.
A knowledge base generally consists of triples of the form (subject, predicate, object); for example, the triple (China, capital, Beijing) encodes the knowledge that the capital of China is Beijing. Automatically building a knowledge base is thus the process of automatically generating such triples. Relation extraction addresses the problem of automatically identifying the relationship between two items described in text (an item may be an entity such as "China", or a time, a numeric value, etc.), so that triples can be formed and added to the knowledge base.
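As a minimal illustration of this triple format (the helper function and facts below are only for the example, not part of the patent), a knowledge base can be modeled as a set of (subject, predicate, object) tuples, and relation extraction fills the predicate slot for a given subject-object pair:

```python
# A knowledge base as a set of (subject, predicate, object) triples.
# The example facts mirror the ones given in the text.
kb = {
    ("China", "capital", "Beijing"),
    ("Baidu", "inception date", "2000-01-01"),
}

def relations_between(kb, subj, obj):
    """Return all predicates the knowledge base records between subj and obj."""
    return {p for (s, p, o) in kb if s == subj and o == obj}
```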
The data used for the relation extraction task is mainly constructed by distant supervision: sentences that may express a piece of seed knowledge are retrieved automatically using that seed knowledge, and the resulting noisy data is used to train a relation extraction model. The benefit of this approach is that a large amount of training data can be acquired at low cost; the disadvantage is that a large portion of the data set is noisy. Manually labeled data may also contain noise: an annotator may miss some instances out of carelessness, or may be unable to tell whether a sentence describes a relationship for lack of domain knowledge. Since noisy data significantly affects model training, the quality of a relation extraction model depends largely on its resistance to noise.
Disclosure of Invention
The invention aims to provide a method for constructing a relation extraction model that is strongly resistant to noise in the training data. The input of the relation extraction task can be a single sentence, in which case the task is to judge the relationship between the target subject and target object described by that sentence; or a group of sentences, each containing the target subject and target object, in which case the relationship is judged jointly from the whole group. Assuming |C| relations are to be extracted, a |C| × |C| transition matrix T is constructed, where element T_ij represents the probability that the true relation expressed by an input sentence (or input group of sentences) is i but it is incorrectly labeled as relation j. Thus, given a basic relation extraction model M with output relation distribution p, instead of fitting p directly to the noisy label during training, the invention uses the transfer matrix T to convert p into the distribution o of relations the input may be labeled as, and uses o to fit the noisy label. Through this explicit modeling of the noise, the basic relation extractor M is shielded from the influence of noise during training, so a better relation extraction effect can be obtained.
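The core step described above can be sketched in a few lines of NumPy (an illustrative sketch, not the patented implementation): the predicted distribution p is mapped through T and renormalized before being compared with the noisy label.

```python
import numpy as np

def observed_distribution(p, T):
    """Map the predicted relation distribution p (shape [C]) through the
    transition matrix T (shape [C, C], where T[i, j] = P(labeled j | true i))
    to the distribution o of relations the input may be *labeled* as."""
    o = p @ T            # o_j = sum_i p_i * T[i, j]
    return o / o.sum()   # renormalize to a probability distribution

p = np.array([0.7, 0.2, 0.1])
assert np.allclose(observed_distribution(p, np.eye(3)), p)  # no noise: o == p
```

When T is the identity matrix the labels are assumed noise-free and o equals p; off-diagonal mass in T moves probability toward relations the input tends to be mislabeled as.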
In order to achieve the purpose, the technical scheme of the invention is as follows:
(1) Select a basic relation extraction model M that can generate a vector representation of an input sentence (or an overall vector representation of an input group of sentences). For the sentence i to be processed (or the ith group of sentences to be processed), M outputs a distribution p_i over the relations described, and generates as an intermediate result a vector representation s_i of the sentence (or group).
(2) From the vector representation s_i of the current sentence i to be processed (or the ith group of sentences to be processed), construct a transition matrix T_i. Its element T^i_jk (row j, column k of T_i) represents the probability that, if the relation expressed by sentence i (or group i) is j, it is incorrectly labeled as k. The transition matrix can thus be regarded as a model of the noise pattern of the input sentence (or input group of sentences).
(3) During training, multiply the relation distribution p_i predicted by the basic relation extraction model M for the input sentence (or input group of sentences) by the transition matrix T_i and normalize, obtaining the distribution o_i of relations the input may be labeled as; then train with o_i fitting the noisy label as the target, until a preset termination condition is reached (for example, a pre-specified number of training epochs, or only marginal improvement in extraction quality over the previous epoch).
(4) Through step (3), the basic relation extraction model M is fully trained. Owing to the introduction of the transfer matrix, M is shielded from the influence of noise during training and therefore achieves a better relation extraction effect. At inference time, the prediction of M is used directly and the transfer matrix is not used.
In step (1), although the invention requires that the basic relation extraction model M be able to generate a vector representation of an input sentence (or an overall vector representation of an input group of sentences), in practice most relation extraction models meet this requirement. For a traditional model based on manually designed feature templates, the sentence vector can be the vector of extracted feature components; for neural network models, the layer just before the output layer can generally serve as the sentence representation (other layers that model the entire sentence can also be used). The overall representation of a group of sentences may be a weighted average of the individual sentence vectors, or a weighted average taken after each sentence vector has been further processed by a recurrent neural network.
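The bag-level representation mentioned above can be sketched as follows (a minimal illustration; the per-sentence scores, e.g. attention logits, are an assumption and would come from a learned scoring function):

```python
import numpy as np

def bag_representation(sentence_vecs, scores):
    """Overall vector of a group of sentences: a weighted average of the
    per-sentence vectors, with weights given by a softmax over scores."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ np.asarray(sentence_vecs, dtype=float)
```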
In step (2), the mapping from the sentence vector representation s_i (or overall vector representation of a group of sentences) to the transition matrix T_i can be expressed as:

T^i_jk = exp(w_jk^T · s_i + b) / Σ_{k'=1}^{|C|} exp(w_jk'^T · s_i + b)

where w_jk is the parameter vector used to compute the entry in row j, column k of the transition matrix T_i, the superscript T denotes transpose, and b is a bias term. w_jk and b are initialized randomly before training begins and updated by back-propagation during training.
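Under this formula, each row of T_i is a softmax over scores that depend on the sentence vector s_i; a minimal sketch (the parameter shapes are assumptions for illustration):

```python
import numpy as np

def transition_matrix(s, W, b):
    """T[j, k] = exp(w_jk^T s + b) / sum_k' exp(w_jk'^T s + b).
    s: sentence vector, shape [d]; W: shape [C, C, d], one parameter
    vector w_jk per matrix cell; b: scalar bias. Each row sums to 1."""
    logits = W @ s + b                           # shape [C, C]
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)      # row-wise softmax
```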
In addition, a model may sometimes generate one vector representation s_il per relation l for an input sentence (or input group of sentences). In this case the transition matrix T_i can be generated by:

T^i_jk = exp(s_ij^T · w_k + b_j) / Σ_{k'=1}^{|C|} exp(s_ij^T · w_k' + b_j)

where s_ij is the vector representation of sentence i (or the ith group of sentences) for relation j, w_k is the weight vector of relation k, the superscript T denotes transpose, and b_j is the bias term of relation j. w_k and b_j are initialized randomly before training begins and updated by back-propagation during training.
Moreover, the vector representation s_i of sentence i (or the overall vector representation of the ith group of sentences) may first be passed through several fully connected layers to obtain a new representation s'_i, from which the transition matrix is then generated with the formula above.
In step (3), two different training modes may be used depending on whether the training data may be divided into several subsets of different noise levels.
1) If the training data cannot be further divided by noise level, a progressive (curriculum) training approach is needed. The loss function is:

loss = Σ_{i=1}^{N} [ α · loss_p(i) + (1 − α) · loss_o(i) − β · Trace(T_i) ]

where N is the number of samples in the training set (a sample can be a sentence or a group of sentences), loss_p(i) is the error of the output p_i of the basic relation extractor M against the noisy label of sample i, loss_o(i) is the error of the distribution o_i of relations sample i may be labeled as against the noisy label, Trace(T_i) is the trace of the transition matrix T_i, α is a real number between 0 and 1, and β is a real coefficient. Any piecewise-differentiable function that measures the difference between a predicted relation distribution and the labeled relation, including cross entropy, can be used as the error. Trace(T_i) serves as a regularization term: since each row of T_i sums to 1, and in the noise-free case T_i should be the identity matrix, controlling the trace of T_i (the sum of its diagonal elements) is equivalent to controlling how strongly the transition matrix models the noise. A large β pushes T_i toward the identity matrix, while a small (or negative) β encourages T_i to model the noise.
At the beginning of training, α is set to 1 and β to a relatively large positive number; that is, noise modeling is initially discouraged, and the basic relation extraction model M is instead expected to quickly learn basic classification ability from the noisy labels. Then, by gradually reducing α and β, the importance of noise modeling is gradually emphasized over the course of training, reducing the influence of noise on the training of M. Any gradient-based optimization method, including stochastic gradient descent, can be used. Training stops when a preset termination condition is reached; the condition may be reaching a certain number of epochs, or no further significant improvement of extraction quality on the development set.
2) If the training data can be divided into subsets (TD_1, TD_2, …) ordered from low to high noise, the following loss function may be used:

loss = Σ_{i=1}^{S} Σ_{j=1}^{N_i} [ loss_o(i, j) − β_i · Trace(T_ij) ]

where S is the number of subsets, N_i is the number of samples in subset TD_i (a sample can be a sentence or a group of sentences), loss_o(i, j) is the error of the distribution o_ij of relations the jth sample of TD_i may be labeled as against the noisy label, and β_i is the coefficient of the regularization term on the trace of the transition matrix T_ij for subset TD_i. Since prior knowledge of each subset's noise level lets the noise-modeling strength of the transition matrix be specified per subset in advance, loss_p is not needed here and loss_o can be optimized directly. Specifically, for a subset with little noise, β_i may be set to a relatively large positive number; for a subset with heavy noise, β_i may be set to a negative or relatively small positive number.
The loss functions above can use any piecewise-differentiable function that measures the difference between the predicted relation distribution and the labeled relation, including cross entropy, and any gradient-based optimization method, including stochastic gradient descent. During training, besides training on all subsets at once, one can also start on the least noisy subset and then add the remaining subsets to the training set one by one in order of increasing noise. The termination condition can be reaching a certain number of epochs, or no further significant improvement of extraction quality on the development set.
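The per-subset loss above drops the loss_p term and fixes a trace coefficient β_i per subset; a sketch under the same cross-entropy assumption (the data layout used here is illustrative):

```python
import numpy as np

def subset_loss(subsets, betas):
    """subsets: list over TD_1..TD_S; each subset is a list of
    (o, T, label) triples for its samples. betas[i] is the trace
    coefficient for TD_i: large and positive for reliable subsets,
    small or negative for noisy ones."""
    eps, total = 1e-12, 0.0
    for subset, beta in zip(subsets, betas):
        for o, T, label in subset:
            total += -np.log(o[label] + eps) - beta * np.trace(T)
    return total
```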
In step (4), since the basic relation extraction model M is in effect fitted to the latent true relation distribution during training, its prediction can be used directly at inference time.
Compared with the prior art, the invention has the following positive effects:
during training, the basic relationship extraction model M does not need to be directly fitted with a noisy label, but the distribution of the relationship described by the input sentence (or a group of input sentences) predicted by the model M is connected with the noisy label through a transfer matrix, so that the denoising effect is achieved. Compared with the existing method for directly fitting noisy labels with the output of M, the method disclosed by the invention has the advantages that the training of the basic relationship extraction model M is free from the influence of noise, the generation of biased models is avoided, and a better relationship extraction effect can be achieved.
Fig. 2 shows the effect of the invention on extracting relations whose object is a time (see the examples described below); extraction quality is shown as a precision-recall curve, where a higher curve means a better result. The data set can be divided into 3 subsets of different noise levels. Training first on reliable subsets and then on unreliable ones ("data sets trained in sequence") is clearly better than training on all subsets mixed together ("mixed data set"), showing that noise in the data set strongly affects model training. After applying the disclosed transfer-matrix method ("trained in sequence + transfer matrix"), the relation extraction effect improves further, showing that the method effectively strengthens the basic relation extraction model's resistance to noise during training and thus markedly improves extraction quality.
Drawings
FIG. 1 is a block diagram of a relationship extraction method in an embodiment of the invention;
fig. 2 is a diagram illustrating an extraction effect of the relationship extraction method according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention extracts relations whose object is a time, based on Wikidata, an open encyclopedic knowledge platform, with Wikipedia as the corpus. It will be apparent to those skilled in the art that other relation sets and other corpora may be used in a particular implementation.
Specifically, there are 12 types of relationships to be extracted in this embodiment, including a birth date, a death date, an organization establishment date, a work publication date, a spacecraft launch date, and the like. The data set construction process is as follows:
1) collect triples in Wikidata containing relations to be extracted, such as (Baidu, inception date, January 1, 2000);
2) for each triple, find all sentences in Wikipedia containing both its subject and object, such as "Robin Li (李彦宏) founded Baidu on January 1, 2000";
3) sentences containing both the subject and the object of a triple are regarded as natural-language descriptions of that triple. This assumption is imperfect, and exceptions necessarily occur. The invention therefore further assumes that the finer the temporal granularity mentioned in a sentence, the more likely it is that the sentence describes the triple. For example, a sentence containing both "January 1, 2000" and "Baidu" is more likely to describe the triple (Baidu, inception date, January 1, 2000) than a sentence containing only "2000" and "Baidu". Following this principle, the data set is divided into three subsets of decreasing reliability: sentences containing year-month-day, containing year-month, and containing only the year;
4) part of the triples of the year-month-day subset are held out as the test set, and the remaining triples form the training set (sentences in the other two subsets that relate to test-set triples are removed as well).
FIG. 1 is a diagram of a framework of a relationship extraction method based on a transition matrix according to an embodiment of the present invention;
step 1: a vector representation of the sentence to be processed is generated.
For an input sentence, a vector representation is generated first. Each word in the sentence is converted into its word vector, and a vector representation of the sentence is then generated by a convolutional neural network (see Daojian Zeng, Kang Liu, Yubo Chen, Jun Zhao. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. EMNLP 2015).
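A toy version of such a convolutional sentence encoder can be sketched as follows (window size, dimensions, and the flat filter layout are illustrative assumptions, and the piecewise pooling of the cited paper is simplified to plain max-pooling):

```python
import numpy as np

def cnn_sentence_vector(word_vecs, filters, window=3):
    """word_vecs: [n_words, d] word embeddings; filters: [n_filters, window * d].
    Slide a window over the sentence, apply every filter to each window,
    and max-pool over positions to get a fixed-size sentence vector."""
    n, d = word_vecs.shape
    feats = []
    for start in range(n - window + 1):
        segment = word_vecs[start:start + window].reshape(-1)  # [window * d]
        feats.append(filters @ segment)                        # [n_filters]
    return np.max(feats, axis=0)                               # max over positions
```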
Step 2: a distribution of the relationships described by the sentence is generated using a base relationship extractor.
Having obtained the vector representation s_i of the sentence, the predicted relation distribution is generated by a softmax classifier:

p_ij = exp(w_j^T · s_i) / Σ_{k=1}^{|C|} exp(w_k^T · s_i)

where p_ij, the jth component of the distribution p_i of relations described by sentence i, is the probability that sentence i describes relation j, w_j is the weight vector of relation j, and the superscript T denotes transpose.
Step 3: model the noise, generating the transition matrix T_i.

Since the sentence has only one vector representation, the transition matrix T_i is generated with the formula from the method section:

T^i_jk = exp(w_jk^T · s_i + b) / Σ_{k'=1}^{|C|} exp(w_jk'^T · s_i + b)

where T^i_jk is the element in row j, column k of the transition matrix T_i.
Step 4: use the predicted relation distribution p_i and the transition matrix T_i to generate the distribution o_i of relations the sentence may be labeled as:

p'_ij = Σ_{k=1}^{|C|} p_ik · T^i_kj

o_ij = p'_ij / Σ_{k=1}^{|C|} p'_ik

where o_ij is the jth component of o_i; the second formula normalizes p'_i so that o_i satisfies the properties of a probability distribution.
During training, 15 epochs are run on the year-month-day subset, then 15 more after adding the year-month subset, and another 15 after adding the year subset. The regularization coefficients (coefficients of the trace of the transfer matrix) for the year-month-day, year-month, and year subsets are 0.01, −0.01 and −0.1, respectively.
Fig. 2 shows the relation extraction effect of the method; the evaluation metric is the precision-recall curve. Specifically, the extraction results are ranked from high to low by the confidence output by the relation extractor; for each result, precision and recall are computed over it and all results of higher confidence; finally the precision-recall curve is drawn, and the higher the curve, the better the extraction effect. Precision and recall are computed as:

precision = (number of correctly extracted relation instances) / (number of extracted relation instances)

recall = (number of correctly extracted relation instances) / (total number of relation instances in the test set)
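The evaluation procedure just described — rank extractions by confidence, then compute precision and recall at every cutoff — can be sketched as:

```python
def pr_curve(extractions, n_gold):
    """extractions: list of (confidence, is_correct) pairs;
    n_gold: total number of true relation instances in the test set.
    Returns the (precision, recall) point at each confidence cutoff."""
    ranked = sorted(extractions, key=lambda x: -x[0])  # high confidence first
    curve, correct = [], 0
    for k, (_, ok) in enumerate(ranked, start=1):
        correct += ok
        curve.append((correct / k, correct / n_gold))
    return curve
```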
As the figure shows, even without the transition matrix, adding the data subsets to the training set one by one in order of increasing noise ("data sets trained in sequence") outperforms training directly on all data ("mixed data set"), showing that noise in the data set significantly affects the trained relation extraction model. After adding the transfer matrix ("trained in sequence + transfer matrix"), the relation extraction effect improves markedly over the version without it, showing that the method effectively models the noise, shields the basic relation extractor from its influence, and obtains a better relation extraction effect.
In summary, the embodiment builds a reliable extractor for relations whose object is a time, based on Wikidata and Wikipedia. During training, the disclosed method effectively prevents noise in the data from influencing the relation extractor, so a better-performing extractor is trained.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (8)

1. A method for constructing a relation extraction model based on a transfer matrix, comprising the steps of:

1) selecting a basic relation extraction model M, whose output is a distribution p_i over the relations described by an input sentence i, and which generates as an intermediate result a vector representation s_i of the input sentence i;

2) constructing a transition matrix T_i from the vector representation s_i of the input sentence i, where the element T^i_jk in row j, column k of the matrix T_i represents the probability that the input sentence i expresses relation j but is incorrectly labeled as k;

3) multiplying the relation distribution p_i output by the basic relation extraction model M by the transition matrix T_i and normalizing, obtaining the distribution o_i of relations the input sentence may be labeled as;

4) training the basic relation extraction model M with the distribution o_i fitting the noisy label as the target, until a preset termination condition is reached, obtaining the relation extraction model;

wherein the basic relation extraction model M is trained as follows: a) if the training data cannot be further divided by noise level, the loss function used in training is

loss = Σ_{i=1}^{N} [ α · loss_p(i) + (1 − α) · loss_o(i) − β · Trace(T_i) ]

where N is the total number of samples of the training data, a sample is a sentence or a group of sentences, loss_p(i) is the error of the relation distribution p_i output by the basic relation extractor M for sample i against the noisy label, loss_o(i) is the error of the distribution o_i of relations sample i may be labeled as against the noisy label, Trace(T_i) is the trace of the transition matrix T_i corresponding to sample i, α is a real number between 0 and 1, and β is a real coefficient; b) if the training data can be further divided into subsets by noise level, the loss function used in training is

loss = Σ_{i=1}^{S} Σ_{j=1}^{N_i} [ loss_o(i, j) − β_i · Trace(T_ij) ]

where S is the total number of subsets, N_i is the number of samples of subset TD_i, loss_o(i, j) is the error of the distribution o_ij of relations the jth sample of subset TD_i may be labeled as against the noisy label, and β_i is the coefficient of the regularization term on the trace of the transition matrix T_ij for subset TD_i.
2. The method of claim 1, wherein, if the basic relation extraction model M generates only one vector representation s_i for an input sentence i, then

T^i_jk = exp(w_jk^T · s_i + b) / Σ_{k'=1}^{|C|} exp(w_jk'^T · s_i + b)

where w_jk is the parameter vector used to compute the element in row j, column k of the transition matrix T_i, w_jk^T is the transpose of w_jk, b is a bias term, w_jk and b are updated by back-propagation during training, and |C| is the size of the set of relations to be extracted.
3. The method of claim 1, wherein, if the basic relation extraction model M generates one vector representation s_il per relation l for an input sentence i, then

T^i_jk = exp(s_ij^T · w_k + b_j) / Σ_{k'=1}^{|C|} exp(s_ij^T · w_k' + b_j)

where s_ij is the vector representation of the input sentence i for relation j, w_k is the weight vector of relation k, s_ij^T is the transpose of s_ij, b_j is the bias term of relation j, w_k and b_j are updated by back-propagation during training, and |C| is the size of the set of relations to be extracted.
4. The method of claim 1, wherein the vector representation s_i may further be processed by several fully connected layers to obtain a new vector representation s'_i of sentence i, and the transition matrix T_i is then constructed from s'_i.
5. A method for constructing a relation extraction model based on a transfer matrix comprises the following steps:
1) selecting a basic relationship extraction model M, which is input as a set of sentences describing the same pair of subject and object, and which is output as a distribution p of the relationships described by the set of sentencesiAnd generating an overall vector representation s of the input set of sentences in the intermediate resulti
2) From the global vector representation s of the set of sentencesiConstructing a transition matrix Ti(ii) a Wherein the content of the first and second substances,
Figure FDA0002568573660000021
is a matrix TiThe jth row and kth column of (a), indicating a probability that the set of sentences expresses a relationship of j, but is incorrectly labeled as k;
3) the relation distribution p output by the basic relation extraction model MiMultiplying by the transition matrix TiAnd normalizing to obtain the distribution o of the possible labeled relations of a group of input sentencesi
4) Distribute o in this relationshipiFitting a noisy label as a target, and training the basic relationship extraction model M until a preset termination condition is reached to obtain a relationship extraction model;
the method for training the basic relation extraction model M is as follows: a) if the training data cannot be further divided according to the noise level, the loss function used in training is

L = Σ_{i=1}^{N} [ (1 − α) · l(o_i, ŷ_i) + α · l(p_i, ŷ_i) − β · Trace(T_i) ]

wherein N is the total number of samples in the training data (a sample may be a single sentence or a set of sentences), l(p_i, ŷ_i) represents the error of fitting the relation distribution p_i output by the basic relation extractor M for sample i to its noisy label ŷ_i, l(o_i, ŷ_i) represents the error of fitting the distribution o_i of relations with which sample i may be labeled to its noisy label, Trace(T_i) represents the trace of the transition matrix T_i corresponding to sample i, α is a real number between 0 and 1, and β is a real coefficient; b) if the training data can be further divided into subsets according to the noise level, the loss function used in training is

L = Σ_{i=1}^{S} Σ_{j=1}^{N_i} [ (1 − α) · l(o_ij, ŷ_ij) + α · l(p_ij, ŷ_ij) − β_i · Trace(T_ij) ]

wherein S is the total number of subsets, N_i is the number of samples in subset TD_i, l(o_ij, ŷ_ij) is the error of fitting the distribution o_ij of relations with which the jth sample of subset TD_i may be labeled to its noisy label, and β_i is the regularization coefficient of the trace of the transition matrix T_ij on subset TD_i.
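As an illustrative sketch (not part of the claims), steps 3) and loss a) of claim 5 can be written in NumPy as follows; cross-entropy is assumed here as the fitting error l(·, ŷ), and all array shapes are hypothetical:

```python
import numpy as np

def observed_distribution(p, T):
    """Step 3 of claim 5: multiply each predicted relation
    distribution p_i by its transition matrix T_i and renormalize,
    giving the distribution o_i over possibly-noisy labels.
    p: (N, |C|) distributions; T: (N, |C|, |C|) transition matrices."""
    o = np.einsum('nc,nck->nk', p, T)        # o_ik = sum_j p_ij * T_i(j,k)
    return o / o.sum(axis=1, keepdims=True)  # renormalize each row

def loss_a(p, T, y, alpha=0.5, beta=0.01):
    """Loss a) of claim 5, with cross-entropy assumed as the fitting
    error l(., y): the negative log-probability of the noisy label y."""
    o = observed_distribution(p, T)
    n = len(y)
    l_p = -np.log(p[np.arange(n), y])        # l(p_i, y_i)
    l_o = -np.log(o[np.arange(n), y])        # l(o_i, y_i)
    trace = np.trace(T, axis1=1, axis2=2)    # Trace(T_i) per sample
    return float(np.sum((1 - alpha) * l_o + alpha * l_p - beta * trace))
```

With a row-stochastic T the renormalization in `observed_distribution` is a no-op, but it keeps the sketch valid for arbitrary positive transition scores.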
6. The method of claim 5, wherein if the basic relation extraction model M generates only one vector representation s_i for the set of sentences, then

T_i(j,k) = exp(w_jk^T · s_i + b) / Σ_{k'=1}^{|C|} exp(w_jk'^T · s_i + b)

wherein w_jk is the parameter vector used to calculate the element T_i(j,k) in the jth row and kth column of the transition matrix T_i, w_jk^T is the transpose of w_jk, b is a bias term, w_jk and b are updated by back-propagation during training, and |C| is the size of the set of relations to be extracted.
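A minimal NumPy sketch of this row-wise softmax construction (not part of the claims), assuming a hypothetical weight tensor W with W[j, k] holding the vector w_jk and a scalar bias b:

```python
import numpy as np

def transition_matrix(s_i, W, b):
    """Build T_i from the sentence-set vector s_i (claim 6 sketch).

    W has shape (|C|, |C|, d): W[j, k] is the weight vector w_jk.
    Each row j of T_i is a softmax over k, so T_i[j, k] is the
    probability that relation j is labeled as relation k."""
    logits = np.einsum('jkd,d->jk', W, s_i) + b   # w_jk^T s_i + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # row-wise softmax

# hypothetical sizes: |C| = 4 relations, d = 8 hidden dimensions
rng = np.random.default_rng(0)
T = transition_matrix(rng.normal(size=8), rng.normal(size=(4, 4, 8)), 0.1)
```

Every row of the resulting matrix sums to one, matching its reading as a conditional labeling distribution.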
7. The method of claim 5, wherein if the basic relation extraction model M generates one vector representation s_il for each relation l for the set of sentences, then

T_i(j,k) = exp(w_k^T · s_ij + b_j) / Σ_{k'=1}^{|C|} exp(w_k'^T · s_ij + b_j)

wherein s_ij is the vector representation of the set of sentences for relation j, w_k is the weight vector of relation k, w_k^T is the transpose of w_k, b_j is the bias term of relation j, w_k and b_j are updated by back-propagation during training, and |C| is the size of the set of relations to be extracted.
8. The method of claim 5, wherein the vector representation s_i can be further processed by a plurality of fully connected layers to obtain a new overall vector representation s'_i of the set of sentences, and the transition matrix T_i is then constructed from s'_i.
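A sketch (not part of the claims) of the fully-connected refinement in claims 4 and 8, assuming ReLU activations, which the claims do not fix; the refined vector would then be fed to the transition-matrix construction of claim 6:

```python
import numpy as np

def refine_vector(s_i, layers):
    """Claims 4/8 sketch: pass s_i through fully connected layers,
    given as (weight, bias) pairs, to obtain the refined vector s'_i
    from which the transition matrix T_i is subsequently built.

    ReLU is an assumed choice of activation."""
    h = s_i
    for W, b in layers:
        h = np.maximum(W @ h + b, 0.0)  # affine map + ReLU
    return h
```

For example, two layers mapping an 8-dimensional s_i through a 6-dimensional hidden layer to a 5-dimensional s'_i would be given as `[(W1, b1), (W2, b2)]` with shapes (6, 8) and (5, 6).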
CN201710193366.9A 2017-03-28 2017-03-28 Method for constructing relation extraction model based on transfer matrix Active CN106970981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710193366.9A CN106970981B (en) 2017-03-28 2017-03-28 Method for constructing relation extraction model based on transfer matrix

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710193366.9A CN106970981B (en) 2017-03-28 2017-03-28 Method for constructing relation extraction model based on transfer matrix

Publications (2)

Publication Number Publication Date
CN106970981A CN106970981A (en) 2017-07-21
CN106970981B true CN106970981B (en) 2021-01-19

Family

ID=59336048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710193366.9A Active CN106970981B (en) 2017-03-28 2017-03-28 Method for constructing relation extraction model based on transfer matrix

Country Status (1)

Country Link
CN (1) CN106970981B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276066B (en) * 2018-03-16 2021-07-27 北京国双科技有限公司 Entity association relation analysis method and related device
CN111914091B (en) * 2019-05-07 2022-10-14 四川大学 Entity and relation combined extraction method based on reinforcement learning
CN110489529B (en) * 2019-08-26 2021-12-14 哈尔滨工业大学(深圳) Dialogue generating method based on syntactic structure and reordering
CN110795527B (en) * 2019-09-03 2022-04-29 腾讯科技(深圳)有限公司 Candidate entity ordering method, training method and related device
CN113672727B (en) * 2021-07-28 2024-04-05 重庆大学 Financial text entity relation extraction method and system
CN116542250B (en) * 2023-06-29 2024-04-19 杭州同花顺数据开发有限公司 Information extraction model acquisition method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011118526A (en) * 2009-12-01 2011-06-16 Hitachi Ltd Device for extraction of word semantic relation
CN103678703A (en) * 2013-12-30 2014-03-26 中国科学院自动化研究所 Method and device for extracting open category named entity by means of random walking on map
CN104035975A (en) * 2014-05-23 2014-09-10 华东师范大学 Method utilizing Chinese online resources for supervising extraction of character relations remotely
CN106354710A (en) * 2016-08-18 2017-01-25 清华大学 Neural network relation extracting method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858261B2 (en) * 2014-06-23 2018-01-02 International Business Machines Corporation Relation extraction using manifold models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011118526A (en) * 2009-12-01 2011-06-16 Hitachi Ltd Device for extraction of word semantic relation
CN103678703A (en) * 2013-12-30 2014-03-26 中国科学院自动化研究所 Method and device for extracting open category named entity by means of random walking on map
CN104035975A (en) * 2014-05-23 2014-09-10 华东师范大学 Method utilizing Chinese online resources for supervising extraction of character relations remotely
CN106354710A (en) * 2016-08-18 2017-01-25 清华大学 Neural network relation extracting method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Learning with Noise:Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix;Bingfeng luo et al;《https://www.researchgate.net/publication/318737364_Learning_with_Noise_Enhance_Distantly_Supervised_Relation_Extraction_with_Dynamic_Transition_Matrix》;20170131;1-10 *
Research on Chinese Entity Relation Extraction; Mou Jinjuan et al.; Computer Engineering and Design; 2009-12-31; Vol. 30, No. 15; 3587-3590 *
Research on Kernel-based Chinese Entity Relation Extraction; Huang Ruihong et al.; Journal of Chinese Information Processing; 2008-09-30; Vol. 22, No. 5; 102-108 *

Also Published As

Publication number Publication date
CN106970981A (en) 2017-07-21

Similar Documents

Publication Publication Date Title
CN106970981B (en) Method for constructing relation extraction model based on transfer matrix
CN108984745B (en) Neural network text classification method fusing multiple knowledge maps
CN108363753B (en) Comment text emotion classification model training and emotion classification method, device and equipment
CN107832400B A method for relation classification using a position-based joint LSTM and CNN model
CN106886543B (en) Knowledge graph representation learning method and system combined with entity description
CN104834747B (en) Short text classification method based on convolutional neural networks
WO2022267976A1 (en) Entity alignment method and apparatus for multi-modal knowledge graphs, and storage medium
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN113239186B (en) Graph convolution network relation extraction method based on multi-dependency relation representation mechanism
CN110334219A (en) The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
WO2020063092A1 (en) Knowledge graph processing method and apparatus
CN110222178A Text sentiment classification method, apparatus, electronic device and readable storage medium
CN110674850A (en) Image description generation method based on attention mechanism
CN107220220A (en) Electronic equipment and method for text-processing
WO2019196210A1 (en) Data analysis method, computer readable storage medium, terminal device and apparatus
CN103207855A (en) Fine-grained sentiment analysis system and method specific to product comment information
CN111104509B (en) Entity relationship classification method based on probability distribution self-adaption
CN111027309B (en) Entity attribute value extraction method based on two-way long-short-term memory network
CN109214562A An RNN-based method for predicting and pushing power-grid research hotspots
CN112836051B (en) Online self-learning court electronic file text classification method
US20220036003A1 (en) Methods and systems for automated detection of personal information using neural networks
CN114722820A (en) Chinese entity relation extraction method based on gating mechanism and graph attention network
CN111368082A (en) Emotion analysis method for domain adaptive word embedding based on hierarchical network
CN114925205B (en) GCN-GRU text classification method based on contrast learning
CN108920446A A method for processing engineering documents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant