CN108287911B - Relation extraction method based on constrained remote supervision - Google Patents

Relation extraction method based on constrained remote supervision

Info

Publication number
CN108287911B
Authority
CN
China
Prior art keywords
sentence
sentences
data
training
confidence
Prior art date
Legal status
Active
Application number
CN201810103633.3A
Other languages
Chinese (zh)
Other versions
CN108287911A (en)
Inventor
汤斯亮
张金剑
袁愈锦
吴飞
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810103633.3A priority Critical patent/CN108287911B/en
Publication of CN108287911A publication Critical patent/CN108287911A/en
Application granted granted Critical
Publication of CN108287911B publication Critical patent/CN108287911B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a relation extraction method based on constrained remote supervision, which comprises the following steps: (1) constructing an external knowledge base; (2) acquiring text data; (3) obtaining sentences containing the attribute by using a remote supervision method; (4) obtaining confidence information for the sentences by using a pre-trained model; (5) regularizing the network with the confidence information and computing the normalized posterior probability to obtain the relation label. The method uses regularized posterior probabilities to extract features from text sentences automatically, which saves manual effort while yielding more abstract and expressive features. Its performance surpasses that of traditional relation extraction algorithms and of several mainstream algorithms of recent years.

Description

Relation extraction method based on constrained remote supervision
Technical Field
The invention relates to text feature extraction and relationship extraction, in particular to a relationship extraction method based on constrained remote supervision.
Background
The world is in an era of information explosion, and the popularity and rapid development of the Internet generate massive information resources. These resources are of great significance to the development of science and technology: the scientific community needs them as basic material for research, and industry needs them to mine potential business opportunities. How to make use of these Internet information resources has therefore become one of the mainstream research directions in recent years.
Although the amount of information resources on the Internet is enormous, these resources tend to lack structure. Structured data refers to row data, i.e., data that can be expressed in a two-dimensional table structure, whereas unstructured data has fields of variable length and is inconvenient to express in a two-dimensional logical table. Since many of these resources are unstructured or semi-structured, finding and understanding them quickly and effectively is greatly limited.
Text data is an important part of the information resources on the Internet, and most text data on the Internet is also unstructured, such as news, blogs, emails, government documents, chat records and system logs. To make efficient use of this unstructured text data, information extraction techniques have emerged: they automatically convert unstructured or semi-structured text in an input page into structured data. An information extraction task is defined by its input and its extraction target: the input can be an unstructured document written in natural language or a semi-structured document on a web page, and the extraction target is a k-tuple relation (where k is the number of attributes of a record) or a more complex hierarchical data object.
Traditional relation extraction techniques have many shortcomings. First, whether rule-based or classification-based, they require considerable manual intervention, such as rule design in rule-based methods and data annotation and feature design in classification-based methods. This manual intervention is costly, since authoritative labeled data can only be obtained through annotation by professionals; at the same time, manual work introduces errors that accumulate in the subsequent algorithms and finally cause excessive deviation of the results. Second, the training data set is limited to a particular domain, so the algorithms lack generality; for example, a relation extraction classifier trained on sports news cannot be used well on other kinds of news. In general the results are not ideal, because the manually designed rules of rule-based methods are limited, while classification-based methods suffer from limited labeled data and depend on the quality of manually designed features.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a relation extraction method based on constrained remote supervision. The invention adopts the following specific technical scheme:
the relation extraction method based on constrained remote supervision comprises the following steps
S1: acquiring information frame data of Wikipedia, converting the information frame data into entity pairs, and constructing a structured external knowledge base;
s2: obtaining forum and news data, and constructing an unstructured text corpus;
s3: searching sentences containing entity pairs in a text corpus by using a remote supervision method to form an original sentence set;
s4: marking partial sentences, training a model by using marked data to obtain a pre-training model, and inputting the unprocessed sentences into the pre-training model to obtain the posterior probability of model output;
s5: and inputting the original sentence set data set and the posterior probability thereof into a network, and training a model to obtain the relational label.
In the above scheme, each step can be specifically realized by adopting the following mode:
the S1 specifically includes the following steps:
s11: downloading the public data of Wikipedia;
s12: extracting the information frame (infobox) of an entry in Wikipedia, mapping the name of the information frame to a relation name, and storing the attribute value and the entity name to form an entity pair;
s13: for all entries, attribute values and entity names that have the same relationship are saved together.
The S2 specifically includes the following steps:
s21: downloading news data;
s22: preprocessing the text data, removing tags such as HTML (hypertext markup language) or XML (extensible markup language) and the like, converting the character coding format into utf-8, and converting the format into pure text data;
s23: and using a natural language processing tool to perform word segmentation on the plain text data and extracting named entity information.
The S3 specifically includes the following steps:
s31: constructing a positive sample: under the same relation type, if the entity name and the attribute value appear in a certain sentence at the same time, marking the sentence as a positive sample;
s32: constructing a negative sample: under the same relationship type, if an entity name appears in a sentence, an attribute value does not appear in the sentence, but the sentence contains named entity information of the attribute value, the sentence is marked as a negative sample;
s33: balancing the number of samples: randomly sampling the negative samples so that the number of negative samples equals the number of positive samples.
The S4 specifically includes the following steps:
s41: selecting part of the sentences found in S3 and storing them in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_N]
wherein N is the number of selected sentences and sentence_N is the Nth sentence;
s42, manually labeling the selected sentences, and judging whether the sentences contain the relationship:
{sentence_1: label_1, sentence_2: label_2, ..., sentence_N: label_N}
wherein label_N represents the label of the Nth sentence;
s43, inputting the marked sentences into a classification algorithm, and training a network to obtain a pre-training model theta;
s44, selecting the sentences which are not manually marked and storing the sentences in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_M]
wherein M represents the number of unlabeled sentences;
s45: inputting the unlabeled sentences into a pre-training model theta to obtain the confidence coefficient:
{sentence_1: confidence_1, sentence_2: confidence_2, ..., sentence_M: confidence_M}
wherein confidence_M represents the confidence of the Mth sentence.
The S5 specifically includes the following steps:
s51: acquiring a training data set:
x = [x_1, x_2, ..., x_l]
wherein x_i represents the ith sentence in the training set and l represents the number of sentences in the training set;
s52: inputting the sentences of the training data set into the pre-training model θ to obtain the posterior probability output p(x'_j) of the pre-training model on the jth class:
p(x'_j) = exp(p_j) / ∑_{k=1..K} exp(p_k)
wherein K represents the number of classes and p_j is the output result of the pre-training model on the jth class;
s53: inputting the confidence of the sample into the network to calculate the constraint value:
con=exp(η(λ-confidence))
wherein η represents a penalty factor and λ represents a set threshold;
s54: the posterior probabilities of all the relations are normalized as follows:
p(x_j) = p(x'_j) · exp(η(λ-confidence)) / Z
Z=∑exp(η(λ-confidence))
wherein p(x_j) is the posterior probability after normalization processing;
s56: and selecting the relation with the maximum posterior probability as a prediction label of relation classification.
In S53, the confidence of the input samples consists of two parts: for manually labeled sentences the confidence is 1, and for unlabeled sentences the confidence is calculated according to step S45.
In order to overcome the defects of the traditional methods, the invention provides a relation extraction method based on constrained remote supervision. The invention uses regularized posterior probabilities to extract features from text sentences automatically, which saves manual effort while yielding more abstract and expressive features. The method outperforms traditional relation extraction algorithms as well as several mainstream algorithms of recent years.
Drawings
FIG. 1 is a schematic illustration of the core constrained (regularized) remote supervision model used by the present invention. The left side of the figure shows two positive samples and two negative samples, which are extracted by the remote supervision method; the prediction label is obtained after the posterior probability calculation and the regularization.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
A relation extraction method based on constrained remote supervision comprises the following steps:
s1: acquiring information frame data of Wikipedia, converting the information frame data into entity pairs, and constructing a structured external knowledge base; the specific implementation manner of the step is as follows:
s11: downloading the public data of Wikipedia;
s12: extracting the information frame (infobox) of an entry in Wikipedia, mapping the name of the information frame to a relation name, and storing the attribute value and the entity name to form an entity pair;
s13: for all entries, attribute values and entity names that have the same relationship are saved together.
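The patent gives no code for S11-S13; the following Python sketch shows one way the infobox-to-entity-pair conversion could look, assuming the Wikipedia dump has already been parsed into (entity name, infobox dictionary) pairs and reading each infobox field name as the relation name. All function and variable names are illustrative, not part of the original disclosure.

```python
from collections import defaultdict

def build_knowledge_base(parsed_infoboxes):
    """S12-S13: group (entity name, attribute value) pairs by relation name.

    `parsed_infoboxes` is assumed to be an iterable of
    (entity_name, {field_name: attribute_value, ...}) tuples already
    extracted from the Wikipedia dump (S11).
    """
    knowledge_base = defaultdict(set)            # relation -> {(entity, value), ...}
    for entity_name, infobox in parsed_infoboxes:
        for field_name, attribute_value in infobox.items():
            relation = field_name                # one reading of S12: field name as relation name
            knowledge_base[relation].add((entity_name, attribute_value))
    return knowledge_base

# Toy input, purely for illustration:
infoboxes = [
    ("Zhejiang University", {"location": "Hangzhou", "established": "1897"}),
    ("Peking University", {"location": "Beijing", "established": "1898"}),
]
kb = build_knowledge_base(infoboxes)
print(kb["location"])
```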
S2: obtaining forum and news data, and constructing an unstructured text corpus; the specific implementation manner of the step is as follows:
s21: downloading news data, such as the public data of the People's Daily;
s22: preprocessing the text data, removing tags such as HTML (hypertext markup language) or XML (extensible markup language) and the like, converting the character coding format into utf-8, and converting the format into pure text data;
s23: tokenizing the plain text data using a natural language processing tool, such as the jieba word segmenter, and extracting named entity information.
In this invention, the structured data and the unstructured data directly adopt the TAC-KBP 2016 dataset (a minimal preprocessing sketch is given below).
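As a rough illustration of S21-S23, the sketch below strips HTML/XML tags, normalizes the encoding to UTF-8, and tokenizes the text; jieba is used only as an example word segmenter, and the named-entity step is left as a placeholder because the patent does not name a specific NER tool.

```python
import re
import jieba  # example word segmenter; the patent only says "a natural language processing tool"

def to_plain_text(raw_bytes: bytes) -> str:
    """S22: decode to UTF-8 and strip HTML/XML tags to obtain plain text."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML/XML tags
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str):
    """S23: word segmentation (jieba here is purely illustrative)."""
    return jieba.lcut(text)

def extract_named_entities(tokens):
    """Placeholder for S23's named-entity extraction; plug in any NER system."""
    raise NotImplementedError("the patent does not specify an NER toolkit")

doc = "<p>浙江大学位于杭州。</p>".encode("utf-8")
print(tokenize(to_plain_text(doc)))
```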
S3: as shown in fig. 1, using a remote supervision method, a text corpus is searched for sentences containing entity pairs to form an original sentence set; the specific implementation manner of the step is as follows:
s31: constructing a positive sample: under the same relation type, if the entity name and the attribute value appear in a certain sentence at the same time, marking the sentence as a positive sample;
s32: constructing a negative sample: under the same relationship type, if an entity name appears in a sentence, an attribute value does not appear in the sentence, but the sentence contains named entity information of the attribute value, the sentence is marked as a negative sample;
s33: balancing the number of samples: randomly sampling the negative samples so that the number of negative samples equals the number of positive samples (a code sketch of S31-S33 follows).
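A minimal sketch of the distant-supervision labelling in S31-S33. It assumes each sentence is stored as a dictionary carrying its text and the named-entity types found in S23, and that the knowledge base has the {relation: {(entity, value), ...}} form sketched above; these data layouts and names are assumptions for illustration only.

```python
import random

def build_samples(sentences, knowledge_base, relation, value_entity_type, seed=0):
    """S31-S33: build positive and negative samples for one relation type.

    sentences         : list of {"text": str, "entity_types": set of NER types}
    knowledge_base    : {relation: {(entity_name, attribute_value), ...}}
    value_entity_type : NER type that the attribute value of this relation has
                        (e.g. "LOCATION"), used for the negative test in S32.
    """
    positives, negatives = [], []
    for entity_name, attribute_value in knowledge_base[relation]:
        for sentence in sentences:
            text = sentence["text"]
            if entity_name not in text:
                continue
            if attribute_value in text:
                # S31: entity name and attribute value co-occur -> positive sample
                positives.append((text, entity_name, attribute_value, 1))
            elif value_entity_type in sentence["entity_types"]:
                # S32: entity present, value absent, but an entity of the value's
                # NER type is present -> negative sample
                negatives.append((text, entity_name, attribute_value, 0))
    # S33: randomly down-sample the negatives to the number of positives
    random.seed(seed)
    negatives = random.sample(negatives, min(len(negatives), len(positives)))
    return positives + negatives
```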
S4: marking partial sentences, training a model by using marked data to obtain a pre-training model, and inputting the unprocessed sentences into the pre-training model to obtain the posterior probability of model output; the specific implementation manner of the step is as follows:
s41: selecting part of the sentences found in S3 and storing them in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_N]
wherein N is the number of selected sentences and sentence_N is the Nth sentence;
s42, manually labeling the selected sentences, and judging whether the sentences contain the relationship:
{sentence_1: label_1, sentence_2: label_2, ..., sentence_N: label_N}
wherein label_N represents the label of the Nth sentence;
s43, inputting the marked sentences into a classification algorithm, and training a network to obtain a pre-training model theta;
s44, selecting the sentences which are not manually marked and storing the sentences in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_M]
wherein M represents the number of unlabeled sentences;
s45: inputting the unlabeled sentences into a pre-training model theta to obtain the confidence coefficient:
{sentence_1: confidence_1, sentence_2: confidence_2, ..., sentence_M: confidence_M}
wherein confidence_M represents the confidence of the Mth sentence.
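The patent leaves the choice of classifier for the pre-training model θ open ("a classification algorithm"); the sketch below uses a TF-IDF plus logistic-regression pipeline from scikit-learn purely as a stand-in, so that the S45 confidence can be read off as the probability of the most likely class. The library choice, function names, and this reading of "confidence" are assumptions, not part of the disclosure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pretrain(labeled_sentences, labels):
    """S42-S43: fit the pre-training model theta on the manually labeled sentences."""
    theta = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    theta.fit(labeled_sentences, labels)
    return theta

def score_confidences(theta, unlabeled_sentences):
    """S44-S45: confidence of each unlabeled sentence, taken here as the
    probability of its most likely class under the pre-training model."""
    probabilities = theta.predict_proba(unlabeled_sentences)
    return {s: float(p.max()) for s, p in zip(unlabeled_sentences, probabilities)}
```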
S5: and inputting the original sentence set data set and the posterior probability thereof into a network, and training a model to obtain the relational label. The specific implementation manner of the step is as follows:
s51: acquiring a training data set:
x = [x_1, x_2, ..., x_l]
wherein x_i represents the ith sentence in the training set and l represents the number of sentences in the training set;
s52: inputting the sentences of the training data set into the pre-training model θ to obtain the posterior probability output p(x'_j) of the pre-training model on the jth class:
p(x'_j) = exp(p_j) / ∑_{k=1..K} exp(p_k)
wherein K represents the number of classes and p_j is the output result of the pre-training model on the jth class;
s53: inputting the confidence of each sample into the network to calculate its constraint value. The confidence of the input samples consists of two parts: for manually labeled sentences the confidence is 1, and for unlabeled sentences the confidence is calculated according to step S45. The constraint value is calculated as follows:
con=exp(η(λ-confidence))
wherein η represents a penalty factor and λ represents a set threshold;
s54: the posterior probabilities of all the relations are normalized as follows:
p(x_j) = p(x'_j) · exp(η(λ-confidence)) / Z
Z=∑exp(η(λ-confidence))
wherein p(x_j) is the posterior probability after normalization processing;
s56: and selecting the relation with the maximum posterior probability as a prediction label of relation classification.
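The published text defines con and Z but leaves their exact use in the normalization to the unreproduced equation images, so the sketch below is a best-effort reading of S52-S56: the softmax posterior of each sentence is scaled by con/Z, where Z sums the constraint values over the batch, and the prediction is the class with the largest resulting posterior. Treat the batch-level interpretation of Z as an assumption.

```python
import numpy as np

def constrained_posteriors(batch_scores, confidences, eta, lam):
    """S52-S56 for a batch of sentences.

    batch_scores : (N, K) array of pre-training-model outputs p_j per sentence.
    confidences  : length-N sequence; 1.0 for labeled sentences, S45 value otherwise.
    eta, lam     : penalty factor and threshold of S53.
    """
    # S52: softmax posterior of the pre-training model, per sentence
    shifted = batch_scores - batch_scores.max(axis=1, keepdims=True)
    posterior = np.exp(shifted)
    posterior /= posterior.sum(axis=1, keepdims=True)

    # S53: per-sentence constraint value con = exp(eta * (lambda - confidence))
    con = np.exp(eta * (lam - np.asarray(confidences)))

    # S54: normalization constant Z over the batch; each sentence's posterior is
    # scaled by con / Z (a per-sentence weight; the per-sentence argmax is unchanged)
    Z = con.sum()
    regularized = posterior * (con / Z)[:, None]

    # S56: predicted relation = class with the maximum posterior for each sentence
    return regularized.argmax(axis=1), regularized

predictions, _ = constrained_posteriors(
    np.array([[1.2, 0.3, -0.5], [0.1, 0.9, 0.2]]), confidences=[1.0, 0.7],
    eta=2.0, lam=0.9)
```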
The method is applied to the following examples in order that those skilled in the art will better understand the specific implementation of the present invention.
Examples
In this embodiment, taking a section of news text submitted by a user as an example, the relationship extraction is performed by using the above method, and specific parameters and methods in each implementation step are as follows:
1. The information frame (infobox) data are converted into entity pairs, which are stored in the following sequence form and correspond to the order of the sentences in the candidate set:
{(entity_1, slotfiller_1), (entity_2, slotfiller_2), ..., (entity_N1, slotfiller_N1)}
2. Search whether each input sentence contains an entity, and form the sentences containing entities into the original sentence set:
{sentence_1, sentence_2, ..., sentence_N}
3. Search whether the sentences of the original sentence set contain the attribute value, and form the sentences containing the attribute value into a candidate sentence set; the sentences in the candidate set contain both the entity and the attribute value:
{candidate_1, candidate_2, ..., candidate_N1}
4. Manually label part of the data to obtain accurate manually labeled data:
{sentence_1: label_1, sentence_2: label_2, ..., sentence_N: label_N}
5. Train the model to obtain the parameter θ:
{word_1: vector_1, word_2: vector_2, ..., word_N: vector_N}
6. Acquire the candidate data that have not been manually labeled:
[sentence_1, sentence_2, sentence_3, ..., sentence_M]
7. Input the candidate data that have not been manually labeled into the trained network and obtain the corresponding confidences:
{sentence_1: confidence_1, sentence_2: confidence_2, ..., sentence_M: confidence_M}
8. Input the candidate data into the network to obtain the posterior probability:
p(x'_j) = exp(p_j) / ∑_{k=1..K} exp(p_k)
9. Calculate the constraint value from the confidence:
con=exp(η(λ-confidence))
10. Calculate the normalization constant:
Z=∑exp(η(λ-confidence))
11. Calculate the normalized posterior probability:
p(x_j) = p(x'_j) · exp(η(λ-confidence)) / Z
12. The class with the maximum posterior probability is taken as the prediction label (a small numeric illustration of steps 9-11 follows).
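Purely as a numeric illustration of steps 9-11 (the confidences, η and λ below are made-up values, not taken from the embodiment):

```python
import math

confidences = [1.0, 0.8, 0.4]   # e.g. one labeled sentence and two unlabeled ones (toy values)
eta, lam = 2.0, 0.9             # penalty factor and threshold (toy values)

con = [math.exp(eta * (lam - c)) for c in confidences]   # step 9: constraint values
Z = sum(con)                                             # step 10: normalization constant
weights = [c / Z for c in con]                           # step 11: per-sentence factor con / Z

# con grows as the confidence falls below the threshold lambda
print([round(w, 3) for w in weights])
```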
As shown in Table 1, a comparison of the method of the present invention with existing mainstream methods on the TAC-KBP 2016 dataset shows that the invention has clear advantages on the Precision, Recall and F1-Score evaluation criteria.
TABLE 1
Model                        Precision    Recall    F1-Score
PCNN                         -            -         0.52
CNN                          0.499        0.483     0.453
Text model (the proposed)    0.547        0.559     0.553
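For reference, Precision, Recall and F1-Score in Table 1 are the standard classification metrics; a minimal computation with toy predictions (not the TAC-KBP results) looks as follows.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Standard precision / recall / F1 for one relation label."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # toy example
```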

Claims (6)

1. A relation extraction method based on constrained remote supervision is characterized by comprising the following steps:
s1: acquiring information frame data of Wikipedia, converting the information frame data into entity pairs, and constructing a structured external knowledge base;
s2: obtaining forum and news data, and constructing an unstructured text corpus;
s3: searching sentences containing entity pairs in a text corpus by using a remote supervision method to form an original sentence set;
s4: marking partial sentences, training a model by using marked data to obtain a pre-training model, and inputting the unprocessed sentences into the pre-training model to obtain the posterior probability of model output;
s5: inputting the original sentence set data set and the posterior probability thereof into a network, training a model and obtaining a relational tag;
the S5 specifically includes the following steps:
s51: acquiring a training data set:
x = [x_1, x_2, ..., x_l]
wherein x_i represents the ith sentence in the training set and l represents the number of sentences in the training set;
s52: inputting the sentences of the training data set into the pre-training model θ to obtain the posterior probability output p(x'_j) of the pre-training model on the jth class:
p(x'_j) = exp(p_j) / ∑_{k=1..K} exp(p_k)
wherein K represents the number of classes and p_j is the output result of the pre-training model on the jth class;
s53: inputting the confidence of the sample into the network to calculate the constraint value:
con=exp(η(λ-confidence))
wherein η represents a penalty factor and λ represents a set threshold;
s54: the posterior probabilities of all the relations are normalized as follows:
p(x_j) = p(x'_j) · exp(η(λ-confidence)) / Z
Z=∑exp(η(λ-confidence))
wherein p(x_j) is the posterior probability after normalization processing;
s56: and selecting the relation with the maximum posterior probability as a prediction label of relation classification.
2. The method for extracting relationship based on constrained remote supervision as recited in claim 1, wherein the step S1 specifically comprises the following steps:
s11: downloading the public data of Wikipedia;
s12: extracting the information frame (infobox) of an entry in Wikipedia, mapping the name of the information frame to a relation name, and storing the attribute value and the entity name to form an entity pair;
s13: for all entries, attribute values and entity names that have the same relationship are saved together.
3. The method for extracting relationship based on constrained remote supervision as recited in claim 1, wherein the step S2 specifically comprises the following steps:
s21: downloading news data;
s22: preprocessing the text data, removing HTML or XML labels, converting the character coding format into utf-8, and converting the format into pure text data;
s23: and using a natural language processing tool to perform word segmentation on the plain text data and extracting named entity information.
4. The method for extracting relationship based on constrained remote supervision as recited in claim 1, wherein the step S3 specifically comprises the following steps:
s31: constructing a positive sample: under the same relation type, if the entity name and the attribute value appear in a certain sentence at the same time, marking the sentence as a positive sample;
s32: constructing a negative sample: under the same relation type, if an entity name appears in a certain sentence, an attribute value does not appear in the sentence, but the sentence contains named entity information of the attribute value, the sentence is marked as a negative sample;
s33: balancing the number of samples: randomly sampling the negative samples so that the number of negative samples equals the number of positive samples.
5. The method for extracting relationship based on constrained remote supervision as recited in claim 1, wherein the step S4 specifically comprises the following steps:
s41: selecting part of the sentences found in S3 and storing them in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_N]
wherein N is the number of selected sentences and sentence_N is the Nth sentence;
s42, manually labeling the selected sentences, and judging whether the sentences contain the relationship:
{sentence_1: label_1, sentence_2: label_2, ..., sentence_N: label_N}
wherein label_N represents the label of the Nth sentence;
s43, inputting the marked sentences into a classification algorithm, and training a network to obtain a pre-training model theta;
s44, selecting the sentences which are not manually marked and storing the sentences in the following sequence form:
[sentence_1, sentence_2, sentence_3, ..., sentence_M]
wherein M represents the number of unlabeled sentences;
s45: inputting the unlabeled sentences into a pre-training model theta to obtain the confidence coefficient:
{sentence_1: confidence_1, sentence_2: confidence_2, ..., sentence_M: confidence_M}
wherein confidence_M represents the confidence of the Mth sentence.
6. The method for extracting relationship based on constrained remote supervision as claimed in claim 1, wherein in S53 the confidence of the input samples consists of two parts: for manually labeled sentences the confidence is 1, and for unlabeled sentences the confidence is calculated according to step S45.
CN201810103633.3A 2018-02-01 2018-02-01 Relation extraction method based on constrained remote supervision Active CN108287911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810103633.3A CN108287911B (en) 2018-02-01 2018-02-01 Relation extraction method based on constrained remote supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810103633.3A CN108287911B (en) 2018-02-01 2018-02-01 Relation extraction method based on constrained remote supervision

Publications (2)

Publication Number Publication Date
CN108287911A CN108287911A (en) 2018-07-17
CN108287911B true CN108287911B (en) 2020-04-24

Family

ID=62836441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810103633.3A Active CN108287911B (en) 2018-02-01 2018-02-01 Relation extraction method based on constrained remote supervision

Country Status (1)

Country Link
CN (1) CN108287911B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472033B (en) * 2018-11-19 2022-12-06 华南师范大学 Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN111914555B (en) * 2019-05-09 2022-08-23 中国人民大学 Automatic relation extraction system based on Transformer structure
CN110276081B (en) * 2019-06-06 2023-04-25 百度在线网络技术(北京)有限公司 Text generation method, device and storage medium
CN113282758A (en) * 2020-02-19 2021-08-20 复旦大学 Depth relation extraction method for theme knowledge fusion under government control field
CN111783463B (en) * 2020-06-30 2024-08-13 北京百度网讯科技有限公司 Knowledge extraction method and device
CN111859238B (en) * 2020-07-27 2024-07-16 平安科技(深圳)有限公司 Model-based method, device and computer equipment for predicting data change frequency
CN112307130B (en) * 2020-10-21 2022-07-05 清华大学 Document-level remote supervision relation extraction method and system
CN112860903B (en) * 2021-04-06 2022-02-22 哈尔滨工业大学 Remote supervision relation extraction method integrated with constraint information
CN113807518B (en) * 2021-08-16 2024-04-05 中央财经大学 Relation extraction system based on remote supervision

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049454A (en) * 2011-10-16 2013-04-17 同济大学 Chinese and English search result visualization system based on multi-label classification
CN105389354A (en) * 2015-11-02 2016-03-09 东南大学 Social media text oriented unsupervised method for extracting and sorting events
CN106570148A (en) * 2016-10-27 2017-04-19 浙江大学 Convolutional neutral network-based attribute extraction method
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI
CN107644101A (en) * 2017-09-30 2018-01-30 百度在线网络技术(北京)有限公司 Information classification approach and device, information classification equipment and computer-readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10133728B2 (en) * 2015-03-20 2018-11-20 Microsoft Technology Licensing, Llc Semantic parsing for complex knowledge extraction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049454A (en) * 2011-10-16 2013-04-17 同济大学 Chinese and English search result visualization system based on multi-label classification
CN105389354A (en) * 2015-11-02 2016-03-09 东南大学 Social media text oriented unsupervised method for extracting and sorting events
CN106570148A (en) * 2016-10-27 2017-04-19 浙江大学 Convolutional neutral network-based attribute extraction method
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI
CN107644101A (en) * 2017-09-30 2018-01-30 百度在线网络技术(北京)有限公司 Information classification approach and device, information classification equipment and computer-readable medium

Also Published As

Publication number Publication date
CN108287911A (en) 2018-07-17

Similar Documents

Publication Publication Date Title
CN108287911B (en) Relation extraction method based on constrained remote supervision
CN109033374B (en) Knowledge graph retrieval method based on Bayesian classifier
CN110222160A (en) Intelligent semantic document recommendation method, device and computer readable storage medium
CN103218444B (en) Based on semantic method of Tibetan language webpage text classification
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN101127042A (en) Sensibility classification method based on language model
CN112417891B (en) Text relation automatic labeling method based on open type information extraction
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN107180045A (en) A kind of internet text contains the abstracting method of geographical entity relation
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN115796181A (en) Text relation extraction method for chemical field
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN111428501A (en) Named entity recognition method, recognition system and computer readable storage medium
CN113627190A (en) Visualized data conversion method and device, computer equipment and storage medium
CN111814477A (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN112860898A (en) Short text box clustering method, system, equipment and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111325036A (en) Emerging technology prediction-oriented evidence fact extraction method and system
CN117573869A (en) Network connection resource key element extraction method
CN117574858A (en) Automatic generation method of class case retrieval report based on large language model
CN114238735B (en) Intelligent internet data acquisition method
CN114996455A (en) News title short text classification method based on double knowledge maps

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant