CN111125370A - Relation extraction method suitable for small samples - Google Patents


Info

Publication number
CN111125370A
CN111125370A (application CN201911240521.3A)
Authority
CN
China
Prior art keywords
training
data
relation
model
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911240521.3A
Other languages
Chinese (zh)
Inventor
卓可秋
杨秀燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SINOVATIO TECHNOLOGY CO LTD
Original Assignee
NANJING SINOVATIO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING SINOVATIO TECHNOLOGY CO LTD filed Critical NANJING SINOVATIO TECHNOLOGY CO LTD
Priority to CN201911240521.3A priority Critical patent/CN111125370A/en
Publication of CN111125370A publication Critical patent/CN111125370A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a relation extraction method suitable for small samples, which comprises the following steps: (1) acquiring training data; (2) training a general domain relation knowledge model; (3) training a specific domain relation extraction model. A general domain relation knowledge module is used to learn the knowledge common to many relation types; training samples are generated automatically by remote supervision over an open-source knowledge graph, and unsupervised methods are combined to filter out noisy data, so that general and domain-specific relation knowledge models can be trained while the amount of manually labeled data is reduced. Because building the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.

Description

Relation extraction method suitable for small samples
Technical Field
The invention relates to the technical field of knowledge graph construction, in particular to a relation extraction method suitable for small samples.
Background
Information extraction is an important component of natural language processing; in today's information society, extracting useful information from massive data is especially significant. Information extraction can be divided into entity extraction, relation extraction, event extraction, and so on. Relation extraction judges, on the basis of an extracted entity pair, whether a given relation holds between the two entities, with "no relation" treated as a special relation.
As relation extraction shifts from a fixed set of relation types to the diverse relation types of the open domain, the data source shifts from standard corpora to massive web data. Traditional pattern-matching methods cannot cover the many surface forms of a large number of relation types, and it is impractical to define every pattern by hand. Supervised machine learning and deep learning methods achieve high accuracy, but they require large-scale manually labeled corpora, and the labeling cost is so high that training data are extremely scarce. Research has therefore turned to preserving model performance while reducing the amount of manually labeled data.
In 2009, Mintz et al proposed a remote supervision method, and the basic idea was to rely on an existing knowledge base, obtain a text containing an entity pair in the knowledge base from the text as a training corpus, and Mintz proposed an assumption that if a certain relationship of a certain entity pair exists in the knowledge base, all data containing the pair of entities express the relationship. Although this approach can greatly reduce the trouble of manual labeling, remote surveillance also poses a negative problem in that tag data has a large amount of noisy data. This is exactly the problem that remote supervision needs to solve. In 2018, Google provides a strong Bert model, the performance is excellent when processing downstream tasks, and other corresponding tasks can be processed only by connecting a network layer behind the Bert model for fine adjustment. However, for various relation types, in the process of relation extraction in a specific field, even if basic language knowledge can be learned by using a Bert model to improve the extraction result, the problem that the data which needs to depend on marking cannot be solved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a relation extraction method suitable for small samples, which avoids the time and cost of large-scale manual labeling and allows a domain-specific relation extraction task to be completed with a small amount of labeled data from that domain.
In order to solve the above technical problem, the present invention provides a relation extraction method suitable for small samples, comprising the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
Preferably, in step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (relation name and entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
Preferably, in step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
Preferably, in step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
The beneficial effects of the invention are: a general domain relation knowledge module is used to learn the knowledge common to many relation types; training samples are generated automatically by remote supervision over an open-source knowledge graph, and unsupervised methods are combined to filter out noisy data, so that general and domain-specific relation knowledge models can be trained while the amount of manually labeled data is reduced; because building the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.
Drawings
FIG. 1 is a structural diagram of the general domain relation knowledge model of the present invention.
FIG. 2 is a schematic diagram of the structure of the Transformer layer of the present invention.
FIG. 3 is a schematic diagram of the training corpus acquisition flow of the present invention.
FIG. 4 is a diagram illustrating relationship extraction in accordance with the specific embodiment of the present invention.
Detailed Description
A relation extraction method suitable for small samples comprises the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
In step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (relation name and entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
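Steps (12)–(15) above can be sketched in a few lines of Python. Everything below — the toy triples, the toy sentences, and the helper `build_bags` — is an illustrative stand-in for the patent's Freebase-derived data, not part of the disclosure; NER output is assumed to be already available.

```python
from collections import defaultdict

# Step (12): triples from a public KB -- (head entity, relation, tail entity).
# These triples are invented for illustration.
kb_triples = [
    ("Lijiang ancient city", "geographic_location", "Lijiang city"),
    ("Mintz", "proposed", "remote supervision"),
]

# Step (13): sentences with their recognized entity pairs (NER output assumed).
sentences = [
    ("Lijiang ancient city is located in Lijiang city.",
     ("Lijiang ancient city", "Lijiang city")),
    ("Lijiang ancient city lies northwest of the Lijiang city center.",
     ("Lijiang ancient city", "Lijiang city")),
    ("Paris is the capital of France.", ("Paris", "France")),
]

def build_bags(kb_triples, sentences):
    """Step (14): assign a relation to every sentence whose entity pair
    appears in the KB, and place sentences sharing a pair in one bag."""
    relation_of = {(h, t): r for h, r, t in kb_triples}
    bags = defaultdict(list)
    for text, pair in sentences:
        if pair in relation_of:  # the remote supervision assumption
            bags[(pair, relation_of[pair])].append(text)
    return dict(bags)

bags = build_bags(kb_triples, sentences)
for (pair, relation), texts in bags.items():
    print(pair, relation, len(texts))
```

The (Paris, France) pair has no KB triple, so it is dropped; the two Lijiang sentences end up in a single bag labeled with the relation, which is exactly the unit that the later denoising step (15) operates on.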
In step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) network training parameter settings: Batchsize 64, initial learning rate 1e-4, optimizer BertAdam, anti-overfitting strategy L2 regularization, Epoch 100, maximum sentence length 200.
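The preprocessing of step (21) can be illustrated as follows. The tokenization, the [start, end) span format, and the helper `entity_positions` are assumptions made for this sketch; the patent does not fix the exact representation.

```python
def entity_positions(tokens, head, tail):
    """Step (21) sketch: locate the head and tail entities in a tokenized
    sentence and return [start, end) index pairs for each."""
    def find(span):
        n = len(span)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == span:
                return [i, i + n]
        raise ValueError(f"entity {span} not found in sentence")
    return find(head), find(tail)

tokens = ["Lijiang", "ancient", "city", "is", "located", "in",
          "Lijiang", "city"]
head_pos, tail_pos = entity_positions(
    tokens, ["Lijiang", "ancient", "city"], ["Lijiang", "city"])

# Position information plus word information is what the model receives.
model_input = {"tokens": tokens, "head": head_pos, "tail": tail_pos}
print(model_input["head"], model_input["tail"])
```

Note that the search scans left to right, so the tail span ["Lijiang", "city"] resolves to the second occurrence at index 6 rather than overlapping the head entity.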
In step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) network training parameter settings: Batchsize 32, initial learning rate 1e-5, optimizer BertAdam, anti-overfitting strategy L2 regularization, Epoch 30, maximum sentence length 200.
The structure of the general domain relation knowledge model is shown in FIG. 1. A sentence to be processed for relation extraction is converted into the input format of the BERT model (word embedding + position embedding + segment embedding) and passed in as the Input Embedding; the output of the BERT model is then fed into a Transformer layer as input for training. The knowledge learned by the Transformer layer is the general domain relation knowledge. The BERT model and the Transformer layer together form the general domain relation knowledge model.
The structure of the Transformer layer is shown in FIG. 2. This layer uses the Transformer mechanism to learn relation knowledge and contains two encoders, each consisting of two sublayers: one is multi-head self-attention (Multi-Head Attention), for which 6 heads are chosen; the other is a feed-forward neural network composed of a ReLU activation function and a linear function.
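The core computation that each of the 6 attention heads performs is scaled dot-product attention, softmax(QK^T/sqrt(d))V. The sketch below writes it out over plain Python lists with a toy 3-token, 2-dimensional sequence; the real layer works on BERT-sized vectors with learned Q/K/V projections, which are omitted here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Each output row is a weighted mixture of the value rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy 3-token sequence of 2-dim states; in self-attention Q, K and V are
# all (projections of) the same token states.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
print([[round(v, 3) for v in row] for row in y])
```

A multi-head version would run this routine on several learned projections of `x` and concatenate the results; the feed-forward sublayer then applies ReLU followed by a linear map to each position independently.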
Acquiring the training corpus: in the training stage of the general domain relation knowledge model, the corpus is obtained mainly through the following channels: 1) labeled data obtained by aligning plain text with a public KB (Knowledge Base), and 2) previously published manually annotated datasets. The overall corpus acquisition flow is shown in FIG. 3.
Labeled data are obtained by aligning plain text with a public KB. First, a public dataset is obtained from an open relation knowledge base such as Wikipedia or Freebase, and the data are preprocessed into KB triples, each comprising a relation name and an entity pair. Then, text data are collected with a web crawler, and the entities in the text are extracted with NLP preprocessing methods such as word segmentation and named entity recognition. Finally, entity pairs and relations are assigned to the corresponding sentences in the text data by remote supervision. The same entity pair and its corresponding sentences are packed into one bag, and the label of each bag is the relation type.
The originally proposed remote supervision method assumes that if a sentence contains the entity pair of a relation, that sentence expresses the relation. This inevitably introduces a large amount of noisy data, i.e., data with an incorrect relation label, into the obtained annotations. To reduce the noise, features are extracted for each bag and an unsupervised text clustering algorithm divides it into two classes; the positive examples are selected and kept as the training set. An attention mechanism can also be applied later to capture keywords and counteract the noisy data introduced by remote supervision.
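The two-class split can be sketched with a minimal 1-D k-means (k=2). The patent does not specify the bag features or the clustering algorithm, so this sketch reduces each sentence to a single illustrative similarity score and keeps the high-score cluster as the positive examples.

```python
def two_means(scores, iters=20):
    """Split a bag's per-sentence scores into two clusters with 1-D
    k-means (k=2), centroids initialized at the extremes. Returns the
    (low, high) groups and the final centroids."""
    c = [min(scores), max(scores)]
    for _ in range(iters):
        groups = ([], [])
        for s in scores:
            # index 1 (the "positive" group) when s is nearer the high centroid
            groups[abs(s - c[0]) > abs(s - c[1])].append(s)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return groups, c

# Illustrative similarity of each sentence in one bag to the bag's relation.
scores = [0.91, 0.88, 0.15, 0.95, 0.10]
(neg, pos), centroids = two_means(scores)
print(sorted(pos))  # the cluster kept as positive training examples
```

Here the three high-similarity sentences survive as the training set while the two noisy ones are discarded, mirroring step (15).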
Domain-specific relation extraction is performed on the basis of the general domain relation knowledge model. Starting from the pre-trained general domain relation knowledge model, only fine-tuning is needed: for example, by attaching a fully connected layer, relation training can be carried out with a small number of domain-specific samples, which avoids the problem of labeling large amounts of data while keeping network performance strong. The network is shown in FIG. 4.
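The attached fully connected layer amounts to a linear map over the encoder output followed by softmax over relation types. In the sketch below, the 4-dim "pooled encoder output", the weight matrix, and the relation inventory are all illustrative stand-ins; in the real model these are the BERT+Transformer output and the fine-tuned parameters.

```python
import math

def linear_head(h, W, b):
    """Step (31) sketch: linear layer over the relation knowledge model's
    pooled output, then softmax to get a distribution over relations."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

relations = ["no_relation", "geographic_location", "author_of"]
h = [0.2, -0.4, 0.9, 0.1]          # toy pooled encoder output
W = [[0.1, 0.0, -0.2, 0.3],        # one weight row per relation type
     [0.5, -0.1, 0.8, 0.0],
     [-0.3, 0.2, 0.1, -0.4]]
b = [0.0, 0.1, -0.1]

probs = linear_head(h, W, b)
print(relations[probs.index(max(probs))])  # prints geographic_location
```

During fine-tuning only this small head (and, optionally, the upper encoder layers) needs to be adjusted on the handful of domain-specific samples, which is why a small labeled set suffices.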
For example, the input sentence: the Lijiang ancient city is located in the Dazhen town of Lijiang city in Yunnan province, the middle part of Lijiang dam under Yulong snow mountain, Beiyi Xiangshan, Jinhong shan, West occipital lion mountain, and the southeast faces dozens of Litian woye.
1) determine the entity pair: Lijiang ancient city, Lijiang city;
2) determine the positions [0,4], [10,13] of the entity pair in the sentence;
3) convert the sentence information into the Input Embeddings of the general domain relation knowledge model and input them into the network;
4) output the final relation result: geographic location.

Claims (4)

1. A relation extraction method suitable for small samples, characterized by comprising the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
2. The relation extraction method suitable for small samples according to claim 1, wherein in step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
3. The relation extraction method suitable for small samples according to claim 1, wherein in step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
4. The relation extraction method suitable for small samples according to claim 1, wherein in step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
CN201911240521.3A 2019-12-06 2019-12-06 Relation extraction method suitable for small samples Pending CN111125370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911240521.3A CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911240521.3A CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Publications (1)

Publication Number Publication Date
CN111125370A true CN111125370A (en) 2020-05-08

Family

ID=70497632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911240521.3A Pending CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Country Status (1)

Country Link
CN (1) CN111125370A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668342A (en) * 2021-01-08 2021-04-16 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992629A (en) * 2019-02-28 2019-07-09 中国科学院计算技术研究所 A kind of neural network Relation extraction method and system of fusion entity type constraint
CN110263158A (en) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 A kind of processing method of data, device and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992629A (en) * 2019-02-28 2019-07-09 中国科学院计算技术研究所 A kind of neural network Relation extraction method and system of fusion entity type constraint
CN110263158A (en) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 A kind of processing method of data, device and equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource
CN112668342A (en) * 2021-01-08 2021-04-16 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN112668342B (en) * 2021-01-08 2024-05-07 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113326371B (en) * 2021-04-30 2023-12-29 南京大学 Event extraction method integrating pre-training language model and anti-noise interference remote supervision information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
CN113807518B (en) * 2021-08-16 2024-04-05 中央财经大学 Relation extraction system based on remote supervision

Similar Documents

Publication Publication Date Title
CN111144131B (en) Network rumor detection method based on pre-training language model
CN111125370A (en) Relation extraction method suitable for small samples
WO2018218705A1 (en) Method for recognizing network text named entity based on neural network probability disambiguation
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN113420296B (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN109918635A (en) A kind of contract text risk checking method, device, equipment and storage medium
CN111143563A (en) Text classification method based on integration of BERT, LSTM and CNN
CN106919557A (en) A kind of document vector generation method of combination topic model
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN110188359B (en) Text entity extraction method
CN109871449A (en) A kind of zero sample learning method end to end based on semantic description
CN114444481B (en) Sentiment analysis and generation method of news comment
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN112307130A (en) Document-level remote supervision relation extraction method and system
CN111967267A (en) XLNET-based news text region extraction method and system
CN115510180A (en) Multi-field-oriented complex event element extraction method
CN110287326A (en) A kind of enterprise's sentiment analysis method with background description
CN112131879A (en) Relationship extraction system, method and device
CN114298041A (en) Network security named entity identification method and identification device
CN112949674A (en) Multi-model fused corpus generation method and device
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN109101499B (en) Artificial intelligence voice learning method based on neural network
CN112926311B (en) Unsupervised aspect word extraction method combining sequence and topic information
CN109062911B (en) Artificial intelligent voice modeling method
CN113255330B (en) Chinese spelling checking method based on character feature classifier and soft output

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination