CN111125370A - Relation extraction method suitable for small samples - Google Patents
Relation extraction method suitable for small samples
- Publication number
- CN111125370A CN111125370A CN201911240521.3A CN201911240521A CN111125370A CN 111125370 A CN111125370 A CN 111125370A CN 201911240521 A CN201911240521 A CN 201911240521A CN 111125370 A CN111125370 A CN 111125370A
- Authority
- CN
- China
- Prior art keywords
- training
- data
- relation
- model
- knowledge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Animal Behavior & Ethology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a relation extraction method suitable for small samples, which comprises the following steps: (1) acquiring training data; (2) training a general-domain relation knowledge model; (3) training a domain-specific relation extraction model. A general-domain relation knowledge module is used to learn the common knowledge shared by the various relation types; training samples are generated automatically by remote supervision against an open-source knowledge graph, and noisy samples are filtered out by unsupervised denoising, so the amount of manually labeled data is greatly reduced. Because generating the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale manual annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.
Description
Technical Field
The invention relates to the technical field of knowledge graph construction, in particular to a relation extraction method suitable for small samples.
Background
Information extraction is an important component of natural language processing; in today's information society it is especially significant to extract useful information from massive data. Information extraction can be divided into entity extraction, relation extraction, event extraction, and so on. Relation extraction judges, on the basis of an extracted entity pair, whether a particular relation holds between the two entities, with "no relation" treated as a special relation type.
As relation extraction shifts from predefined relation types to the varied relation types of the open domain, the data source shifts from standard corpora to massive web data. Traditional pattern-matching methods cannot cover the many surface forms of a large number of relation types, and defining all patterns by hand is infeasible. Machine-learning and deep-learning methods based on supervised learning achieve high accuracy but require large-scale manually labeled corpora; the labeling cost is prohibitive, so training data remains extremely scarce. Research has therefore turned to maintaining model performance while reducing the amount of manually labeled data.
In 2009, Mintz et al. proposed remote supervision. Its basic idea is to rely on an existing knowledge base: sentences containing an entity pair from the knowledge base are collected from text as the training corpus, under the assumption that if a relation holds between an entity pair in the knowledge base, then every sentence containing that pair expresses the relation. Although this approach greatly reduces the burden of manual labeling, remote supervision also brings a negative side: the labeled data contains a large amount of noise, and this is precisely the problem remote supervision must solve. In 2018, Google released the powerful BERT model, which performs excellently on downstream tasks; many tasks can be handled simply by attaching a network layer after BERT and fine-tuning. However, for relation extraction in a specific domain with many relation types, even though the BERT model supplies basic linguistic knowledge that improves extraction results, the dependence on labeled data remains unsolved.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a relation extraction method suitable for small samples, which avoids the time and cost of large-scale manual labeling and completes a domain-specific relation extraction task with only a small amount of labeled data from that domain.
In order to solve the above technical problem, the present invention provides a relation extraction method suitable for small samples, comprising the following steps:
(1) acquiring training data;
(2) training a general-domain relation knowledge model;
(3) training a domain-specific relation extraction model.
Preferably, in step (1), the acquisition of training data specifically comprises: the training data comes from two sources, public relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (each comprising a relation name and an entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data by NLP named entity recognition;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing sentences that share the same entity pair into one bag;
(15) splitting the data in each bag into positive and negative examples by unsupervised text clustering, and keeping the positive examples as the training set.
Preferably, in step (2), the training of the general-domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: obtaining the position of each entity pair within its sentence, and feeding the position information together with the word information into the model as input;
(22) setting the network training parameters, which comprise: batch size, initial learning rate, optimizer, anti-overfitting strategy, number of epochs, and maximum sentence length.
Preferably, in step (3), the training of the domain-specific relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general-domain relation knowledge model, with a linear layer attached to the output layer of the relation knowledge model;
(32) using the same input data format and training procedure as for the general-domain relation knowledge model;
(33) setting the network training parameters, which comprise: batch size, initial learning rate, optimizer, anti-overfitting strategy, number of epochs, and maximum sentence length.
The invention has the beneficial effects that: a general-domain relation knowledge module is used to learn the common knowledge shared by the various relation types; training samples are generated automatically by remote supervision against an open-source knowledge graph, and noisy samples are filtered out with unsupervised denoising, so that the general and domain-specific relation knowledge models can be trained with little manual labeling; because generating the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale manual annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.
Drawings
FIG. 1 is a diagram of a generic domain knowledge model structure according to the present invention.
FIG. 2 is a schematic diagram of the structure of the Transformer mechanism layer of the present invention.
FIG. 3 is a schematic diagram of the training corpus acquisition flow of the present invention.
FIG. 4 is a diagram illustrating relationship extraction in accordance with the specific embodiment of the present invention.
Detailed Description
A relation extraction method suitable for small samples comprises the following steps:
(1) acquiring training data;
(2) training a general-domain relation knowledge model;
(3) training a domain-specific relation extraction model.
In step (1), the acquisition of training data specifically comprises: the training data comes from two sources, public relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (each comprising a relation name and an entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data by NLP named entity recognition;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing sentences that share the same entity pair into one bag;
(15) splitting the data in each bag into positive and negative examples by unsupervised text clustering, and keeping the positive examples as the training set.
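The weakly supervised pipeline of steps (11) to (15) can be sketched as follows. This is a minimal illustration only: toy in-memory data stands in for Freebase and the crawled text, and the bag construction follows the description above (one bag per entity pair, labeled with the KB relation).

```python
from collections import defaultdict

# Toy KB triples standing in for Freebase: (head, relation, tail).
kb = [
    ("Lijiang Old Town", "located_in", "Lijiang City"),
    ("Paris", "capital_of", "France"),
]

# Toy crawled sentences with pre-extracted entity pairs (step 13).
sentences = [
    ("Lijiang Old Town lies in Lijiang City, Yunnan.", ("Lijiang Old Town", "Lijiang City")),
    ("Paris is the capital of France.", ("Paris", "France")),
    ("Paris hosted a match against France.", ("Paris", "France")),
]

def build_bags(kb, sentences):
    """Remote supervision (step 14): every sentence containing a KB entity
    pair is assumed to express that pair's KB relation; sentences sharing
    an entity pair go into one bag labeled with that relation."""
    relation_of = {(h, t): r for h, r, t in kb}
    bags = defaultdict(list)
    for text, pair in sentences:
        if pair in relation_of:
            bags[(pair, relation_of[pair])].append(text)
    return dict(bags)

bags = build_bags(kb, sentences)
for (pair, rel), texts in bags.items():
    print(pair, rel, len(texts))
```

Note that the third toy sentence is wrongly labeled `capital_of`, which is exactly the kind of noise that the clustering in step (15) is meant to remove.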
In step (2), the training of the general-domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: obtaining the position of each entity pair within its sentence, and feeding the position information together with the word information into the model as input;
(22) setting the network training parameters: batch size 64, initial learning rate 1e-4, optimizer BertAdam, anti-overfitting strategy L2 regularization, 100 epochs, and maximum sentence length 200.
In step (3), the training of the domain-specific relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general-domain relation knowledge model, with a linear layer attached to the output layer of the relation knowledge model;
(32) using the same input data format and training procedure as for the general-domain relation knowledge model;
(33) setting the network training parameters: batch size 32, initial learning rate 1e-5, optimizer BertAdam, anti-overfitting strategy L2 regularization, 30 epochs, and maximum sentence length 200.
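The two parameter sets of steps (22) and (33) can be captured in a small configuration sketch. The dictionary keys are illustrative names, not identifiers from the patent; BertAdam is assumed to refer to the Adam variant shipped with early BERT toolkits.

```python
# Hyperparameters for the general-domain training stage, as listed in step (22).
general_config = {
    "batch_size": 64,
    "learning_rate": 1e-4,
    "optimizer": "BertAdam",
    "regularization": "L2",
    "epochs": 100,
    "max_seq_length": 200,
}

# The domain-specific fine-tuning stage (step (33)) reuses the same schema
# with smaller batch, learning rate, and epoch count, reflecting the
# small-sample setting.
finetune_config = {**general_config,
                   "batch_size": 32, "learning_rate": 1e-5, "epochs": 30}

print(finetune_config["epochs"])  # prints 30
```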
The structure of the general-domain relation knowledge model is shown in FIG. 1. A sentence requiring relation extraction is converted into the BERT input format (word embedding + position embedding + segment embedding) and fed into the BERT model as Input Embeddings; the output of BERT is then fed as input into the Transformer mechanism layer for training. The knowledge learned by the Transformer mechanism layer is the general-domain relation knowledge, and the BERT model together with the Transformer mechanism layer forms the general-domain relation knowledge model.
The structure of the Transformer mechanism layer is shown in FIG. 2. This layer uses the Transformer mechanism to learn relation knowledge and contains two encoders, each consisting of two sublayers: the first is Multi-Head Attention, i.e. multi-headed self-attention, with 6 heads used here; the second is a feed-forward neural network composed of a ReLU activation function and a linear transformation.
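The encoder computation just described can be illustrated with a minimal pure-Python sketch: single-head scaled dot-product attention followed by a ReLU feed-forward step. The patent's layer uses 6 heads plus learned projection weights, residual connections, and layer normalization, all omitted here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, k, v):
    """Scaled dot-product attention for one head.
    q, k, v: lists of equal-dimension vectors (lists of floats)."""
    d = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        # Each output row is a convex combination of the value vectors.
        out.append([sum(wi * vj[t] for wi, vj in zip(w, v)) for t in range(len(v[0]))])
    return out

def relu(x):
    return [max(0.0, t) for t in x]

# One encoder pass over a toy 3-token sequence: self-attention (q = k = v),
# then a feed-forward step with ReLU (projection weights assumed identity).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(seq, seq, seq)
ffn_out = [relu(x) for x in attended]
print(ffn_out[0])
```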
Acquiring the training corpus: in the training stage of the general-domain relation knowledge model, the corpus is obtained mainly in two ways: 1) obtaining labeled data by aligning a public KB (Knowledge Base) with plain text, and 2) using existing publicly available manually annotated datasets. The overall corpus acquisition process is shown in FIG. 3.
Obtaining labeled data by aligning an open KB with plain text proceeds as follows. First, public datasets are obtained from public relation knowledge bases such as Wikipedia and Freebase, and the data are preprocessed into KB triples, each comprising a relation name and an entity pair. Then text data are collected by a web crawler, and the entities in the text are extracted with NLP preprocessing methods such as word segmentation and named entity recognition. Finally, entity pairs and their relations are assigned to the corresponding sentences in the text by remote supervision. Sentences sharing the same entity pair are packed into one bag, and the label of each bag is the relation type.
The originally proposed remote supervision method assumes that if a sentence contains an entity pair involved in a relation, then that sentence describes the relation. This inevitably introduces a large amount of noise into the resulting labeled data, that is, data whose relation label is wrong. To reduce the noise, features are extracted for each bag and an unsupervised text clustering algorithm divides the bag into two classes; the positive examples are selected and retained as the training set. An attention mechanism can additionally be used to capture keywords and counter the noisy data introduced by remote supervision.
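The bag-denoising step can be sketched with a toy one-dimensional 2-means split. In the actual method the features would be sentence representations rather than the made-up scalar scores used here, and the rule that the higher-scoring cluster holds the positive examples is an assumption of this sketch.

```python
def two_means(xs, iters=20):
    """Minimal 1-D 2-means: split the scores of one bag into two clusters."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return g0, g1

# Hypothetical per-sentence scores for one bag, e.g. similarity of each
# sentence to the bag's average representation.
scores = [0.91, 0.88, 0.15, 0.22, 0.86]
scores = [0.91, 0.88, 0.15, 0.86, 0.22]
low, high = sorted(two_means(scores), key=lambda g: sum(g) / len(g))
positives = high  # kept for the training set (step (15))
print(positives)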
Domain-specific relation extraction is performed on the basis of the general-domain relation knowledge model. Starting from the pre-trained general-domain relation knowledge model, only fine-tuning is required: for example, by attaching a fully connected layer, relation training can be carried out with a small number of domain-specific samples, which avoids labeling large amounts of data while retaining strong network performance. The specific network diagram is shown in FIG. 4.
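The attached fully connected layer amounts to computing logits = W h + b over the knowledge model's pooled output and taking an argmax over relation classes. The weights, feature vector, and relation inventory below are made-up numbers purely for illustration; in the method they come from fine-tuning on a few domain-specific samples.

```python
# Hypothetical relation inventory for a small domain.
RELATIONS = ["no_relation", "geographic_location", "capital_of"]

def linear_head(h, W, b):
    """Linear classification head: one logit per relation class."""
    return [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_j
            for row, b_j in zip(W, b)]

h = [0.2, -0.5, 0.9]      # pooled sentence representation (made up)
W = [[0.1, 0.0, 0.0],     # one weight row per relation class (made up)
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
b = [0.0, 0.1, 0.0]

logits = linear_head(h, W, b)
pred = RELATIONS[max(range(len(logits)), key=logits.__getitem__)]
print(pred)  # prints geographic_location for these toy numbers
```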
For example, the input sentence: Lijiang Old Town is located in Dazhen Town, Lijiang City, Yunnan Province, in the middle of the Lijiang basin beneath Yulong Snow Mountain; it borders Xiangshan and Jinhong Mountain to the north, rests against Lion Mountain to the west, and faces dozens of li of fertile open fields to the southeast.
1) Determine the entity pair: Lijiang Old Town, Lijiang City;
2) determine the positions of the entity pair in the sentence: [0,4] and [10,13];
3) convert the sentence information into the Input Embeddings of the general-domain relation knowledge model and feed them into the network;
4) finally output the relation result: geographic location.
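Steps 1) to 3) above can be sketched as locating the entity pair's character spans before building the model input. The sentence below is a shortened English stand-in for the example, and the [start, end) indexing convention is an assumption of this sketch (the patent's [0,4] and [10,13] spans refer to the original Chinese sentence and may use a different convention).

```python
def entity_spans(sentence, head, tail):
    """Locate the character spans of an entity pair in a sentence.
    Returns [start, end) index pairs; these spans are what gets turned
    into the position information fed to the model."""
    hs = sentence.find(head)
    ts = sentence.find(tail)
    return [hs, hs + len(head)], [ts, ts + len(tail)]

s = "Lijiang Old Town is located in Lijiang City"
head_span, tail_span = entity_spans(s, "Lijiang Old Town", "Lijiang City")
print(head_span, tail_span)
```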
Claims (4)
1. A relation extraction method suitable for small samples, characterized by comprising the following steps:
(1) acquiring training data;
(2) training a general-domain relation knowledge model;
(3) training a domain-specific relation extraction model.
2. The relation extraction method suitable for small samples according to claim 1, characterized in that in step (1), the acquisition of training data specifically comprises: the training data comes from two sources, public relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data by NLP named entity recognition;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing sentences that share the same entity pair into one bag;
(15) splitting the data in each bag into positive and negative examples by unsupervised text clustering, and keeping the positive examples as the training set.
3. The relation extraction method suitable for small samples according to claim 1, characterized in that in step (2), the training of the general-domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: obtaining the position of each entity pair within its sentence, and feeding the position information together with the word information into the model as input;
(22) setting the network training parameters, which comprise: batch size, initial learning rate, optimizer, anti-overfitting strategy, number of epochs, and maximum sentence length.
4. The relation extraction method suitable for small samples according to claim 1, characterized in that in step (3), the training of the domain-specific relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general-domain relation knowledge model, with a linear layer attached to the output layer of the relation knowledge model;
(32) using the same input data format and training procedure as for the general-domain relation knowledge model;
(33) setting the network training parameters, which comprise: batch size, initial learning rate, optimizer, anti-overfitting strategy, number of epochs, and maximum sentence length.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911240521.3A CN111125370A (en) | 2019-12-06 | 2019-12-06 | Relation extraction method suitable for small samples |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911240521.3A CN111125370A (en) | 2019-12-06 | 2019-12-06 | Relation extraction method suitable for small samples |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111125370A true CN111125370A (en) | 2020-05-08 |
Family
ID=70497632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911240521.3A Pending CN111125370A (en) | 2019-12-06 | 2019-12-06 | Relation extraction method suitable for small samples |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111125370A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112668342A (en) * | 2021-01-08 | 2021-04-16 | 中国科学院自动化研究所 | Remote supervision relation extraction noise reduction system based on twin network |
CN113326371A (en) * | 2021-04-30 | 2021-08-31 | 南京大学 | Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information |
CN113807518A (en) * | 2021-08-16 | 2021-12-17 | 中央财经大学 | Relationship extraction system based on remote supervision |
WO2022036616A1 (en) * | 2020-08-20 | 2022-02-24 | 中山大学 | Method and apparatus for generating inferential question on basis of low labeled resource |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992629A (en) * | 2019-02-28 | 2019-07-09 | 中国科学院计算技术研究所 | A kind of neural network Relation extraction method and system of fusion entity type constraint |
CN110263158A (en) * | 2019-05-24 | 2019-09-20 | 阿里巴巴集团控股有限公司 | A kind of processing method of data, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||