CN111125370A - Relation extraction method suitable for small samples - Google Patents


Info

Publication number
CN111125370A
CN111125370A (application CN201911240521.3A)
Authority
CN
China
Prior art keywords
training
data
relation
model
knowledge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911240521.3A
Other languages
Chinese (zh)
Inventor
卓可秋
杨秀燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING SINOVATIO TECHNOLOGY CO LTD
Original Assignee
NANJING SINOVATIO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING SINOVATIO TECHNOLOGY CO LTD filed Critical NANJING SINOVATIO TECHNOLOGY CO LTD
Priority to CN201911240521.3A priority Critical patent/CN111125370A/en
Publication of CN111125370A publication Critical patent/CN111125370A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a relation extraction method suitable for small samples, which comprises the following steps: (1) acquiring training data; (2) training a general domain relation knowledge model; (3) training a specific domain relation extraction model. A general domain relation knowledge module is used to learn the knowledge common to many relation types; training samples are generated automatically by remote supervision over an open-source knowledge graph, and unsupervised methods are combined to filter out noisy data, so that general and domain-specific relation knowledge models can be trained while the amount of manually labeled data is reduced. Because building the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.

Description

Relation extraction method suitable for small samples
Technical Field
The invention relates to the technical field of knowledge graph construction, in particular to a relation extraction method suitable for small samples.
Background
Information extraction is an important component of natural language processing; in today's information society, extracting useful information from massive data is especially significant. Information extraction can be divided into entity extraction, relation extraction, event extraction, and so on. Relation extraction judges, on the basis of an extracted entity pair, whether a given relation holds between the two entities, with "no relation" treated as a special relation.
As relation extraction shifts from a fixed set of relation types to the diverse relation types of the open domain, the data source shifts from standard corpora to massive web data. Traditional pattern-matching methods cannot cover the many surface forms of a large number of relation types, and it is impractical to define every pattern by hand. Supervised machine learning and deep learning methods achieve high accuracy, but they require large-scale manually labeled corpora, and the labeling cost is so high that training data are extremely scarce. Research has therefore turned to preserving model performance while reducing the amount of manually labeled data.
In 2009, Mintz et al proposed a remote supervision method, and the basic idea was to rely on an existing knowledge base, obtain a text containing an entity pair in the knowledge base from the text as a training corpus, and Mintz proposed an assumption that if a certain relationship of a certain entity pair exists in the knowledge base, all data containing the pair of entities express the relationship. Although this approach can greatly reduce the trouble of manual labeling, remote surveillance also poses a negative problem in that tag data has a large amount of noisy data. This is exactly the problem that remote supervision needs to solve. In 2018, Google provides a strong Bert model, the performance is excellent when processing downstream tasks, and other corresponding tasks can be processed only by connecting a network layer behind the Bert model for fine adjustment. However, for various relation types, in the process of relation extraction in a specific field, even if basic language knowledge can be learned by using a Bert model to improve the extraction result, the problem that the data which needs to depend on marking cannot be solved.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a relation extraction method suitable for small samples, which avoids the time and cost of large-scale manual labeling and allows a domain-specific relation extraction task to be completed with a small amount of labeled data from that domain.
In order to solve the above technical problem, the present invention provides a relation extraction method suitable for small samples, comprising the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
Preferably, in step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (relation name and entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
Preferably, in step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
Preferably, in step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
The beneficial effects of the invention are: a general domain relation knowledge module is used to learn the knowledge common to many relation types; training samples are generated automatically by remote supervision over an open-source knowledge graph, and unsupervised methods are combined to filter out noisy data, so that general and domain-specific relation knowledge models can be trained while the amount of manually labeled data is reduced; because building the relation knowledge model requires no large body of manually labeled data, the time and cost of large-scale annotation are avoided, and a domain-specific relation extraction task can be completed with only a small amount of labeled data from that domain.
Drawings
FIG. 1 is a structural diagram of the general domain relation knowledge model of the present invention.
FIG. 2 is a schematic diagram of the structure of the Transformer layer of the present invention.
FIG. 3 is a schematic diagram of the training corpus acquisition flow of the present invention.
FIG. 4 is a diagram illustrating relationship extraction in accordance with the specific embodiment of the present invention.
Detailed Description
A relation extraction method suitable for small samples comprises the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
In step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data (relation name and entity pair) from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
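Steps (12)–(15) above can be sketched in a few lines of Python. Everything below — the toy triples, the toy sentences, and the helper `build_bags` — is an illustrative stand-in for the patent's Freebase-derived data, not part of the disclosure; NER output is assumed to be already available.

```python
from collections import defaultdict

# Step (12): triples from a public KB -- (head entity, relation, tail entity).
# These triples are invented for illustration.
kb_triples = [
    ("Lijiang ancient city", "geographic_location", "Lijiang city"),
    ("Mintz", "proposed", "remote supervision"),
]

# Step (13): sentences with their recognized entity pairs (NER output assumed).
sentences = [
    ("Lijiang ancient city is located in Lijiang city.",
     ("Lijiang ancient city", "Lijiang city")),
    ("Lijiang ancient city lies northwest of the Lijiang city center.",
     ("Lijiang ancient city", "Lijiang city")),
    ("Paris is the capital of France.", ("Paris", "France")),
]

def build_bags(kb_triples, sentences):
    """Step (14): assign a relation to every sentence whose entity pair
    appears in the KB, and place sentences sharing a pair in one bag."""
    relation_of = {(h, t): r for h, r, t in kb_triples}
    bags = defaultdict(list)
    for text, pair in sentences:
        if pair in relation_of:  # the remote supervision assumption
            bags[(pair, relation_of[pair])].append(text)
    return dict(bags)

bags = build_bags(kb_triples, sentences)
for (pair, relation), texts in bags.items():
    print(pair, relation, len(texts))
```

The (Paris, France) pair has no KB triple, so it is dropped; the two Lijiang sentences end up in a single bag labeled with the relation, which is exactly the unit that the later denoising step (15) operates on.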
In step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) network training parameter settings: Batchsize 64, initial learning rate 1e-4, optimizer BertAdam, anti-overfitting strategy L2 regularization, Epoch 100, maximum sentence length 200.
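The preprocessing of step (21) can be illustrated as follows. The tokenization, the [start, end) span format, and the helper `entity_positions` are assumptions made for this sketch; the patent does not fix the exact representation.

```python
def entity_positions(tokens, head, tail):
    """Step (21) sketch: locate the head and tail entities in a tokenized
    sentence and return [start, end) index pairs for each."""
    def find(span):
        n = len(span)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == span:
                return [i, i + n]
        raise ValueError(f"entity {span} not found in sentence")
    return find(head), find(tail)

tokens = ["Lijiang", "ancient", "city", "is", "located", "in",
          "Lijiang", "city"]
head_pos, tail_pos = entity_positions(
    tokens, ["Lijiang", "ancient", "city"], ["Lijiang", "city"])

# Position information plus word information is what the model receives.
model_input = {"tokens": tokens, "head": head_pos, "tail": tail_pos}
print(model_input["head"], model_input["tail"])
```

Note that the search scans left to right, so the tail span ["Lijiang", "city"] resolves to the second occurrence at index 6 rather than overlapping the head entity.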
In step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) network training parameter settings: Batchsize 32, initial learning rate 1e-5, optimizer BertAdam, anti-overfitting strategy L2 regularization, Epoch 30, maximum sentence length 200.
The structure of the general domain relation knowledge model is shown in FIG. 1. A sentence to be processed for relation extraction is converted into the input format of the BERT model (word embedding + position embedding + segment embedding) and passed in as the Input Embedding; the output of the BERT model is then fed into a Transformer layer as input for training. The knowledge learned by the Transformer layer is the general domain relation knowledge. The BERT model and the Transformer layer together form the general domain relation knowledge model.
The structure of the Transformer layer is shown in FIG. 2. This layer uses the Transformer mechanism to learn relation knowledge and contains two encoders, each consisting of two sublayers: one is multi-head self-attention (Multi-Head Attention), for which 6 heads are chosen; the other is a feed-forward neural network composed of a ReLU activation function and a linear function.
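The core computation that each of the 6 attention heads performs is scaled dot-product attention, softmax(QK^T/sqrt(d))V. The sketch below writes it out over plain Python lists with a toy 3-token, 2-dimensional sequence; the real layer works on BERT-sized vectors with learned Q/K/V projections, which are omitted here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Each output row is a weighted mixture of the value rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy 3-token sequence of 2-dim states; in self-attention Q, K and V are
# all (projections of) the same token states.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
print([[round(v, 3) for v in row] for row in y])
```

A multi-head version would run this routine on several learned projections of `x` and concatenate the results; the feed-forward sublayer then applies ReLU followed by a linear map to each position independently.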
Acquiring the training corpus: in the training stage of the general domain relation knowledge model, the corpus is obtained mainly through the following channels: 1) labeled data obtained by aligning plain text with a public KB (Knowledge Base), and 2) previously published manually annotated datasets. The overall corpus acquisition flow is shown in FIG. 3.
Labeled data are obtained by aligning plain text with a public KB. First, a public dataset is obtained from an open relation knowledge base such as Wikipedia or Freebase, and the data are preprocessed into KB triples, each comprising a relation name and an entity pair. Then, text data are collected with a web crawler, and the entities in the text are extracted with NLP preprocessing methods such as word segmentation and named entity recognition. Finally, entity pairs and relations are assigned to the corresponding sentences in the text data by remote supervision. The same entity pair and its corresponding sentences are packed into one bag, and the label of each bag is the relation type.
The originally proposed remote supervision method assumes that if a sentence contains the entity pair of a relation, that sentence expresses the relation. This inevitably introduces a large amount of noisy data, i.e., data with an incorrect relation label, into the obtained annotations. To reduce the noise, features are extracted for each bag and an unsupervised text clustering algorithm divides it into two classes; the positive examples are selected and kept as the training set. An attention mechanism can also be applied later to capture keywords and counteract the noisy data introduced by remote supervision.
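The two-class split can be sketched with a minimal 1-D k-means (k=2). The patent does not specify the bag features or the clustering algorithm, so this sketch reduces each sentence to a single illustrative similarity score and keeps the high-score cluster as the positive examples.

```python
def two_means(scores, iters=20):
    """Split a bag's per-sentence scores into two clusters with 1-D
    k-means (k=2), centroids initialized at the extremes. Returns the
    (low, high) groups and the final centroids."""
    c = [min(scores), max(scores)]
    for _ in range(iters):
        groups = ([], [])
        for s in scores:
            # index 1 (the "positive" group) when s is nearer the high centroid
            groups[abs(s - c[0]) > abs(s - c[1])].append(s)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return groups, c

# Illustrative similarity of each sentence in one bag to the bag's relation.
scores = [0.91, 0.88, 0.15, 0.95, 0.10]
(neg, pos), centroids = two_means(scores)
print(sorted(pos))  # the cluster kept as positive training examples
```

Here the three high-similarity sentences survive as the training set while the two noisy ones are discarded, mirroring step (15).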
Domain-specific relation extraction is performed on the basis of the general domain relation knowledge model. Starting from the pre-trained general domain relation knowledge model, only fine-tuning is needed: for example, by attaching a fully connected layer, relation training can be carried out with a small number of domain-specific samples, which avoids the problem of labeling large amounts of data while keeping network performance strong. The network is shown in FIG. 4.
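The attached fully connected layer amounts to a linear map over the encoder output followed by softmax over relation types. In the sketch below, the 4-dim "pooled encoder output", the weight matrix, and the relation inventory are all illustrative stand-ins; in the real model these are the BERT+Transformer output and the fine-tuned parameters.

```python
import math

def linear_head(h, W, b):
    """Step (31) sketch: linear layer over the relation knowledge model's
    pooled output, then softmax to get a distribution over relations."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

relations = ["no_relation", "geographic_location", "author_of"]
h = [0.2, -0.4, 0.9, 0.1]          # toy pooled encoder output
W = [[0.1, 0.0, -0.2, 0.3],        # one weight row per relation type
     [0.5, -0.1, 0.8, 0.0],
     [-0.3, 0.2, 0.1, -0.4]]
b = [0.0, 0.1, -0.1]

probs = linear_head(h, W, b)
print(relations[probs.index(max(probs))])  # prints geographic_location
```

During fine-tuning only this small head (and, optionally, the upper encoder layers) needs to be adjusted on the handful of domain-specific samples, which is why a small labeled set suffices.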
For example, the input sentence: the Lijiang ancient city is located in the Dazhen town of Lijiang city in Yunnan province, the middle part of Lijiang dam under Yulong snow mountain, Beiyi Xiangshan, Jinhong shan, West occipital lion mountain, and the southeast faces dozens of Litian woye.
1) determine the entity pair: Lijiang ancient city, Lijiang city;
2) determine the positions [0,4], [10,13] of the entity pair in the sentence;
3) convert the sentence information into the Input Embeddings of the general domain relation knowledge model and input them into the network;
4) output the final relation result: geographic location.

Claims (4)

1. A relation extraction method suitable for small samples, characterized by comprising the following steps:
(1) acquiring training data;
(2) training a general domain relation knowledge model;
(3) training a specific domain relation extraction model.
2. The relation extraction method suitable for small samples according to claim 1, wherein in step (1), acquiring the training data specifically comprises: the training data come from two sources, publicly available relation-labeled data and training data generated by weak supervision;
(11) collecting unformatted text data from the Internet with a crawler tool;
(12) acquiring triple data from the public dataset Freebase;
(13) obtaining entity pairs and their corresponding sentences from the text data with an NLP named entity recognition method;
(14) assigning a relation to each entity pair and its corresponding sentences by remote supervision, and placing the same entity pair and its sentences into one bag;
(15) classifying the data in each bag into positive and negative examples with an unsupervised text clustering method, and keeping the positive examples as the training set.
3. The relation extraction method suitable for small samples according to claim 1, wherein in step (2), training the general domain relation knowledge model specifically comprises the following steps:
(21) preprocessing the training set data: acquiring the position information of the entity pair in each sentence, and feeding the position information and word information into the model as input;
(22) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
4. The relation extraction method suitable for small samples according to claim 1, wherein in step (3), training the specific domain relation extraction model specifically comprises the following steps:
(31) performing fine-tuning on the basis of the pre-trained general domain relation knowledge model, connecting a linear layer to the output layer of the relation knowledge model;
(32) the input data format and the training procedure are the same as for the general domain relation knowledge model;
(33) setting the network training parameters, including: batch size (Batchsize), initial learning rate, optimizer, anti-overfitting strategy, number of iterations (Epoch), and maximum sentence length.
CN201911240521.3A 2019-12-06 2019-12-06 Relation extraction method suitable for small samples Pending CN111125370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911240521.3A CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911240521.3A CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Publications (1)

Publication Number Publication Date
CN111125370A true CN111125370A (en) 2020-05-08

Family

ID=70497632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911240521.3A Pending CN111125370A (en) 2019-12-06 2019-12-06 Relation extraction method suitable for small samples

Country Status (1)

Country Link
CN (1) CN111125370A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668342A (en) * 2021-01-08 2021-04-16 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992629A (en) * 2019-02-28 2019-07-09 中国科学院计算技术研究所 A kind of neural network Relation extraction method and system of fusion entity type constraint
CN110263158A (en) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 A kind of processing method of data, device and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992629A (en) * 2019-02-28 2019-07-09 中国科学院计算技术研究所 A kind of neural network Relation extraction method and system of fusion entity type constraint
CN110263158A (en) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 A kind of processing method of data, device and equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022036616A1 (en) * 2020-08-20 2022-02-24 中山大学 Method and apparatus for generating inferential question on basis of low labeled resource
CN112668342A (en) * 2021-01-08 2021-04-16 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN112668342B (en) * 2021-01-08 2024-05-07 中国科学院自动化研究所 Remote supervision relation extraction noise reduction system based on twin network
CN113326371A (en) * 2021-04-30 2021-08-31 南京大学 Event extraction method fusing pre-training language model and anti-noise interference remote monitoring information
CN113326371B (en) * 2021-04-30 2023-12-29 南京大学 Event extraction method integrating pre-training language model and anti-noise interference remote supervision information
CN113807518A (en) * 2021-08-16 2021-12-17 中央财经大学 Relationship extraction system based on remote supervision
CN113807518B (en) * 2021-08-16 2024-04-05 中央财经大学 Relation extraction system based on remote supervision

Similar Documents

Publication Publication Date Title
CN111144131B (en) Network rumor detection method based on pre-training language model
CN111125370A (en) Relation extraction method suitable for small samples
WO2018218705A1 (en) Method for recognizing network text named entity based on neural network probability disambiguation
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN113420296B (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN109918635A (en) A kind of contract text risk checking method, device, equipment and storage medium
CN111143563A (en) Text classification method based on integration of BERT, LSTM and CNN
CN106919557A (en) A kind of document vector generation method of combination topic model
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN110188359B (en) Text entity extraction method
CN109871449A (en) A kind of zero sample learning method end to end based on semantic description
CN114444481B (en) Sentiment analysis and generation method of news comment
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN112307130A (en) Document-level remote supervision relation extraction method and system
CN111967267A (en) XLNET-based news text region extraction method and system
CN115510180A (en) Multi-field-oriented complex event element extraction method
CN110287326A (en) A kind of enterprise's sentiment analysis method with background description
CN112131879A (en) Relationship extraction system, method and device
CN114298041A (en) Network security named entity identification method and identification device
CN112949674A (en) Multi-model fused corpus generation method and device
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN109101499B (en) Artificial intelligence voice learning method based on neural network
CN112926311B (en) Unsupervised aspect word extraction method combining sequence and topic information
CN109062911B (en) Artificial intelligent voice modeling method
CN113255330B (en) Chinese spelling checking method based on character feature classifier and soft output

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination