CN113822330A - Noise reduction device and method for classification data sets based on natural language inference - Google Patents

Noise reduction device and method for classification data sets based on natural language inference

Info

Publication number
CN113822330A
Authority
CN
China
Legal status
Pending
Application number
CN202110918801.6A
Other languages
Chinese (zh)
Inventor
Xu Bo (徐波)
Zhao Xiangsan (赵象三)
Song Hui (宋晖)
Current Assignee
Donghua University
Original Assignee
Donghua University
Priority date
2021-08-11
Filing date
2021-08-11
Publication date
2021-12-21
Application filed by Donghua University
Priority to CN202110918801.6A
Publication of CN113822330A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 - Classification techniques
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models
    • G06N5/041 - Abduction

Abstract

The invention discloses a noise reduction device and method for classification data sets based on natural language inference. The data format conversion module of the invention constructs a template for each relation type in the relation classification data set, converts each triple in the data set into a hypothesis for natural language inference, and converts the corresponding text corpus into a premise for natural language inference. If high-quality labeled data can be split off from the original data set, that subset is used directly as a training set to train the natural language inference model with supervised learning; if the current data set has a high noise ratio and manual labeling is costly, the denoising performance of the current model on a validation set is used as feedback to train the parameters of the natural language inference model with reinforcement learning. The data set noise reduction module then scores the distantly supervised relation classification data set with the trained natural language inference model and, according to the scores, selects the high-confidence data as the denoised data set.

Description

Noise reduction device and method for classification data sets based on natural language inference
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a device and method for denoising distantly supervised relation classification data sets based on natural language inference.
Background
The task of relation classification is to predict the semantic relation between two entities in a given text. Knowledge of entity relations is essential to many downstream applications, such as knowledge graph completion and question answering. Relation classification typically relies on large-scale manually annotated data, which is expensive and time-consuming to produce. To address this problem, distant supervision is often used to annotate large volumes of corpora automatically. Distant supervision rests on the assumption that if a sentence contains an entity pair that appears in a knowledge base, the sentence can be taken to express the same relation that the entity pair has in the knowledge base. Although this method can produce large-scale labeled data automatically, it introduces noise.
Currently, there are two approaches to the noise problem of distantly supervised relation classification data sets. The first is to use multi-instance learning to tolerate the noise: the training data is divided into bags, each bag containing the sentences that mention the same entity pair, and the model is then trained and tested at the bag level. However, such methods do not perform well at sentence-level prediction. The second is to find noisy data directly: reinforcement learning or adversarial learning is typically used to select high-quality data or eliminate noisy data. However, the computational overhead of these methods is high and their performance leaves room for improvement.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a device and method for denoising distantly supervised relation classification data sets based on natural language inference. The method converts the original relation classification data set into a natural language inference data set; when a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set is available, the natural language inference model is trained with supervised learning, otherwise it is trained with reinforcement learning without depending on the labeled data required by supervised learning; finally, the natural language inference model is used to score the relation classification data set, and high-quality data is selected as the optimized data set according to the scores.
The technical scheme of the invention is as follows:
the invention provides a noise reduction device for deducing a classification data set based on natural language, which comprises:
a data format conversion module for converting the relation classification data set into a natural language inference data set;
a natural language inference model training module for training on the converted natural language inference data set, which trains the model with supervised learning when a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set can be provided, and with reinforcement learning when such data cannot be provided;
and a data set noise reduction module for optimizing the distantly supervised relation classification data set with the trained natural language inference model.
A noise reduction method using the above noise reduction device for classification data sets based on natural language inference comprises the following steps:
converting the format of the data set: the relation classification data set is converted into a natural language inference data set;
training a natural language inference model: the model is trained with supervised learning when high-quality supervised data can be provided, and with reinforcement learning when it cannot;
and denoising the data set: the distantly supervised relation classification data set is optimized with the trained natural language inference model.
The noise reduction device mainly comprises a data format conversion module, a natural language inference model training module and a data set noise reduction module. The data format conversion module constructs a template for each relation type in the relation classification data set, converts each triple in the data set into a hypothesis for natural language inference, and converts the corresponding text corpus into a premise for natural language inference. The natural language inference model training module distinguishes two cases: if high-quality labeled data can be split off from the original data set, that subset is used directly as a training set to train the natural language inference model with supervised learning; if the current data set has a high noise ratio and manual labeling is costly, the denoising performance of the current model on a validation set is used as feedback to train the parameters of the natural language inference model with reinforcement learning. The data set noise reduction module scores the distantly supervised relation classification data set with the trained natural language inference model and, according to the scores, selects the high-confidence data as the denoised data set.
Compared with the prior art, the invention has the beneficial effects that:
1. The method provided by the invention directly eliminates noise from the distantly supervised relation classification data set and does not depend on a bag-level data format, so the proposed data set noise reduction device can be used to optimize relation classification data sets of any form.
2. Other noise reduction methods based on reinforcement learning encode the data simply by concatenating a sentence encoding with a category encoding, whereas the invention converts the noise discovery problem into a natural language inference problem, namely judging whether the hypothesis can be inferred from the premise. Many effective models already exist for natural language inference, so the method has higher computational efficiency and better performance.
Drawings
FIG. 1 is a detailed flow chart of the noise reduction method for distantly supervised relation classification data sets based on natural language inference according to the present invention.
FIG. 2 shows sample templates of the data format conversion module of the present invention.
FIG. 3 is a schematic diagram of an example of the data set noise reduction module of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the drawings and embodiments.
A noise reduction device for distantly supervised relation classification data sets based on natural language inference comprises the following modules:
(1) A data format conversion module: corresponding templates are constructed according to the semantics of each relation type in the relation classification data set, the triples of the relation classification data set are converted into hypotheses for natural language inference, and the texts of the relation classification data set are used as premises for natural language inference, thereby constructing the natural language inference training set.
(2) A natural language inference model training module: when the original data can provide a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set, the model is trained with supervised learning; when the original data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, the natural language inference model is trained with reinforcement learning without depending on the labeled data required by supervised learning.
(3) A data set noise reduction module: the distantly supervised relation classification data set is scored with the trained natural language inference model, and high-confidence data is selected as the optimized data set according to the scores.
A noise reduction method using the above noise reduction device for distantly supervised relation classification data sets specifically comprises the following steps:
the method comprises the following steps: and converting the format of the data set. And constructing corresponding templates according to the semantics of various relations in the relation classification data set, then taking the texts in the relation classification data set as the premise in natural language inference, and converting the triples corresponding to the texts into the hypothesis in the natural language inference through the templates to realize the construction of the natural language inference training set.
Step two: training the natural language inference model. When the original data can provide a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set, the model is trained with supervised learning; when the original data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, the natural language inference model is trained with reinforcement learning.
Step three: data set noise reduction. The distantly supervised relation classification data set is scored with the natural language inference model trained in step two, and high-confidence data is selected as the optimized data set according to the scores.
The invention converts the noise discovery problem of distantly supervised relation classification data sets into a natural language inference problem, scores the relation classification data set with the trained natural language inference model, and filters the data set based on the scores to obtain a clean data set.
A large-scale relation classification data set can be obtained through distant supervision. Although constructing a data set with distant supervision is very efficient, the data set obtained in this way usually contains a great deal of noise. The invention aims to find and eliminate the noisy data in such a data set so as to obtain a high-quality relation classification data set.
Each module is described in detail below.
1. Data format conversion module
The format of a relation classification data set obtained through distant supervision is (h, t, r, text), where text is a piece of text, h and t are two entities in the text, and r is the relation expressed in the text by the entity pair. The relation classification data is converted into the natural language inference input format (P, H), where P is a premise and H is a hypothesis. The specific conversion is as follows: first, a corresponding template is constructed for each relation type in the relation classification data set according to its semantics, as shown in FIG. 2; then the triple (h, t, r) of each original sample is converted into the hypothesis H through the corresponding template, and the text of the original sample is taken as the premise P. In this way, the natural language inference input format corresponding to every sample in the original data set can be obtained.
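As an illustration of this conversion, the following sketch builds (premise, hypothesis) pairs from (h, t, r, text) samples using a small template dictionary; the relation names and template wordings are hypothetical placeholders, since the patent's actual templates are only shown in FIG. 2.
```python
# Minimal sketch of the data format conversion module; the templates below are
# hypothetical examples, not the templates defined in the patent.
from typing import Dict, List, Tuple

# One natural-language pattern per relation type (illustrative only).
TEMPLATES: Dict[str, str] = {
    "place_of_birth": "{h} was born in {t}.",
    "founder_of": "{h} founded {t}.",
    "capital_of": "{h} is the capital of {t}.",
}

def to_nli_sample(h: str, t: str, r: str, text: str) -> Tuple[str, str]:
    """Convert one relation classification sample (h, t, r, text) into an NLI pair (P, H)."""
    premise = text                               # the sentence becomes the premise P
    hypothesis = TEMPLATES[r].format(h=h, t=t)   # the triple becomes the hypothesis H
    return premise, hypothesis

def convert_dataset(samples: List[Tuple[str, str, str, str]]) -> List[Tuple[str, str]]:
    return [to_nli_sample(h, t, r, text) for h, t, r, text in samples]

if __name__ == "__main__":
    sample = ("Paris", "France", "capital_of",
              "Paris, the capital of France, hosted the 1900 Summer Olympics.")
    print(convert_dataset([sample]))
```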
2. Natural language inference model training module
When a sufficiently high-quality supervised data set is available, the natural language inference model can be trained with supervised learning directly on the natural language inference data set produced by the data format conversion module.
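The patent does not name a particular natural language inference architecture; as one possible realization of this supervised branch, the sketch below fine-tunes a pretrained cross-encoder on the converted (premise, hypothesis, label) pairs with the Hugging Face transformers library. The choice of bert-base-chinese, the two-label setup (1 = entailed, 0 = not entailed) and the hyperparameters are assumptions.
```python
# Sketch of the supervised branch: fine-tune a pretrained cross-encoder on the converted
# NLI pairs. Model name, label convention and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def collate(batch):
    premises, hypotheses, labels = zip(*batch)
    enc = tokenizer(list(premises), list(hypotheses),
                    padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

def train_supervised(train_pairs, epochs=3, batch_size=16):
    """train_pairs: list of (premise, hypothesis, label), label 1 = entailed, 0 = not entailed."""
    loader = DataLoader(train_pairs, batch_size=batch_size, shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss      # cross-entropy over the entailment labels
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```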
When a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set cannot be obtained, the natural language inference model is trained with reinforcement learning without depending on the labeled data required by supervised learning. The specific training procedure is as follows (see FIG. 1): a batch of the original relation classification data set is selected and format-converted, the converted samples are scored by the natural language inference model, and the corresponding relation classification data is selected according to the scores. A relation classifier is then trained on the selected relation classification data, its performance on the validation set is used as the reinforcement learning feedback, and the natural language inference model is updated according to the selection results and the corresponding feedback. The specific parameter update formula is:
\Phi \leftarrow \Phi + \beta\,\bigl(\mathrm{F1}(f_\theta(x_v), r_v) - \delta\bigr)\cdot\frac{1}{B_s}\sum_{d_i \in D_B} \nabla_\Phi \log \pi_\Phi(a_i \mid d_i)
where Φ denotes the parameters of the natural language inference model, β the learning rate, x_v and r_v the input and the labels of the validation set, D_B a batch of data from the training set, B_s the size of that batch, a_i ∈ {0, 1} the selection made by the current natural language inference model for instance d_i of D_B (1 means the instance is kept, 0 means it is discarded), f_θ the relation classification model trained on the instances of D_B that were kept, F1(f_θ(x_v), r_v) the F1 value of f_θ on the validation set, δ the moving average of the F1 values, and π_Φ(a_i | d_i) the probability that the natural language inference model with parameters Φ assigns to the selection a_i for instance d_i.
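A minimal sketch of this policy-gradient update is given below; the toy policy network and the stub functions train_relation_classifier and f1_on_validation are illustrative placeholders (assumptions), not the models described in the patent.
```python
# REINFORCE-style sketch of the reinforcement learning branch. The NLI policy is reduced
# to a toy network over precomputed pair features; the relation classifier and its
# validation F1 are stubbed out. All of these stand-ins are assumptions for illustration.
import torch
import torch.nn as nn

class TinyNliPolicy(nn.Module):
    """Maps a feature vector of a (premise, hypothesis) pair to the keep-probability pi(a=1|d)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:      # feats: (B_s, dim)
        return torch.sigmoid(self.net(feats)).squeeze(-1)        # keep-probabilities

def train_relation_classifier(kept_feats: torch.Tensor):
    return kept_feats                  # stub: stands in for training f_theta on the kept data

def f1_on_validation(classifier) -> float:
    return float(torch.rand(()))       # stub: stands in for F1(f_theta(x_v), r_v)

def reinforce_step(policy, optimizer, batch_feats, f1_moving_avg, ema: float = 0.9) -> float:
    p_keep = policy(batch_feats)                       # pi_Phi(a_i = 1 | d_i)
    actions = torch.bernoulli(p_keep).detach()         # a_i in {0, 1}: 1 keep, 0 discard
    classifier = train_relation_classifier(batch_feats[actions.bool()])
    f1 = f1_on_validation(classifier)
    reward = f1 - f1_moving_avg                        # advantage against the moving-average baseline delta
    log_prob = actions * torch.log(p_keep + 1e-8) + (1 - actions) * torch.log(1 - p_keep + 1e-8)
    loss = -(reward * log_prob.mean())                 # gradient ascent on the update formula above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return ema * f1_moving_avg + (1 - ema) * f1        # updated moving average delta

if __name__ == "__main__":
    policy = TinyNliPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    delta = 0.0
    for _ in range(5):                                 # a few mock batches D_B
        delta = reinforce_step(policy, opt, torch.randn(8, 32), delta)
```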
3. Data set noise reduction module
The data set noise reduction module scores and filters the full distantly supervised relation classification data set. The relation classification data set is first converted into the corresponding natural language inference input format by the data format conversion module; a fully trained natural language inference model is then obtained from the natural language inference model training module and used to score the data set. The lower the score, the less the hypothesis can be inferred from the premise, that is, the less the triple of the original sample is actually supported by the corresponding text. Therefore, data with higher scores is selected as the optimized data set, and data with lower scores is removed from the original data set as noise. A specific example of this module is shown in FIG. 3.
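As an illustration of this step, the sketch below scores each converted pair with the entailment probability of a trained natural language inference model and keeps the samples above a score threshold; the scoring model, the label index used for entailment and the fixed threshold of 0.5 are assumptions, since the patent only requires ranking by score and keeping high-confidence data.
```python
# Sketch of the data set noise reduction module: score (premise, hypothesis) pairs with a
# trained NLI cross-encoder and keep high-scoring samples. Model name, label index and
# threshold are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
nli_model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
nli_model.eval()   # assumed to have been fine-tuned by the training module beforehand

def nli_score(premise: str, hypothesis: str) -> float:
    """Probability that the hypothesis is entailed by the premise (index 1 = entailed, assumed)."""
    enc = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(nli_model(**enc).logits, dim=-1)
    return probs[0, 1].item()

def denoise(samples, templates, threshold: float = 0.5):
    """Keep samples (h, t, r, text) whose converted NLI pair scores above the threshold."""
    kept = []
    for h, t, r, text in samples:
        hypothesis = templates[r].format(h=h, t=t)   # same conversion as in the first module
        if nli_score(text, hypothesis) >= threshold:
            kept.append((h, t, r, text))
    return kept
```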

Claims (8)

1. A noise reduction apparatus for classification data sets based on natural language inference, comprising:
a data format conversion module for converting the relation classification data set into a natural language inference data set;
a natural language inference model training module for training on the converted natural language inference data set, wherein the natural language inference model training module trains the model with supervised learning when a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set can be provided, and trains the model with reinforcement learning when such supervised data cannot be provided;
and a data set noise reduction module for optimizing the distantly supervised relation classification data set with the trained natural language inference model.
2. The noise reduction apparatus for classification data sets based on natural language inference according to claim 1, wherein the data format conversion module constructs corresponding templates according to the semantics of each relation type in the relation classification data set, converts the triples of the relation classification data set into hypotheses for natural language inference, and uses the texts of the relation classification data set as premises for natural language inference, thereby constructing a natural language inference training set.
3. The noise reduction apparatus according to claim 1, wherein the natural language inference model training module trains the model with supervised learning when the raw data provides correctly labeled data whose distribution matches the full distantly supervised relation classification data set, and trains the natural language inference model with reinforcement learning, without relying on the labeled data required by supervised learning, when the raw data contains a large amount of noise.
4. The noise reduction apparatus according to claim 1, wherein the data set noise reduction module scores the distantly supervised relation classification data set with the trained natural language inference model and selects high-confidence data as the optimized data set according to the scores.
5. A noise reduction method using the noise reduction apparatus for classification data sets based on natural language inference according to claim 1, comprising the following steps:
converting the format of the data set: converting the relation classification data set into a natural language inference data set;
training a natural language inference model: training the model with supervised learning when a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set can be provided, and with reinforcement learning when such data cannot be provided;
and denoising the data set: optimizing the distantly supervised relation classification data set with the trained natural language inference model.
6. The noise reduction method according to claim 5, wherein in the data format conversion step, corresponding templates are constructed according to the semantics of each relation type in the relation classification data set, the triples of the relation classification data set are converted into hypotheses for natural language inference, and the texts of the relation classification data set are used as premises for natural language inference, thereby constructing a natural language inference training set.
7. The noise reduction method according to claim 5, wherein the natural language inference model is trained with supervised learning when the raw data provides a large amount of correctly labeled supervised data whose distribution matches the full distantly supervised relation classification data set, and with reinforcement learning when the raw data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly.
8. The noise reduction method according to claim 5, wherein in the data set noise reduction step, the distantly supervised relation classification data set is scored with the trained natural language inference model, and high-confidence data is selected as the optimized data set according to the scores.
CN202110918801.6A 2021-08-11 2021-08-11 Noise reduction device and method for classification data sets based on natural language inference Pending CN113822330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918801.6A CN113822330A (en) Noise reduction device and method for classification data sets based on natural language inference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110918801.6A CN113822330A (en) Noise reduction device and method for classification data sets based on natural language inference

Publications (1)

Publication Number Publication Date
CN113822330A true CN113822330A (en) 2021-12-21

Family

ID=78913098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918801.6A Pending CN113822330A (en) Noise reduction device and method for classification data sets based on natural language inference

Country Status (1)

Country Link
CN (1) CN113822330A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9536522B1 (en) * 2013-12-30 2017-01-03 Google Inc. Training a natural language processing model with information retrieval model annotations
CN109766546A (en) * 2018-12-25 2019-05-17 华东师范大学 A kind of natural language inference method neural network based
US20200279108A1 (en) * 2019-03-01 2020-09-03 Iqvia Inc. Automated classification and interpretation of life science documents
CN110209836A (en) * 2019-05-17 2019-09-06 北京邮电大学 Remote supervisory Relation extraction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BO XU, XIANGSAN ZHAO, CHAOFENG SHA, MINJUN ZHANG, HUI SONG: "Reinforced Natural Language Inference for Distantly Supervised Relation Classification", pages 364 - 376, XP047596626, Retrieved from the Internet <URL:https://link.springer.com/chapter/10.1007/978-3-030-75768-7_29> DOI: 10.1007/978-3-030-75768-7_29 *

Similar Documents

Publication Publication Date Title
CN108984683B (en) Method, system, equipment and storage medium for extracting structured data
CN110298032B (en) Text classification corpus labeling training system
WO2018218708A1 (en) Deep-learning-based public opinion hotspot category classification method
CN109635108B (en) Man-machine interaction based remote supervision entity relationship extraction method
CN112487143A (en) Public opinion big data analysis-based multi-label text classification method
CN112269868B (en) Use method of machine reading understanding model based on multi-task joint training
CN116629275B (en) Intelligent decision support system and method based on big data
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN110866113B (en) Text classification method based on sparse self-attention mechanism fine-tuning burt model
CN112883153B (en) Relationship classification method and device based on information enhancement BERT
CN112307130B (en) Document-level remote supervision relation extraction method and system
CN111125370A (en) Relation extraction method suitable for small samples
CN111368563A (en) Clustering algorithm fused dimension-Chinese machine translation system
CN114064915A (en) Domain knowledge graph construction method and system based on rules and deep learning
CN113822330A (en) Noise reduction device and method for classification data sets based on natural language inference
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN116432664A (en) Dialogue intention classification method and system for high-quality data amplification
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN116595169A (en) Question-answer intention classification method for coal mine production field based on prompt learning
CN111708896B (en) Entity relationship extraction method applied to biomedical literature
CN114372138A (en) Electric power field relation extraction method based on shortest dependence path and BERT
CN113220892A (en) BERT-based self-adaptive text classification method and device
CN111680163A (en) Knowledge graph visualization method for electric power scientific and technological achievements
CN110070093B (en) Remote supervision relation extraction denoising method based on countermeasure learning
CN114936283B (en) Network public opinion analysis method based on Bert

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination