CN113822330A - Noise reduction device and method based on natural language inference classification data set - Google Patents
- Publication number
- CN113822330A (application CN202110918801.6A)
- Authority
- CN
- China
- Prior art keywords
- natural language
- data set
- language inference
- data
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Abstract
The invention discloses a noise reduction device and method for a relation classification data set based on natural language inference. The data format conversion module of the invention constructs a template for each relation type in the relation classification data set, converts each triple in the data set into a hypothesis for natural language inference, and converts the corresponding text corpus into a premise. If high-quality labeled data can be separated from the original data set, that data is used directly as a training set to train the natural language inference model with supervised learning; if the noise proportion of the data set is large and manual labeling is costly, the noise reduction performance of the current model on a validation set is used as feedback to train the parameters of the natural language inference model with a reinforcement learning method. The data set noise reduction module then scores the relation classification data set obtained by remote supervision with the trained natural language inference model and selects the high-confidence data as the denoised data set.
Description
Technical Field
The invention belongs to the technical field of data processing methods, and particularly relates to a remote supervision relation classification data set noise reduction device and method based on natural language inference.
Background
The task of relation classification is to predict the semantic relationship between two entities from a given text. Knowledge of entity relationships is essential to many downstream applications, such as knowledge graph completion and question answering. Relation classification typically relies on large-scale manually annotated data, which is expensive and time-consuming to produce. To address this problem, remote supervision is often used to automatically annotate large volumes of corpora. Remote supervision is based on the assumption that if a sentence contains an entity pair that holds some relation in the knowledge base, the sentence can be assumed to express that relation. Although this method automatically yields large-scale labeled data, it introduces a noise problem.
Currently, there are two approaches to the noise problem of remotely supervised relation classification data sets. The first is to tolerate the noise with multi-instance learning: the training data is divided into bags, each bag containing the sentences that mention the same entity pair, and the model is then trained and tested at the bag level. However, such methods do not perform well in sentence-level prediction. The second is to find the noisy data directly, typically using reinforcement learning or adversarial learning to select high-quality data or eliminate noisy data. However, the computational overhead of these methods is high, and their performance leaves room for improvement.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a remote supervision relation classification data set noise reduction device and method based on natural language inference. The method converts the original relation classification data set into a natural language inference data set. When a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set can be provided, the model is trained with supervised learning; otherwise the model is trained with reinforcement learning, without relying on the labeled data that supervised learning requires. Finally, the natural language inference model scores the relation classification data set, and high-quality data are selected as the optimized data set according to the scores.
The technical scheme of the invention is specifically introduced as follows:
the invention provides a noise reduction device for deducing a classification data set based on natural language, which comprises:
a data format conversion module for converting the classified data set into a natural language inferred data set;
a natural language inference model training module for training on the converted natural language inference data set: when a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set can be provided, the module trains the model with supervised learning; otherwise it trains the model with a reinforcement learning method;
and the data set noise reduction module is used for optimizing the remote supervision relation classification data set by utilizing the trained natural language inference model.
A noise reduction method using the above noise reduction device based on the natural language inference classification data set comprises the following steps:
converting the format of the data set, converting the relation classification data set into a natural language inference data set;
training a natural language inference model, using supervised learning when high-quality supervised data can be provided and a reinforcement learning method otherwise;
and denoising the data set, and optimizing the remote supervision relation classification data set through a trained natural language inference model.
The noise reduction device mainly comprises a data format conversion module, a natural language inference model training module and a data set noise reduction module. The data format conversion module constructs a template for each relation type in the relation classification data set, converts each triple into a hypothesis for natural language inference, and converts the corresponding text corpus into a premise. The natural language inference model training module handles two cases: if high-quality labeled data can be separated from the original data set, that data is used directly as a training set to train the natural language inference model with supervised learning; if the noise proportion of the current data set is large and manual labeling is costly, the noise reduction performance of the current model on a validation set is used as feedback to train the parameters of the natural language inference model with a reinforcement learning method. The data set noise reduction module scores the relation classification data set obtained by remote supervision with the trained natural language inference model and selects the high-confidence data as the denoised data set.
Compared with the prior art, the invention has the beneficial effects that:
1. The method provided by the invention directly eliminates noise in the remotely supervised relation classification data set and does not depend on a bag-level data format, so the proposed noise reduction device can optimize relation classification data sets of any form.
2. Other noise reduction methods based on reinforcement learning encode the data simply by concatenating a sentence encoding with a category encoding, whereas the present invention converts the noise discovery problem into a natural language inference problem: deciding whether the hypothesis can be inferred from the premise. Many effective models already exist for natural language inference, so the method is more computationally efficient and achieves better results.
Drawings
FIG. 1 is a detailed flow chart of the noise reduction method for a remote supervised relational classification dataset based on natural language inference of the present invention.
Fig. 2 is a template sample diagram of the data format conversion module of the present invention.
FIG. 3 is a schematic diagram of an example of a data noise reduction module of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail in the following by combining the drawings and the examples.
A noise reduction device for a remote supervision relation classification data set based on natural language inference comprises the following modules:
(1) A data format conversion module: constructs corresponding templates according to the semantics of each relation type in the relation classification data set, converts the triples into hypotheses for natural language inference, and takes the texts in the relation classification data set as the premises, thereby building a natural language inference training set.
(2) A natural language inference model training module: when the original data can provide a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set, the model is trained with supervised learning; when the original data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, the natural language inference model is trained with reinforcement learning, without relying on the labeled data that supervised learning requires.
(3) A data set noise reduction module: scores the relation classification data set obtained by remote supervision with the trained natural language inference model, and selects the high-confidence data as the optimized data set according to the scores.
The noise reduction method of the remotely supervised relation classification data set noise reduction device specifically comprises the following steps:
the method comprises the following steps: and converting the format of the data set. And constructing corresponding templates according to the semantics of various relations in the relation classification data set, then taking the texts in the relation classification data set as the premise in natural language inference, and converting the triples corresponding to the texts into the hypothesis in the natural language inference through the templates to realize the construction of the natural language inference training set.
Step two: train a natural language inference model. When the original data can provide a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set, train the model with supervised learning; when the original data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly, train the natural language inference model with reinforcement learning.
Step three: denoise the data set. Score the remotely supervised relation classification data set with the natural language inference model trained in step two, then select the high-confidence data as the optimized data set according to the scores.
The invention converts the noise discovery problem of the remote supervision relational classification data set into the natural language inference problem, scores the relational classification data set by using the trained natural language inference model, and filters the data set based on the scores to obtain a clean data set.
A large-scale relational classification data set can be obtained through a remote supervision mode. Although the way in which the data set is constructed using remote supervision is quite efficient, the data set obtained in this way usually contains a lot of noise. The invention aims to find noise data in a data set and eliminate the noise data so as to obtain a high-quality relational classification data set.
As will be explained in detail below.
1. Data format conversion module
The format of a relation classification data set obtained by remote supervision is (h, t, r, text), where text is a sentence, h and t are two entities in the sentence, and r is the relation the entity pair expresses in the sentence. The relation classification data are converted into the natural language inference input format (P, H), where P is a premise and H is a hypothesis. The specific conversion method is as follows: first, corresponding templates are constructed according to the semantics of each relation type in the data set, as shown in fig. 2; then the triple (h, t, r) in each original sample is converted into the hypothesis H through its template, and the text of the sample becomes the premise P. In this way, the natural language inference input format corresponding to every sample in the original data set is obtained.
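As a concrete illustration, the sketch below maps one (h, t, r, text) sample to a (P, H) pair. The relation names and template wordings are illustrative assumptions of ours, not the patent's actual templates (those appear only in its figures).

```python
# Hedged sketch of the (h, t, r, text) -> (P, H) conversion described above.
# The relation names and template wordings are illustrative assumptions.

# One natural-language template per relation type; {h} and {t} are filled
# with the head and tail entity of the triple.
RELATION_TEMPLATES = {
    "place_of_birth": "{h} was born in {t}.",
    "employer": "{h} works for {t}.",
    "founded": "{h} founded {t}.",
}

def to_nli_pair(h, t, r, text):
    """Convert one relation-classification sample into an NLI (premise, hypothesis) pair."""
    premise = text                                        # the source sentence becomes P
    hypothesis = RELATION_TEMPLATES[r].format(h=h, t=t)   # the triple (h, t, r) becomes H
    return premise, hypothesis

premise, hypothesis = to_nli_pair(
    "Alice", "Paris", "place_of_birth",
    "Alice spent her early years in Paris before moving abroad.")
```

Applied to every sample, this yields the natural language inference training set that the later modules consume.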
2. Natural language inference model training module
When a sufficiently high quality supervised data set is available, the natural language inference model can be trained by supervised learning directly using the natural language inference data set obtained by the data format conversion module.
When a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set cannot be obtained, the natural language inference model is trained with reinforcement learning, without relying on the labeled data that supervised learning requires. The specific training method is as follows: as shown in fig. 1, a batch of the original relation classification data set is selected and format-converted, then scored by the natural language inference model, and a subset of the batch is selected according to the scores. A relation classifier is trained on the selected data, its performance on the validation set is taken as the reinforcement learning feedback, and the natural language inference model is updated according to the selection result and the corresponding feedback. The specific parameter update formula is as follows:
$$\Theta \leftarrow \Theta + \beta \left(\mathrm{F1}\big(f_\theta(x_v), r_v\big) - \delta\right) \frac{1}{B_s} \sum_{d \in D_B} \nabla_\Theta \log P_\Theta(a_d \mid d)$$
where $\Theta$ denotes the parameters of the natural language inference model, $\beta$ the learning rate, $x_v$ and $r_v$ the input and labels of the validation set, $D_B$ a batch of data from the training set, $B_s$ the size of that batch, $a_d \in \{0, 1\}$ the selection the current natural language inference model makes for sample $d$ in $D_B$ (1 means the sample is kept, 0 means it is discarded), $f_\theta$ the relation classification model trained on $D_B^{adopt}$, the subset of $D_B$ kept by the screening, $\mathrm{F1}(f_\theta(x_v), r_v)$ the F1 value of $f_\theta$ on the validation set, $\delta$ the moving average of the F1 values, and $P_\Theta(a_d \mid d)$ the probability under parameters $\Theta$ that the natural language inference model makes the selection $a_d$.
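The update described above is a REINFORCE-style policy gradient with a moving-average baseline. The toy sketch below illustrates the loop under strong simplifying assumptions: the natural language inference model is replaced by a logistic keep/discard policy over synthetic feature vectors, and "train the relation classifier and measure its validation F1" is replaced by a stand-in reward. All names and numbers here are our illustrative choices, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_prob(theta, x):
    """P_Theta(keep | x): probability that the policy keeps sample x."""
    return 1.0 / (1.0 + np.exp(-x @ theta))

def reward(kept_mask, quality):
    """Stand-in for training f_theta on the kept subset and computing its
    validation F1; here it is simply the mean 'quality' of kept samples."""
    return float(quality[kept_mask].mean()) if kept_mask.any() else 0.0

theta = np.zeros(3)                     # policy parameters Theta
beta, delta = 0.5, 0.0                  # learning rate, moving-average baseline
X = rng.normal(size=(64, 3))            # toy batch D_B (features of 64 samples)
quality = (X[:, 0] > 0).astype(float)   # hidden clean-vs-noisy signal

for _ in range(300):
    p = select_prob(theta, X)
    actions = rng.random(len(p)) < p    # sample a_d in {0, 1}: keep or discard
    r = reward(actions, quality)        # stand-in for F1(f_theta(x_v), r_v)
    # gradient of (1/B_s) * sum_d log P_Theta(a_d | d) for a logistic policy
    grad = ((actions - p)[:, None] * X).mean(axis=0)
    theta += beta * (r - delta) * grad  # the parameter update formula above
    delta = 0.9 * delta + 0.1 * r       # update the moving-average baseline
```

In this toy setup the policy is expected to drift toward keeping samples with a positive first feature, since those raise the stand-in reward.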
3. Data set noise reduction module
The data set noise reduction module scores and screens the full remotely supervised relation classification data set. The data set is converted into the natural language inference input format through the data format conversion module, a fully trained natural language inference model is obtained through the training module, and this model scores the data set. The lower the score, the less the hypothesis can be derived from the premise, that is, the less the triple of the original sample is actually expressed by the corresponding text. Therefore, data with higher scores are selected as the optimized data set, and data with lower scores are removed from the original data set as noise. A specific example of this module is shown in fig. 3.
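A minimal sketch of this scoring-and-filtering step is shown below. The entailment scorer is passed in as a function standing in for the trained natural language inference model, and the 0.5 threshold is an assumed hyperparameter; neither is specified by the patent.

```python
def denoise(dataset, to_nli, score_entailment, threshold=0.5):
    """Split a remotely supervised dataset into kept (high-score) and
    dropped (low-score, treated as noise) samples.

    to_nli: maps a sample to its (premise, hypothesis) pair.
    score_entailment: stand-in for the trained NLI model; returns the
    probability that the premise entails the hypothesis.
    """
    kept, dropped = [], []
    for sample in dataset:
        premise, hypothesis = to_nli(sample)
        if score_entailment(premise, hypothesis) >= threshold:
            kept.append(sample)      # high confidence: retain in the optimized set
        else:
            dropped.append(sample)   # low confidence: remove as noise
    return kept, dropped

# Toy usage with a fake lexical-overlap scorer (a real system would call the
# trained natural language inference model here).
samples = [
    ("A", "B", "founded", "A founded B in 1999."),
    ("C", "D", "founded", "C visited D last year."),
]
to_nli = lambda s: (s[3], f"{s[0]} founded {s[1]}.")
fake_score = lambda p, h: 1.0 if "founded" in p else 0.0
kept, dropped = denoise(samples, to_nli, fake_score)
```

The kept list then serves as the optimized relation classification data set.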
Claims (8)
1. A noise reduction apparatus for inferring a classification dataset based on natural language, comprising:
a data format conversion module for converting the classified data set into a natural language inferred data set;
the natural language inference model training module is used for training on the converted natural language inference data set: when a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set can be provided, the module trains the model with supervised learning, and otherwise trains the model with a reinforcement learning method;
and the data set noise reduction module is used for optimizing the remote supervision relation classification data set by utilizing the trained natural language inference model.
2. The noise reduction apparatus based on the natural language inference classification data set according to claim 1, wherein the data format conversion module constructs corresponding templates according to the semantics of each relation type in the relation classification data set, converts the triples into hypotheses for natural language inference, and takes the texts in the relation classification data set as the premises, thereby building a natural language inference training set.
3. The apparatus according to claim 1, wherein the natural language inference model training module uses a supervised learning training model when the raw data provides labeled data with correct label and consistent distribution with the full-scale remote supervised relational classification data set, and uses reinforcement learning to train the natural language inference model without relying on labeled data required by the supervised learning when the raw data contains a lot of noise.
4. The noise reduction device based on the natural language inference classification dataset of claim 1, wherein the dataset noise reduction module scores the relational classification dataset obtained by remote supervision through a trained natural language inference model, and selects data with high confidence as the optimized dataset according to the score.
5. A noise reduction method using the noise reduction apparatus based on the natural language inference classification data set of claim 1, comprising the steps of:
converting the format of the data set, converting the relation classification data set into a natural language inference data set;
training a natural language inference model, using supervised learning when a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set can be provided, and a reinforcement learning method otherwise;
and denoising the data set, and optimizing the remote supervision relation classification data set through a trained natural language inference model.
6. The noise reduction method according to claim 5, wherein in the data format conversion step, corresponding templates are constructed according to the semantics of each relation type in the relation classification data set, the triples are converted into hypotheses for natural language inference, and the texts in the relation classification data set are used as the premises, thereby building a natural language inference training set.
7. The noise reduction method according to claim 5, wherein the natural language inference model is trained with supervised learning when the raw data provides a large amount of correctly labeled supervised data whose distribution is consistent with the full remotely supervised relation classification data set, and with reinforcement learning when the raw data contains a large amount of noise and obtaining a large-scale clean data set for supervised learning is costly.
8. The noise reduction method according to claim 5, wherein in the noise reduction process of the data set, the trained natural language inference model is used for scoring the relation classification data set obtained by remote supervision, and data with high confidence coefficient is selected as the optimized data set according to the score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110918801.6A CN113822330A (en) | 2021-08-11 | 2021-08-11 | Noise reduction device and method based on natural language inference classification data set |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113822330A true CN113822330A (en) | 2021-12-21 |
Family
ID=78913098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110918801.6A Pending CN113822330A (en) | 2021-08-11 | 2021-08-11 | Noise reduction device and method based on natural language inference classification data set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113822330A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9536522B1 (en) * | 2013-12-30 | 2017-01-03 | Google Inc. | Training a natural language processing model with information retrieval model annotations |
CN109766546A (en) * | 2018-12-25 | 2019-05-17 | 华东师范大学 | A kind of natural language inference method neural network based |
CN110209836A (en) * | 2019-05-17 | 2019-09-06 | 北京邮电大学 | Remote supervisory Relation extraction method and device |
US20200279108A1 (en) * | 2019-03-01 | 2020-09-03 | Iqvia Inc. | Automated classification and interpretation of life science documents |
Non-Patent Citations (1)
Title |
---|
BO XU, XIANGSAN ZHAO, CHAOFENG SHA, MINJUN ZHANG, HUI SONG: "Reinforced Natural Language Inference for Distantly Supervised Relation Classification", pages 364 - 376, XP047596626, Retrieved from the Internet <URL:https://link.springer.com/chapter/10.1007/978-3-030-75768-7_29> DOI: 10.1007/978-3-030-75768-7_29 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||