CN111178387A

CN111178387A - Label noise detection method based on multi-granularity relative density

Info

Publication number: CN111178387A
Application number: CN201911222298.XA
Authority: CN
Inventors: 夏书银; 梁潇; 刘群; 王炳贵; 陈百云
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2020-05-19

Abstract

The invention discloses a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification. The method comprises the following steps. S1: the data set is divided into K clusters using the KMeans algorithm, and the improved relative density of each sample at that granularity is calculated. The improved relative density is defined as follows: first, the centroids of the positive and negative samples are calculated; then the distances from each sample to its same-class centroid and to its heterogeneous (other-class) centroid are calculated; the ratio of these two distances is the improved relative density at that granularity. S2: the value of K is changed, the process in S1 is repeated, and the improved relative density of each sample at different granularities is calculated. S3: samples whose improved relative density exceeds a given threshold are taken as label noise. The invention introduces granularity calculation into the improved relative density model and is more efficient than conventional methods.

Description

Label noise detection method based on multi-granularity relative density

Technical Field

The invention relates to a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification.

Background

Real-world data is never perfect, and noisy data is one consequence of its defects. Noise processing is an important task in machine learning. In classification problems, noise falls mainly into two categories: attribute noise and label noise. Attribute noise is caused by errors made when attribute values are entered, while label noise is caused by contaminated labels. Label noise is generally more harmful than attribute noise, for two reasons. First, a sample may have many features but only one label. Second, although each feature has its own importance, the label always has a greater impact on learning. Label noise degrades classifier performance and increases model complexity. Label noise also differs substantially from outliers. Outlier detection is an important step in many data mining tasks, but outlying samples sometimes have no effect on the classification result, whereas label noise does. In many practical scenarios, detecting and handling noise has proven beneficial, and it has become an important area of machine learning.

Label noise refers to samples whose labels are not recorded correctly; it is caused mainly by insufficient information and by coding and communication errors. In practice, such noise is common. First, the information provided to experts may not be sufficient for them to label reliably. For example, in many interactive Web programs, the labels of the data are obtained from real-time user feedback. In the medical field, test results are often unknown or incomplete, and the information conveyed in medical language may be too limited; such incomplete information can also cause label noise. Furthermore, in some cases the quality of the information is poor or its accuracy is uncertain: a patient may answer incorrectly during an illness, and may answer differently when the same question is repeated, which also readily leads to label noise. Second, the manual labeling itself may be in error. Such errors are not always made by human experts, since automatic classification devices are now also used in various scenarios. Furthermore, because collecting reliable labels is time-consuming and expensive, it is common to obtain labels from individual practitioners relying on their own expertise, and labels obtained this way are less reliable. Third, labeling can be a subjective act: in medical image analysis, for example, some experts may adjust labels according to their own view of the actual conditions, which can also introduce label noise. Fourth, label noise may simply come from data coding or communication problems.

Label noise affects the learning performance of classifiers in different ways. Even when label noise is uniformly distributed, most loss functions are not robust to it, including the exponential loss of AdaBoost, the logarithmic loss of logistic regression, and the hinge loss of SVM. AdaBoost spends a great deal of time fitting label noise because it increases the weights of misclassified samples. Label noise also affects the learning process of BP neural networks and decision trees, the selection of k in the kNN algorithm, and the selection of kernel parameters in kernel classifiers.

Noise samples seriously reduce the efficiency of data mining and modeling and can even bias the mining results. They often make the classifier inefficient, cause overfitting, and seriously degrade classifier performance. A series of preprocessing steps is therefore required before classifier training. To apply label noise detection to classifier optimization, the key is to detect label noise efficiently. To date, two main families of label noise detection techniques have been studied extensively: classifier-based and distance-measurement-based. The first family depends on training a specific classifier, so it is difficult to guarantee efficient and reliable detection of label noise, and such conventional methods have certain limitations. For the second family, because label noise resembles anomalies to some extent, some distance-based anomaly detection techniques can be used to detect label noise. However, these anomaly detection methods are unsupervised, cannot fully exploit the contrast between classes that characterizes label noise, and handle complex data, such as high-dimensional and unbalanced data, poorly.

Disclosure of Invention

The invention aims to provide a label noise detection method based on multi-granularity relative density, which introduces granularity calculation into an improved relative density model and is an efficient label noise detection method.

In order to achieve the purpose, the invention adopts the technical scheme that:

a label noise detection method based on multi-granularity relative density comprises the following steps:

the data set D is divided into K clusters by the KMeans algorithm, and the data of each cluster comprise a positive class sample set $C_i^+$ and a negative class sample set $C_i^-$.

The improved relative density of each sample is calculated.

The value of K is changed, and the two steps above are repeated.

The mean of all improved relative densities of each sample is taken.

The mean is compared with a noise threshold, and the sample is judged to be noise if its mean is greater than the noise threshold.

The above calculating the improved relative density comprises the steps of:

calculating the centroids of the positive class sample set $C_i^+$ and the negative class sample set $C_i^-$, the positive class centroid being denoted $c^+$ and the negative class centroid $c^-$;

calculating the distance from each sample to its same-class centroid and to its different-class centroid;

taking the ratio of the two distances obtained above as the improved relative density; specifically, D is composed of a positive class data set $D^+$ and a negative class data set $D^-$;

for any point $o^+ \in D^+$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$

for any point $o^- \in D^-$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$

where $\mathrm{distance}(o^+, c^+)$ and $\mathrm{distance}(o^+, c^-)$ respectively represent the distance from $o^+$ to its homogeneous centroid and to its heterogeneous centroid, and $\mathrm{distance}(o^-, c^-)$ and $\mathrm{distance}(o^-, c^+)$ respectively represent the distance from $o^-$ to its homogeneous centroid and to its heterogeneous centroid.
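The calculation above can be illustrated by the following minimal sketch for a single cluster. It is not part of the original disclosure: the function name `improved_relative_density`, the encoding of labels as +1 and -1, the use of Euclidean distance, and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
import numpy as np

def improved_relative_density(X, y, eps=1e-12):
    """Ratio of same-class to other-class centroid distance for each sample.

    X: (n, d) samples of one cluster; y: (n,) labels, +1 or -1 (assumed encoding).
    Assumes both classes are present and that distance is Euclidean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    c_pos = X[y == 1].mean(axis=0)    # positive class centroid c+
    c_neg = X[y == -1].mean(axis=0)   # negative class centroid c-
    d_pos = np.linalg.norm(X - c_pos, axis=1)
    d_neg = np.linalg.norm(X - c_neg, axis=1)
    same = np.where(y == 1, d_pos, d_neg)    # distance to homogeneous centroid
    other = np.where(y == 1, d_neg, d_pos)   # distance to heterogeneous centroid
    return same / np.maximum(other, eps)     # eps avoids division by zero

# The last sample carries a positive label but lies among the negative samples.
X = [[0, 0], [0, 2], [10, 0], [10, 2], [9, 1]]
y = [1, 1, -1, -1, 1]
ird = improved_relative_density(X, y)
print(float(ird[4]))  # 6.0
```

In this toy data the positive centroid is (3, 1) and the negative centroid is (10, 1), so the last sample is 6 units from its own class centroid and 1 unit from the other, giving a ratio greater than 1, while the other four samples have ratios below 1.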

The invention adopting the technical scheme has the beneficial effects that:

(1) the ratio of the distances from each sample to its same-class centroid and to its heterogeneous centroid is used as the result at each granularity, whereas the original relative density model must calculate the distance between each sample and all other samples; the time cost is therefore greatly reduced, which solves the low speed of the original relative density algorithm;

(2) multi-granularity is introduced into the relative density model, which reduces the risk of overfitting and gives the model better generalization capability.
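The time saving in effect (1) can be made concrete with a rough count of distance evaluations. The sketch below is only illustrative: it assumes the original model evaluates every unordered pair of samples once, it counts only the two centroid distances per sample per granularity, and it leaves out the cost of the KMeans clustering itself. The function names and the values of `n` and the number of granularities are hypothetical.

```python
def pairwise_distance_count(n):
    """Distances needed when every sample is compared with every other sample."""
    return n * (n - 1) // 2

def centroid_distance_count(n, granularities):
    """Distances needed when each sample is compared with two centroids per granularity."""
    return 2 * n * granularities

n = 100_000
print(pairwise_distance_count(n))      # 4999950000
print(centroid_distance_count(n, 10))  # 2000000
```

The first count grows quadratically with the number of samples and the second grows linearly, which is consistent with the larger gains reported on large data sets.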

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of the label noise detection method based on multi-granularity relative density (MGRD);

FIG. 2 is a schematic diagram of label noise.

Detailed Description

The invention relates to a label noise detection method based on multi-granularity relative density. Handling label noise is a research hotspot in the field of machine learning and spans many practical application fields. Currently, the mainstream approaches are classified into noise accommodation and noise filtering. When training data is contaminated with label noise, one obvious solution is to clean the training data itself, much as in outlier or anomaly detection. The present scheme belongs to noise filtering.

S1: the data set is divided into K clusters, and the improved relative density of each sample at that granularity is calculated. The improved relative density is defined as follows: first, the centroids of the positive and negative samples are calculated; then the distances from each sample to its same-class centroid and to its heterogeneous centroid are calculated; the ratio of these two distances is the improved relative density at that granularity. S2: the data set is divided into K clusters again, with a value of K different from the previous one, the process in S1 is repeated, and the improved relative density of each sample at different granularities is calculated. S3: samples whose improved relative density exceeds a given threshold are taken as label noise. The invention introduces granularity calculation into the improved relative density model and is more efficient than conventional methods.

As shown in fig. 1, the specific implementation flow of the present invention is as follows:

Step S101, the KMeans algorithm is used to divide the data set D into K clusters, and the data of each cluster comprise a positive class sample set $C_i^+$ and a negative class sample set $C_i^-$.

Step S102, the distance between two samples $x_i$ and $x_j$ is defined as:

Step S103, the absolute density is defined according to the sample distance. Given a data set D, for any $p \in D$ and $q \in D$, $d(p, q)$ represents the distance from p to q and is an abbreviation of $\mathrm{distance}(p, q)$. The neighbors of p are represented as $N_k(p)$, where $N_k(p) = \{q \in D \mid d(p, q) < k\_distance(p)\}$, k represents the number of neighboring points, and $k\_distance(p)$ represents the distance between the sample and the farthest of its k neighboring points. The absolute density of p is represented as:

$$\mathrm{AbsoluteDensity}(p, D) = \frac{k}{\sum_{q \in N_k(p)} d(p, q)}$$
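The absolute density of step S103 can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the distance is taken to be Euclidean, the set `D` passed in is assumed to exclude the query point `p` and to hold at least k points, and the neighborhood is taken to be the k nearest points including the one at `k_distance(p)`, so that it holds exactly k points as the numerator suggests. The function name `absolute_density` is hypothetical.

```python
import numpy as np

def absolute_density(p, D, k):
    """k divided by the summed distance from p to its k nearest points in D."""
    D = np.asarray(D, dtype=float)
    dists = np.linalg.norm(D - np.asarray(p, dtype=float), axis=1)
    nearest = np.sort(dists)[:k]   # distances to the k nearest neighbors of p
    return k / nearest.sum()

# The two nearest points lie at distances 1 and 2, so the density is 2 / 3.
print(absolute_density((0, 0), [(1, 0), (0, 2), (3, 0), (0, 10)], k=2))
```

A point surrounded by close neighbors has a small summed distance and therefore a high absolute density, while an isolated point has a low one.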

Step S104, homogeneous k neighbors and heterogeneous k neighbors are defined according to the sample distance. Given a data set D, D consists of a positive class data set $D^+$ and a negative class data set $D^-$. For an arbitrary point $o \in D^+$, the homogeneous k neighbors of o are represented as $\mathrm{HON}_k(o) = \{q \in D^+ \mid d(o, q) < k\_distance(o)\}$, and the heterogeneous k neighbors of o are represented as $\mathrm{HEN}_k(o) = \{q \in D^- \mid d(o, q) < k\_distance(o)\}$.

Step S105, the relative density is defined according to the absolute density and the homogeneous and heterogeneous k neighbors:

Step S106, the positive centroid and the negative centroid are defined. A given cluster $C_i$ is composed of the positive class sample set $C_i^+$ and the negative class sample set $C_i^-$. The positive class centroid is denoted $c^+$ and the negative class centroid is denoted $c^-$, where a centroid is the vector formed by the mean of each attribute over all samples of the set.

Step S107, the distance from each sample to its same-class centroid and to its different-class centroid is calculated. For a positive class sample, the homogeneous centroid is $c^+$ and the heterogeneous centroid is $c^-$; for a negative class sample, the homogeneous centroid is $c^-$ and the heterogeneous centroid is $c^+$.

Step S108, the improved relative density is defined. Given a data set D, D consists of a positive class data set $D^+$ and a negative class data set $D^-$. For any point $o^+ \in D^+$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$

For any point $o^- \in D^-$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$

Step S109, the value of K is changed and steps S101 to S108 are repeated to calculate the improved relative density of each sample at different granularities. The loop ends when the value of K exceeds the preset K value.

Step S1010, the improved relative densities of each sample obtained at all granularities are averaged; if the average is greater than the noise threshold, the sample is determined to be label noise.
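The overall flow of steps S101 to S1010 can be sketched as below. This is a minimal illustration under several assumptions that the text does not fix: a small NumPy Lloyd's loop stands in for a library KMeans implementation, labels are encoded as +1 and -1, distance is Euclidean, clusters that contain only one class are skipped because no heterogeneous centroid exists for them, and the default threshold of 1.0 is only a placeholder. All function and parameter names are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means; returns a cluster index for every sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def mgrd_scores(X, y, k_values, eps=1e-12):
    """Mean improved relative density of each sample over several values of K."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    total = np.zeros(len(X))
    count = np.zeros(len(X))
    for k in k_values:                      # one granularity per value of K
        clusters = kmeans_labels(X, k)
        for j in range(k):
            idx = np.flatnonzero(clusters == j)
            pos, neg = idx[y[idx] == 1], idx[y[idx] == -1]
            if len(pos) == 0 or len(neg) == 0:
                continue                    # assumption: skip single-class clusters
            c_pos, c_neg = X[pos].mean(axis=0), X[neg].mean(axis=0)
            d_pos = np.linalg.norm(X[idx] - c_pos, axis=1)
            d_neg = np.linalg.norm(X[idx] - c_neg, axis=1)
            is_pos = y[idx] == 1
            same = np.where(is_pos, d_pos, d_neg)
            other = np.where(is_pos, d_neg, d_pos)
            total[idx] += same / np.maximum(other, eps)
            count[idx] += 1
    # Samples never scored (always in single-class clusters) get a mean of 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

def detect_label_noise(X, y, k_values=(1, 2, 3), threshold=1.0):
    """Boolean mask marking samples whose mean score exceeds the threshold."""
    return mgrd_scores(X, y, k_values) > threshold

X = [[0, 0], [0, 2], [10, 0], [10, 2], [9, 1]]
y = [1, 1, -1, -1, 1]
print(detect_label_noise(X, y, k_values=(1,)).tolist())
# [False, False, False, False, True]
```

With the single granularity K = 1 the whole data set forms one cluster, so the result depends only on the two global class centroids and the mislabeled last sample is flagged. For larger K the outcome depends on the clustering, and in practice both the range of K and the threshold would need to be tuned.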

FIG. 2 shows an example of label noise:

FIG. 2 gives label noise points for two cases, inside and outside a data cluster: O1 and O2. Analysis yields the following characteristic: the absolute density of the label noise points O1 and O2 among samples of their own class is smaller than their absolute density among samples of the other class. This motivates the concepts of homogeneous k neighbors and heterogeneous k neighbors, and in turn a relative density model for detecting class noise.

When the relative density of a sample point is greater than 1, the sample o is closer to the heterogeneous samples than to the homogeneous samples, and the sample point is likely to be class noise. By continuously increasing the relative density threshold (RD-threshold), the learning algorithm can obtain an optimal relative density threshold.

The following table compares the accuracy of the method of the invention (MGRD) with that of the conventional relative density algorithm under an SVM classifier. In most cases MGRD achieves an accuracy no lower than that of the original classification algorithm and comparable to that of the original relative density algorithm (RD Method), and it is slightly better than the RD Method on several data sets such as clearcancer.

The following table compares the running time of the method of the invention (MGRD) with that of the conventional relative density algorithm under an SVM classifier. On small data sets the running time of MGRD differs little from that of the RD Method, but on large data sets MGRD is far faster, especially on the codna data set.

The sizes of the data sets are shown in the following table:

Finally, it is noted that the above preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims

1. A label noise detection method based on multi-granularity relative density is characterized by comprising the following steps:

dividing a data set D into K clusters, wherein the data of each cluster comprise a positive class sample set $C_i^+$ and a negative class sample set $C_i^-$;

calculating an improved relative density for each sample;

changing the K value, and repeating the two steps;

averaging all the improved relative densities for each sample;

2. The label noise detection method based on multi-granularity relative density according to claim 1, characterized in that: the calculating the improved relative density comprises the steps of:

calculating the distance from each sample to the same-class mass center and the heterogeneous mass center of each sample;

the ratio of the two distances obtained above is taken as the improved relative density.

3. The label noise detection method based on multi-granularity relative density according to claim 2, characterized in that: the improved relative density is calculated by the following method:

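D is composed of a positive class data set $D^+$ and a negative class data set $D^-$;

for any point $o^+ \in D^+$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$

for any point $o^- \in D^-$, the improved relative density is expressed as:

$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$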

4. The label noise detection method based on multi-granularity relative density according to any one of claims 1 to 3, characterized in that: the KMeans algorithm is used to divide the data set D into K clusters.