CN111178387A - Label noise detection method based on multi-granularity relative density - Google Patents
- Publication number
- CN111178387A
- Authority
- CN
- China
- Prior art keywords
- relative density
- sample
- improved relative
- distance
- granularity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The invention discloses a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification. The method comprises the following steps. S1: divide the data set into K clusters using the KMeans algorithm and calculate the improved relative density of each sample at that granularity. To obtain the improved relative density, the centroids of the positive and negative samples are first calculated; the distances from each sample to its same-class centroid and to its different-class centroid are then calculated, and the ratio of these two distances is taken as the improved relative density at that granularity. S2: change the value of K and repeat S1 to calculate the improved relative density of each sample at different granularities. S3: treat samples whose improved relative density exceeds a given threshold as label noise. The invention introduces granularity calculation into the improved relative density model and is more efficient than conventional methods.
Description
Technical Field
The invention relates to a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification.
Background
Real-world data is always imperfect, and noisy data is one consequence. Noise handling is an important task in machine learning. In classification problems, noise falls mainly into two categories: attribute noise and label noise. Attribute noise is caused by errors made when attribute values are entered, while label noise is caused by corrupted labels. Label noise is generally more harmful than attribute noise. First, a sample may have many features but has only one label. Second, although each feature has its own importance, the label always has a greater impact on learning. Label noise degrades classifier performance and increases model complexity. Label noise also differs substantially from outliers. Outlier detection is an important step in many data mining tasks, yet outlying samples sometimes have no effect on the classification result, whereas label noise does. In many practical scenarios, detecting and handling noise has proven beneficial, and it has become an important area of machine learning.
Label noise refers to samples whose labels are recorded incorrectly; it is mainly caused by insufficient information and by coding and communication errors. In practice, label noise is common. First, the information provided to experts may be insufficient for reliable labeling. For example, in many interactive Web programs, labels are obtained through real-time user feedback. In the medical field, test results are often unknown or incomplete, and the information conveyed in medical language can be too limited; such incompleteness may also cause label noise. Furthermore, in some cases the information is of poor quality or uncertain accuracy: a patient may answer inaccurately during an illness, and may answer differently even when asked the same question repeatedly, which also readily leads to label noise. Second, manual labeling itself may be in error. Such errors are not always caused by human experts, since automatic classification devices are now also used in various scenarios. In addition, because collecting reliable labels is time-consuming and expensive, obtaining labels from non-experts is common, but such labels are less reliable. Third, labeling can be subjective: in medical image analysis, for example, different experts may assign different labels according to their own judgment of the case, which may also produce label noise. Fourth, label noise may arise simply from data encoding or communication problems.
Label noise affects classifier learning in several ways. Even when label noise is uniformly distributed, most loss functions are not robust to it, including the exponential loss of AdaBoost, the logarithmic loss of logistic regression, and the hinge loss of SVM. AdaBoost spends much of its effort learning the label noise because it increases the weights of misclassified samples. Label noise also affects the learning of BP neural networks and decision trees, the choice of k in the kNN algorithm, and the choice of kernel parameters in kernel classifiers.
Noise samples seriously reduce the efficiency of data mining modeling and can even bias the mining results. They often make the classifier inefficient, cause overfitting, and seriously degrade its performance. The data therefore need a series of preprocessing steps before classifier learning. To apply label noise detection to classifier optimization, the most important requirement is to detect label noise efficiently. To date, two main families of label noise detection techniques have been studied extensively: classifier-based and distance-based. The first family depends on training a specific classifier, so it is difficult to guarantee efficient and reliable detection of label noise, which limits these conventional methods. The second family exploits the similarity between label noise and anomalies, so some distance-based anomaly detection techniques can be used to detect label noise. However, these anomaly detection methods are unsupervised: they cannot fully exploit the contrast between classes that characterizes label noise, and they handle complex data, such as high-dimensional and imbalanced data, poorly.
Disclosure of Invention
The invention aims to provide a label noise detection method based on multi-granularity relative density, which introduces granularity calculation into an improved relative density model and is an efficient label noise detection method.
In order to achieve the purpose, the invention adopts the technical scheme that:
A label noise detection method based on multi-granularity relative density comprises the following steps:
The data set D is divided into K clusters using the KMeans algorithm; the data of each cluster $C_i$ comprise a positive-class sample set $C_i^+$ and a negative-class sample set $C_i^-$.
The improved relative density of each sample is calculated.
The value of K is changed, and the two steps above are repeated.
The mean of all improved relative densities obtained for each sample is taken.
The mean is compared with a noise threshold; the sample is label noise if its mean is greater than the noise threshold.
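The last two steps above (averaging over granularities and comparing with the noise threshold) can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the function name `flag_label_noise` and the list-of-lists layout of the per-granularity scores are assumptions made for the example.

```python
from statistics import mean

def flag_label_noise(densities_per_granularity, noise_threshold):
    """Return indices of samples whose mean improved relative density
    exceeds the noise threshold.

    densities_per_granularity[g][i] is the improved relative density of
    sample i at granularity g (hypothetical layout, one inner list per K).
    """
    n_samples = len(densities_per_granularity[0])
    flagged = []
    for i in range(n_samples):
        # Average the scores of sample i over all granularities.
        avg = mean(run[i] for run in densities_per_granularity)
        if avg > noise_threshold:
            flagged.append(i)
    return flagged

# Three granularities (three values of K), four samples; made-up scores.
scores = [
    [0.25, 1.5, 0.5, 2.0],
    [0.25, 0.5, 0.5, 4.0],
    [0.25, 1.0, 0.5, 3.0],
]
noisy = flag_label_noise(scores, noise_threshold=1.0)
```

The comparison is strict, following the wording "greater than the noise threshold", so a sample whose mean equals the threshold is not flagged.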
Calculating the improved relative density comprises the following steps:
calculating the centroids of the positive-class sample set $C_i^+$ and the negative-class sample set $C_i^-$ respectively, the positive-class centroid being denoted $c^+$ and the negative-class centroid $c^-$;
calculating the distances from each sample to its same-class centroid and to its different-class centroid;
taking the ratio of the two distances obtained above as the improved relative density. Specifically, D is composed of a positive-class data set $D^+$ and a negative-class data set $D^-$;
for any point $o^+ \in D^+$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$
for any point $o^- \in D^-$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$
where $\mathrm{distance}(o^+, c^+)$ and $\mathrm{distance}(o^+, c^-)$ denote the distances from $o^+$ to its same-class centroid and its different-class centroid respectively, and $\mathrm{distance}(o^-, c^-)$ and $\mathrm{distance}(o^-, c^+)$ denote the distances from $o^-$ to its same-class centroid and its different-class centroid respectively.
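As a worked illustration of the distance ratio described above, assume Euclidean distance and two hypothetical centroids $c^+ = (0, 0)$ and $c^- = (4, 0)$; the distance measure and the coordinates are assumptions for this example only. For a positive-class point $a = (1, 0)$ and a positive-class point $b = (3, 0)$:

$$
\frac{\mathrm{distance}(a, c^+)}{\mathrm{distance}(a, c^-)} = \frac{1}{3} \approx 0.33,
\qquad
\frac{\mathrm{distance}(b, c^+)}{\mathrm{distance}(b, c^-)} = \frac{3}{1} = 3 .
$$

Point $a$ lies nearer its own class centroid and scores below 1, while point $b$ lies nearer the other class centroid and scores above 1, so $b$ is the candidate for label noise.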
The invention adopting this technical scheme has the following beneficial effects:
(1) at each granularity, the result is the ratio of the distances from a sample to its same-class centroid and to its different-class centroid, whereas the original relative density model must calculate the distance between each sample and all other samples; the time cost is therefore greatly reduced, which solves the low speed of the original relative density algorithm;
(2) introducing multiple granularities into the relative density model reduces the risk of overfitting and gives the model better generalization capability.
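A rough count of distance evaluations illustrates effect (1). The figures assume a hypothetical data set of $n = 10{,}000$ samples and deliberately ignore the cost of the KMeans clustering itself, so they are an order-of-magnitude sketch rather than a measured result:

$$
\underbrace{\frac{n(n-1)}{2}}_{\text{all sample pairs}} = \frac{10000 \times 9999}{2} = 49{,}995{,}000
\qquad \text{versus} \qquad
\underbrace{2n}_{\text{two centroid distances per sample}} = 20{,}000 \ \text{per granularity}.
$$

With $G$ different values of K, the centroid-based count is $2nG$, which remains linear in $n$.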
Drawings
To illustrate the embodiments of the invention and the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are only embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the label noise detection method based on multi-granularity relative density (MGRD);
FIG. 2 is a schematic diagram of label noise.
Detailed Description
The invention relates to a label noise detection method based on multi-granularity relative density. Handling label noise is a research hotspot in machine learning and spans many practical application fields. Currently, the mainstream approaches are noise accommodation and noise filtering. When training data is contaminated with label noise, one obvious solution is to clean the training data itself, much as in outlier or anomaly detection. The present scheme belongs to noise filtering.
S1: divide the data set into K clusters and calculate the improved relative density of each sample at that granularity. To obtain the improved relative density, the centroids of the positive and negative samples are first calculated; the distances from each sample to its same-class centroid and to its different-class centroid are then calculated, and the ratio of these two distances is taken as the improved relative density at that granularity. S2: divide the data set into K clusters again, with a value of K different from the previous one, and repeat S1 to calculate the improved relative density of each sample at different granularities. S3: treat samples whose improved relative density exceeds a given threshold as label noise. The invention introduces granularity calculation into the improved relative density model and is more efficient than conventional methods.
As shown in fig. 1, the specific implementation flow of the present invention is as follows:
Step S101: divide the data set D into K clusters using the KMeans algorithm; the data of each cluster $C_i$ comprise a positive-class sample set $C_i^+$ and a negative-class sample set $C_i^-$.
Step S102: the distance between two samples $x_i$ and $x_j$ is defined as:
Step S103: define the absolute density according to the sample distance. Given a data set D, for any $p \in D$ and $q \in D$, $d(p, q)$ denotes the distance from p to q and is an abbreviation of $\mathrm{distance}(p, q)$. The neighbors of p are denoted $N_k(p)$, where $N_k(p) = \{ q \in D \mid d(p, q) < k\_\mathrm{distance}(p) \}$, k is the number of neighboring points, and $k\_\mathrm{distance}(p)$ is the distance between p and the farthest of its k neighboring points. The absolute density of p is expressed as: $\mathrm{AbsoluteDensity}(p, D) = k \big/ \sum_{q \in N_k(p)} d(p, q)$;
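The absolute density of step S103 can be sketched in a few lines. The sketch assumes Euclidean distance (the distance formula of step S102 is not reproduced in this text) and takes the neighbors of p to be its k nearest points, including the one at k_distance(p); both choices, and the function name, are assumptions for illustration.

```python
import math

def absolute_density(p, data, k):
    """k divided by the summed distance from p to its k nearest neighbours.

    p must be an element of data; it is skipped by identity so that it is
    not counted as its own neighbour.
    """
    dists = sorted(math.dist(p, q) for q in data if q is not p)
    nearest = dists[:k]          # fewer than k if the data set is tiny
    total = sum(nearest)
    # Coincident neighbours give a zero denominator; treat as infinite density.
    return math.inf if total == 0 else len(nearest) / total

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
density = absolute_density(data[0], data, k=2)  # neighbours at distances 1 and 2
```

A point in a tight group has small neighbor distances and therefore a large absolute density; an isolated point has a small one.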
Step S104: define the homogeneous k neighbors and heterogeneous k neighbors according to the sample distance. Given a data set D, D consists of a positive-class data set $D^+$ and a negative-class data set $D^-$. For an arbitrary point $o \in D^+$, the homogeneous k neighbors of o are $\mathrm{HON}_k(o) = \{ q \in D^+ \mid d(o, q) < k\_\mathrm{distance}(o) \}$, and the heterogeneous k neighbors of o are $\mathrm{HEN}_k(o) = \{ q \in D^- \mid d(o, q) < k\_\mathrm{distance}(o) \}$.
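One possible reading of step S104 is sketched below: the homogeneous neighbors are the k nearest points carrying the same label as o, and the heterogeneous neighbors are the k nearest points carrying the other label. This per-class interpretation of k_distance(o), the Euclidean distance, and the function name are assumptions, not details confirmed by the text.

```python
import math

def split_k_neighbours(o, o_label, samples, labels, k):
    """Return (homogeneous, heterogeneous) k nearest neighbours of o.

    o must be an element of samples; it is skipped by identity.
    """
    same, other = [], []
    for q, lab in zip(samples, labels):
        if q is o:
            continue
        (same if lab == o_label else other).append(q)

    def dist_to_o(q):
        return math.dist(o, q)

    # sorted() is stable, so ties keep their original order.
    return sorted(same, key=dist_to_o)[:k], sorted(other, key=dist_to_o)[:k]

samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 4.0), (0.5, 0.5)]
labels = [1, 1, 1, -1, -1, -1]
hon, hen = split_k_neighbours(samples[0], 1, samples, labels, k=2)
```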
Step S105: define the relative density according to the absolute density and the homogeneous and heterogeneous k neighbors:
Step S106: define the positive centroid and the negative centroid. Given a cluster $C_i$ composed of the positive-class sample set $C_i^+$ and the negative-class sample set $C_i^-$, the positive-class centroid is denoted $c^+$ and the negative-class centroid $c^-$. A centroid is the vector formed by the mean of each attribute over all samples in the set.
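The centroid definition of step S106 (the mean of each attribute) can be sketched as follows. The function names and the convention of returning `None` for a class that is absent from the cluster are assumptions for this example; the text does not say how a single-class cluster is handled.

```python
def centroid(samples):
    """Attribute-wise mean of a non-empty list of equal-length samples."""
    n = len(samples)
    return tuple(sum(column) / n for column in zip(*samples))

def class_centroids(cluster_samples, cluster_labels, positive=1):
    """Return (c_plus, c_minus) for one cluster; None where a class is absent."""
    pos = [x for x, y in zip(cluster_samples, cluster_labels) if y == positive]
    neg = [x for x, y in zip(cluster_samples, cluster_labels) if y != positive]
    return (centroid(pos) if pos else None, centroid(neg) if neg else None)

cluster = [(0, 0), (2, 0), (4, 4), (6, 4)]
cluster_labels = [1, 1, -1, -1]
c_plus, c_minus = class_centroids(cluster, cluster_labels)
```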
Step S107: calculate the distance from each sample to its same-class centroid and to its different-class centroid. For a positive-class sample, the same-class centroid is $c^+$ and the different-class centroid is $c^-$; for a negative-class sample, the same-class centroid is $c^-$ and the different-class centroid is $c^+$.
Step S108: define the improved relative density. Given a data set D, D consists of a positive-class data set $D^+$ and a negative-class data set $D^-$. For any point $o^+ \in D^+$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$
For any point $o^- \in D^-$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$
where $\mathrm{distance}(o^+, c^+)$ and $\mathrm{distance}(o^+, c^-)$ denote the distances from $o^+$ to its same-class centroid and its different-class centroid respectively, and $\mathrm{distance}(o^-, c^-)$ and $\mathrm{distance}(o^-, c^+)$ denote the distances from $o^-$ to its same-class centroid and its different-class centroid respectively.
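Steps S107 and S108 reduce to a single ratio per sample, which might be written as below. Euclidean distance, the function name, and the choice to return infinity when a sample coincides with the different-class centroid are assumptions of this sketch.

```python
import math

def improved_relative_density(sample, same_centroid, other_centroid):
    """Distance to the same-class centroid divided by the distance to the
    different-class centroid."""
    d_same = math.dist(sample, same_centroid)
    d_other = math.dist(sample, other_centroid)
    # A sample sitting exactly on the other class centroid is maximally suspicious.
    return math.inf if d_other == 0 else d_same / d_other

c_plus, c_minus = (0.0, 0.0), (4.0, 0.0)
well_placed = improved_relative_density((1.0, 0.0), c_plus, c_minus)   # labelled positive
suspicious = improved_relative_density((3.0, 0.0), c_plus, c_minus)    # labelled positive
# The same point labelled negative swaps the roles of the two centroids.
relabelled = improved_relative_density((3.0, 0.0), c_minus, c_plus)
```

Only two distances are needed per sample, regardless of how many samples the cluster holds.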
Step S109: change the value of K and repeat steps S101 to S108 to calculate the improved relative density of each sample at different granularities. The loop ends when K exceeds the preset value of K.
Step S110: average the improved relative densities obtained for each sample at all granularities; if the average is greater than the noise threshold, the sample is determined to be label noise.
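The whole flow of steps S101 to S110 can be sketched end to end with the standard library alone. This is an illustrative sketch under several assumptions, not the patented implementation: a plain Lloyd-style k-means stands in for the KMeans algorithm, distance is Euclidean, a cluster containing only one class contributes no score, and the names `kmeans_assign` and `mgrd_scores` are hypothetical.

```python
import math
import random

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans_assign(X, k, rng, iters=20):
    """Plain Lloyd iterations; returns a cluster index for each sample."""
    centers = rng.sample(X, k)
    assign = [0] * len(X)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = centroid(members)
    return assign

def mgrd_scores(X, y, k_values, seed=0):
    """Mean improved relative density of each sample over all K in k_values."""
    rng = random.Random(seed)
    totals = [0.0] * len(X)
    counts = [0] * len(X)
    for k in k_values:                       # one granularity per K
        assign = kmeans_assign(X, k, rng)
        for j in range(k):
            idx = [i for i, a in enumerate(assign) if a == j]
            pos = [X[i] for i in idx if y[i] == 1]
            neg = [X[i] for i in idx if y[i] != 1]
            if not pos or not neg:
                continue  # assumption: a single-class cluster yields no score
            c_pos, c_neg = centroid(pos), centroid(neg)
            for i in idx:
                same, other = (c_pos, c_neg) if y[i] == 1 else (c_neg, c_pos)
                d_other = math.dist(X[i], other)
                ratio = math.inf if d_other == 0 else math.dist(X[i], same) / d_other
                totals[i] += ratio
                counts[i] += 1
    # Samples that never received a score default to 0.0 (assumption).
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

# Two well-separated groups; the last sample sits in the positive group
# but carries a negative label.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
     (0.4, 0.6)]
y = [1, 1, 1, 1, -1, -1, -1, -1, -1]

scores = mgrd_scores(X, y, k_values=[1, 2, 3])
noise_threshold = 1.0  # hypothetical value
noisy = [i for i, s in enumerate(scores) if s > noise_threshold]
```

With more than one cluster the outcome depends on the k-means initialisation, so the scores for K greater than 1 vary with the seed; the single-cluster case is deterministic and already separates the mislabelled sample from the rest in this toy data.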
FIG. 2 is an example of label noise:
FIG. 2 shows label noise points for two cases, inside and outside the data cluster: O1 and O2. Analysis yields the following characteristic: the absolute density of the label noise points O1 and O2 among same-class samples is smaller than their absolute density among different-class samples. This motivates the concepts of homogeneous k neighbors and heterogeneous k neighbors, and in turn a relative density model for detecting label noise.
When the relative density of a sample point o is greater than 1, o is closer to the different-class samples than to the same-class samples and is likely to be label noise. By gradually increasing the relative density threshold (RD-threshold), the learning algorithm can obtain an optimal relative density threshold.
The following table compares the accuracy of the method of the invention (MGRD) with that of the conventional relative density algorithm under an SVM classifier. In most cases MGRD achieves accuracy no lower than the original classification algorithm and comparable to the original relative density algorithm (RD Method), and it is slightly better on several datasets such as clearcancer.
The following table compares the running time of MGRD with that of the conventional relative density algorithm under an SVM classifier. On small datasets MGRD and RD Method differ little in running time, but on large datasets MGRD is far faster, especially on the codna dataset.
Wherein the size of the data set is as shown in the following table:
finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.
Claims (4)
1. A label noise detection method based on multi-granularity relative density is characterized by comprising the following steps:
dividing a data set D into K clusters, wherein the data of each cluster $C_i$ comprise a positive-class sample set $C_i^+$ and a negative-class sample set $C_i^-$;
Calculating an improved relative density for each sample;
changing the K value, and repeating the two steps;
averaging all the improved relative densities for each sample;
the mean is compared to a noise threshold, and the sample is noise if the mean is greater than the noise threshold.
2. The label noise detection method based on multi-granularity relative density according to claim 1, characterized in that: the calculating the improved relative density comprises the steps of:
calculating the centroids of the positive-class sample set $C_i^+$ and the negative-class sample set $C_i^-$ respectively, the positive-class centroid being denoted $c^+$ and the negative-class centroid $c^-$;
calculating the distances from each sample to its same-class centroid and to its different-class centroid;
taking the ratio of the two distances obtained above as the improved relative density.
3. The label noise detection method based on multi-granularity relative density according to claim 2, characterized in that: the improved relative density is calculated by the following method:
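D is composed of a positive-class data set $D^+$ and a negative-class data set $D^-$;
for any point $o^+ \in D^+$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^+) = \frac{\mathrm{distance}(o^+, c^+)}{\mathrm{distance}(o^+, c^-)}$$
for any point $o^- \in D^-$, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^-) = \frac{\mathrm{distance}(o^-, c^-)}{\mathrm{distance}(o^-, c^+)}$$
where $\mathrm{distance}(o^+, c^+)$ and $\mathrm{distance}(o^+, c^-)$ denote the distances from $o^+$ to its same-class centroid and its different-class centroid respectively, and $\mathrm{distance}(o^-, c^-)$ and $\mathrm{distance}(o^-, c^+)$ denote the distances from $o^-$ to its same-class centroid and its different-class centroid respectively.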
4. The label noise detection method based on multi-granularity relative density according to any one of claims 1 to 3, characterized in that: the KMeans algorithm is used to divide the data set D into K clusters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911222298.XA CN111178387A (en) | 2019-12-03 | 2019-12-03 | Label noise detection method based on multi-granularity relative density |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111178387A (en) | 2020-05-19 |
Family
ID=70653780
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911222298.XA Pending CN111178387A (en) | 2019-12-03 | 2019-12-03 | Label noise detection method based on multi-granularity relative density |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178387A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114638322A (en) * | 2022-05-20 | 2022-06-17 | 南京大学 | Full-automatic target detection system and method based on given description in open scene |
CN114638322B (en) * | 2022-05-20 | 2022-09-13 | 南京大学 | Full-automatic target detection system and method based on given description in open scene |
CN117313900A (en) * | 2023-11-23 | 2023-12-29 | 全芯智造技术有限公司 | Method, apparatus and medium for data processing |
CN117313900B (en) * | 2023-11-23 | 2024-03-08 | 全芯智造技术有限公司 | Method, apparatus and medium for data processing |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200519 |