CN111178387A - Label noise detection method based on multi-granularity relative density - Google Patents

Label noise detection method based on multi-granularity relative density

Info

Publication number
CN111178387A
Authority
CN
China
Prior art keywords
relative density
sample
improved relative
distance
granularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911222298.XA
Other languages
Chinese (zh)
Inventor
夏书银
梁潇
刘群
王炳贵
陈百云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201911222298.XA
Publication of CN111178387A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133: Distances to prototypes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention discloses a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification. The method comprises the following steps. S1: the data set is divided into K clusters using the KMeans algorithm, and the improved relative density of each sample is calculated at that granularity. The improved relative density is obtained by first calculating the centroids of the positive and negative samples, then calculating the distance from each sample to its same-class (homogeneous) centroid and to its different-class (heterogeneous) centroid, and taking the ratio of the two distances as the improved relative density at that granularity. S2: the value of K is changed and the process in S1 is repeated, giving the improved relative density of each sample at different granularities. S3: samples whose improved relative density exceeds a given threshold are taken as label noise. The invention introduces granularity calculation into the improved relative density model, which is more efficient than conventional methods.

Description

Label noise detection method based on multi-granularity relative density
Technical Field
The invention relates to a label noise detection method based on multi-granularity relative density, and belongs to the field of data classification.
Background
Real-world data is always imperfect, and noisy data is one consequence of that imperfection. Handling noise is an important task in machine learning. In classification problems, noise falls mainly into two categories: attribute noise and label noise. Attribute noise is caused by errors made when attribute values are entered, while label noise is caused by label contamination. Label noise is generally more harmful than attribute noise. First, a sample may have many features but only one label. Second, although each feature has its own importance, the label always has a greater impact on learning. The presence of label noise reduces classifier performance and increases model complexity. Label noise also differs considerably from outlier noise. Outlier detection is an important step in many data mining tasks, but outlying samples sometimes have no effect on the classification result, whereas label noise does. In many practical scenarios, detecting and handling noise has proven beneficial, and it has become an important area of machine learning.
Label noise refers to samples whose labels are not correctly recorded, and is mainly caused by insufficient information and by coding and communication errors. In practice, such noise is common. First, the information provided to experts may not be sufficient for reliable labeling. For example, in many interactive Web programs, data labels are obtained from real-time user feedback. In the medical field, test results are often unknown or incomplete, and the information expressed in medical language may be too limited; such incompleteness can also cause label noise. Furthermore, in some cases the quality of the information is poor or its accuracy is uncertain. For example, a patient may answer incorrectly during an illness, and may give different answers even when asked the same question repeatedly, which also easily leads to label noise. Second, manual labeling itself may be in error. Such classification errors are not always caused by human experts, since automatic classification devices are now also used in various scenarios. Furthermore, since collecting reliable labels is a time-consuming and expensive task, it is common to obtain labels from annotators who rely on their own experience, but labels provided in this way are less reliable. Third, when the labeling task is subjective, for example in medical image analysis, some experts may change labels according to their own judgment of the actual conditions, which can also introduce label noise. Fourth, label noise may come simply from data coding or communication problems.
Label noise affects the learning performance of a classifier in several ways. Even when label noise is uniformly distributed, most loss functions are not robust to it, including the exponential loss of AdaBoost, the logarithmic loss of logistic regression, and the hinge loss of SVM. AdaBoost spends much of its training time fitting label noise because it increases the weight of misclassified samples. Label noise also affects the learning process of BP neural networks and decision trees, the selection of the k value in the kNN algorithm, and the selection of kernel parameters in kernel classifiers.
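As a small illustration of why these losses are sensitive to label noise, the sketch below evaluates the three losses named above on the margin m = y·f(x) for one confidently classified sample, once with its correct label and once with the label flipped. The score value 3.0 and the function names are assumptions chosen for illustration, not part of the patent.

```python
import math

def exponential_loss(margin):
    """Exponential loss used by AdaBoost: exp(-m)."""
    return math.exp(-margin)

def logistic_loss(margin):
    """Logarithmic loss of logistic regression: log(1 + exp(-m))."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss of the SVM: max(0, 1 - m)."""
    return max(0.0, 1.0 - margin)

score = 3.0  # assumed classifier output f(x) for a sample whose true label is +1
for name, loss in [("exponential", exponential_loss),
                   ("logistic", logistic_loss),
                   ("hinge", hinge_loss)]:
    clean = loss(+1 * score)   # correct label y = +1
    noisy = loss(-1 * score)   # flipped label y = -1
    print(f"{name:12s} clean={clean:.3f} flipped={noisy:.3f}")
```

Flipping the label negates the margin, so every one of the three losses grows; the exponential loss grows fastest, which fits the remark that AdaBoost keeps increasing the weight of such samples.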
Noise samples seriously reduce the efficiency of data mining modeling and can even bias the mining results. They often make the classifier inefficient, cause overfitting, and seriously degrade classifier performance. The data therefore needs to be preprocessed before the classifier is trained. To apply label noise detection to classifier optimization, the most important requirement is to detect label noise efficiently. To date, two major kinds of label noise detection techniques have been extensively studied: classifier-based and distance-based. The first kind relies on training a specific classifier, so it is difficult to guarantee efficient and reliable detection of label noise, and such conventional methods have certain limitations. For the second kind, since label noise and anomalies share some similarity, some distance-based anomaly detection techniques can be used to detect label noise. However, these anomaly detection methods are unsupervised, cannot fully exploit the contrast between classes that characterizes label noise, and handle complex data, such as high-dimensional data and imbalanced data, poorly.
Disclosure of Invention
The invention aims to provide a label noise detection method based on multi-granularity relative density, which introduces granularity calculation into an improved relative density model and is an efficient label noise detection method.
To achieve this purpose, the invention adopts the following technical scheme:
A label noise detection method based on multi-granularity relative density comprises the following steps:
The data set D is divided into K clusters using the KMeans algorithm, and the data of each cluster comprises a positive class sample set Ci+ and a negative class sample set Ci-.
The improved relative density of each sample is calculated.
The value of K is changed, and the two steps above are repeated.
The mean of all improved relative densities of each sample is taken.
The mean is compared with a noise threshold, and the sample is label noise if the mean is greater than the noise threshold.
Calculating the improved relative density comprises the following steps:
The centroids of the positive class sample set Ci+ and the negative class sample set Ci- are calculated respectively; the positive class centroid is denoted c+ and the negative class centroid is denoted c-.
The distances from each sample to its same-class (homogeneous) centroid and to its different-class (heterogeneous) centroid are calculated.
The ratio of the two distances is used as the improved relative density. Specifically, D is composed of a positive class data set D+ and a negative class data set D-.
For any point o+ ∈ D+, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{+}) = \frac{\mathrm{distance}(o^{+}, c^{+})}{\mathrm{distance}(o^{+}, c^{-})}$$
For any point o- ∈ D-, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{-}) = \frac{\mathrm{distance}(o^{-}, c^{-})}{\mathrm{distance}(o^{-}, c^{+})}$$
Here distance(o+, c+) and distance(o+, c-) are the distances from o+ to its homogeneous centroid and its heterogeneous centroid, and distance(o-, c-) and distance(o-, c+) are the distances from o- to its homogeneous centroid and its heterogeneous centroid.
The invention adopting the technical scheme has the beneficial effects that:
(1) The ratio of the distances from each sample to its homogeneous centroid and its heterogeneous centroid is used as the result at each granularity, whereas the original relative density model has to calculate the distance between each sample and all other samples. The time cost is therefore greatly reduced, which solves the low speed of the original relative density algorithm.
(2) Multiple granularities are introduced into the relative density model, which reduces the risk of overfitting and gives the model better generalization ability.
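The time saving claimed in (1) can be made concrete with a rough count of distance evaluations. The count below is a sketch under stated assumptions: n is the number of samples, G is the number of K values tried, and both symbols are introduced here for illustration. It counts only sample-to-sample and sample-to-centroid distances and leaves out the cost of running KMeans and of computing the centroids.

```latex
\begin{aligned}
\text{original relative density:}\quad & \text{up to } n(n-1) \text{ distance evaluations} = O(n^{2}) \\
\text{improved relative density:}\quad & 2nG \text{ distance evaluations} = O(nG) \\
n = 10^{4},\ G = 5:\quad & 10^{4} \cdot 9999 = 99{,}990{,}000 \quad\text{versus}\quad 2 \cdot 10^{4} \cdot 5 = 100{,}000
\end{aligned}
```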
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are only embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the label noise detection method based on multi-granularity relative density (MGRD);
FIG. 2 is a schematic diagram of label noise.
Detailed Description
The invention relates to a label noise detection method based on multi-granularity relative density. Handling label noise is a research hotspot in the field of machine learning and is relevant to many practical application fields. Currently, the mainstream approaches are divided into noise accommodation and noise filtering. When training data is contaminated with label noise, one obvious solution is to clean the training data itself, similar to outlier or anomaly detection. The present scheme belongs to noise filtering.
S1: the data set is divided into K clusters, and the improved relative density of each sample is calculated at that granularity. The improved relative density is obtained by first calculating the centroids of the positive and negative samples, then calculating the distance from each sample to its homogeneous centroid and to its heterogeneous centroid, and taking the ratio of the two distances as the improved relative density at that granularity. S2: the data set is divided into K clusters again, with a K value different from the previous one, and the process in S1 is repeated, giving the improved relative density of each sample at different granularities. S3: samples whose improved relative density exceeds a given threshold are taken as label noise. The invention introduces granularity calculation into the improved relative density model, which is more efficient than conventional methods.
As shown in fig. 1, the specific implementation flow of the present invention is as follows:
Step S101, the data set D is divided into K clusters using the KMeans algorithm, and the data of each cluster comprises a positive class sample set Ci+ and a negative class sample set Ci-.
Step S102, the distance between two samples x_i and x_j is defined as:
[Figure BDA0002301186510000041: equation image, distance between x_i and x_j]
Step S103, the absolute density is defined from the sample distance. Given a data set D, for any p ∈ D and q ∈ D, d(p, q) denotes the distance from p to q and is an abbreviation of distance(p, q). The neighbors of p are denoted N_k(p), where N_k(p) = {q ∈ D | d(p, q) < k_distance(p)}, k is the number of neighboring points, and k_distance(p) is the distance between p and the farthest of its k neighboring points. The absolute density of p is expressed as:
$$\mathrm{AbsoluteDensity}(p, D) = \frac{k}{\sum_{q \in N_k(p)} d(p, q)}$$
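The absolute density above can be sketched in a few lines. This is a minimal illustration under two assumptions that the text does not confirm: the distance is taken to be Euclidean (the distance formula of step S102 is only available as an image), and the neighbor set is taken to be the k nearest points, including the k-th one. The function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Assumed distance measure: plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def absolute_density(p, data, k):
    """k divided by the summed distance from p to its k nearest neighbours in data."""
    # p is expected to be an element of data; it is excluded by identity.
    dists = sorted(euclidean(p, q) for q in data if q is not p)
    total = sum(dists[:k])
    return k / total if total > 0 else math.inf

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
# Nearest two neighbours of (0, 0) lie at distances 1 and 2, so the density is 2 / 3.
print(absolute_density(data[0], data, k=2))
```

A point surrounded by close neighbours gets a large value, and an isolated point gets a small one.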
Step S104, homogeneous k neighbors and heterogeneous k neighbors are defined from the sample distance. Given a data set D consisting of a positive class data set D+ and a negative class data set D-, for an arbitrary point o ∈ D+, the homogeneous k neighbors of o are HON_k(o) = {q ∈ D+ | d(o, q) < k_distance(o)}, and the heterogeneous k neighbors of o are HEN_k(o) = {q ∈ D- | d(o, q) < k_distance(o)}.
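One way to read these two definitions is to take the k nearest neighbors of a sample and split them by whether their label matches the sample's own label. The sketch below does exactly that; it assumes Euclidean distance, labels encoded as +1 and -1, and that the k-th neighbor itself is included, none of which is fixed by the text. The function name is hypothetical.

```python
import math

def euclidean(p, q):
    """Assumed distance measure: plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def split_k_neighbors(i, samples, labels, k):
    """Return (homogeneous, heterogeneous) index lists among the k nearest neighbours of sample i."""
    ranked = sorted(
        (euclidean(samples[i], samples[j]), j)
        for j in range(len(samples)) if j != i
    )
    nearest = [j for _, j in ranked[:k]]
    homogeneous = [j for j in nearest if labels[j] == labels[i]]
    heterogeneous = [j for j in nearest if labels[j] != labels[i]]
    return homogeneous, heterogeneous

samples = [(0, 0), (1, 0), (0, 1), (2, 2), (5, 5)]
labels = [1, 1, -1, -1, 1]
print(split_k_neighbors(0, samples, labels, k=3))
```

For sample 0, the three nearest neighbours are samples 1, 2 and 3, so one neighbour is homogeneous and two are heterogeneous.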
Step S105, the relative density is defined from the absolute density and the homogeneous and heterogeneous k neighbors:
[Figure BDA0002301186510000042: equation image, relative density]
Step S106, the positive centroid and the negative centroid are defined. A given cluster Ci consists of a positive class sample set Ci+ and a negative class sample set Ci-. The positive class centroid is denoted c+ and the negative class centroid is denoted c-. Each centroid is the vector of per-attribute means over all samples of that class.
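The centroid definition, the mean of each attribute over the samples of one class, can be sketched as follows. Labels encoded as +1 and -1 and the function names are assumptions made for this example; returning None for an empty class is also a choice made here, since the text does not say how a cluster containing only one class is handled.

```python
def centroid(points):
    """Per-attribute mean of a list of equal-length tuples, or None if the list is empty."""
    if not points:
        return None
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def class_centroids(samples, labels):
    """Return (c_pos, c_neg) for one cluster of labelled samples."""
    positives = [x for x, y in zip(samples, labels) if y == 1]
    negatives = [x for x, y in zip(samples, labels) if y == -1]
    return centroid(positives), centroid(negatives)

cluster = [(0.0, 0.0), (2.0, 0.0), (4.0, 4.0), (6.0, 4.0)]
cluster_labels = [1, 1, -1, -1]
print(class_centroids(cluster, cluster_labels))
```

Here the positive centroid is the midpoint of the first two samples and the negative centroid is the midpoint of the last two.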
Step S107, the distances from each sample to its homogeneous centroid and its heterogeneous centroid are calculated. For a positive class sample, the homogeneous centroid is c+ and the heterogeneous centroid is c-; for a negative class sample, the homogeneous centroid is c- and the heterogeneous centroid is c+.
Step S108, the improved relative density is defined. Given a data set D, D consists of a positive class data set D+ and a negative class data set D-. For any point o+ ∈ D+, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{+}) = \frac{\mathrm{distance}(o^{+}, c^{+})}{\mathrm{distance}(o^{+}, c^{-})}$$
For any point o- ∈ D-, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{-}) = \frac{\mathrm{distance}(o^{-}, c^{-})}{\mathrm{distance}(o^{-}, c^{+})}$$
Here distance(o+, c+) and distance(o+, c-) are the distances from o+ to its homogeneous centroid and its heterogeneous centroid, and distance(o-, c-) and distance(o-, c+) are the distances from o- to its homogeneous centroid and its heterogeneous centroid.
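The improved relative density of a single sample can be sketched as below. The sketch assumes that the ratio is the distance to the homogeneous centroid divided by the distance to the heterogeneous centroid, which is the reading consistent with flagging samples whose value exceeds the threshold. Euclidean distance, labels encoded as +1 and -1, and the function names are further assumptions.

```python
import math

def euclidean(p, q):
    """Assumed distance measure: plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def improved_relative_density(o, label, c_pos, c_neg):
    """Distance to the same-class centroid divided by distance to the other-class centroid."""
    same, other = (c_pos, c_neg) if label == 1 else (c_neg, c_pos)
    d_other = euclidean(o, other)
    # A sample sitting exactly on the other-class centroid is treated as maximally suspicious.
    return euclidean(o, same) / d_other if d_other > 0 else math.inf

c_pos, c_neg = (0.0, 0.0), (4.0, 0.0)
print(improved_relative_density((1.0, 0.0), 1, c_pos, c_neg))  # near its own centroid
print(improved_relative_density((3.0, 0.0), 1, c_pos, c_neg))  # near the other centroid
```

A value below 1 means the sample lies nearer its own class centroid; a value above 1 means it lies nearer the centroid of the other class.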
Step S109, the value of K is changed, steps S101 to S108 are repeated, and the improved relative density of each sample is calculated at different granularities. The loop ends when the value of K exceeds the preset K value.
Step S1010, the improved relative densities of each sample obtained at all granularities are averaged, and if the average is greater than the noise threshold, the sample is determined to be label noise.
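Steps S101 to S1010 can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the patented implementation: the clustering is a bare-bones Lloyd iteration standing in for KMeans, the distance is Euclidean, labels are +1 and -1, the ratio is same-class distance over other-class distance, and clusters that contain only one class are skipped because the text does not say how to treat them. All function names are hypothetical.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(samples, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns one cluster index per sample."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)
    assign = [0] * len(samples)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: euclidean(x, centers[c])) for x in samples]
        for c in range(k):
            members = [x for x, a in zip(samples, assign) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = mean_point(members)
    return assign

def mgrd_scores(samples, labels, k_values, seed=0):
    """Mean improved relative density of each sample over all granularities."""
    totals = [0.0] * len(samples)
    counts = [0] * len(samples)
    for k in k_values:
        assign = kmeans(samples, k, seed=seed)
        for c in range(k):
            idx = [i for i, a in enumerate(assign) if a == c]
            pos = [samples[i] for i in idx if labels[i] == 1]
            neg = [samples[i] for i in idx if labels[i] == -1]
            if not pos or not neg:
                continue  # assumption: single-class clusters contribute nothing
            c_pos, c_neg = mean_point(pos), mean_point(neg)
            for i in idx:
                same, other = (c_pos, c_neg) if labels[i] == 1 else (c_neg, c_pos)
                d_other = euclidean(samples[i], other)
                if d_other == 0:
                    continue  # assumption: skip the degenerate zero-distance case
                totals[i] += euclidean(samples[i], same) / d_other
                counts[i] += 1
    return [t / n if n else 0.0 for t, n in zip(totals, counts)]

def detect_label_noise(samples, labels, k_values, threshold):
    """Indices of samples whose mean improved relative density exceeds the threshold."""
    scores = mgrd_scores(samples, labels, k_values)
    return [i for i, s in enumerate(scores) if s > threshold]

samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.5),
           (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
labels = [1, 1, 1, 1, 1, -1, -1, -1, -1]   # sample 4 sits in the negative blob
print(detect_label_noise(samples, labels, k_values=[1, 2], threshold=1.0))
```

The example data places one positively labelled sample inside the negative group, so its distance to the positive centroid is much larger than its distance to the negative centroid at the coarsest granularity.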
FIG. 2 shows an example of label noise:
FIG. 2 shows label noise points in two situations, inside and outside a data cluster: O1 and O2. Analysis yields the following characteristic: the absolute density of the label noise points O1 and O2 among samples of their own class is smaller than among samples of the other class. This motivates the concepts of homogeneous k neighbors and heterogeneous k neighbors, and in turn the relative density model for detecting label noise.
When the relative density of a sample point o is greater than 1, o is closer to the heterogeneous samples than to the homogeneous samples, and the sample point is likely to be label noise. By continuously increasing the relative density threshold (RD-threshold), the learning algorithm can obtain an optimal relative density threshold.
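The threshold search described above can be sketched as a simple sweep: raise the threshold step by step, keep the samples whose score does not exceed it, and remember the threshold that gives the best evaluation. The text does not specify how a threshold is judged, so the sketch takes a caller-supplied `evaluate` function, a hypothetical stand-in for, say, the validation accuracy of a classifier trained on the kept samples. The sweep range and step are also assumptions.

```python
def sweep_threshold(scores, evaluate, start=1.0, stop=3.0, step=0.25):
    """Try increasing thresholds and return (best_threshold, best_value)."""
    best_t, best_val = None, float("-inf")
    n_steps = int(round((stop - start) / step))
    for s in range(n_steps + 1):
        t = start + s * step
        kept = [i for i, score in enumerate(scores) if score <= t]
        value = evaluate(kept)
        if value > best_val:  # strict comparison keeps the smallest best threshold
            best_t, best_val = t, value
    return best_t, best_val

# Toy stand-in for a real evaluation: reward kept clean samples, penalise kept noisy ones.
scores = [0.2, 0.4, 1.1, 1.6, 2.8]
noisy = {4}
toy_evaluate = lambda kept: sum(-1 if i in noisy else 1 for i in kept)
print(sweep_threshold(scores, toy_evaluate))
```

In the toy data, raising the threshold first admits more clean samples and eventually admits the noisy one, so the best value is reached before the end of the sweep.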
The following table compares the accuracy of the method of the invention (MGRD) with that of the conventional relative density algorithm (RD Method) under an SVM classifier. In most cases MGRD achieves an accuracy no lower than that of the original classification algorithm and comparable to that of the RD Method, and it is slightly better on several data sets such as clearcancer.
[Figure BDA0002301186510000061: table image, accuracy comparison]
The following table compares the running time of MGRD with that of the RD Method under an SVM classifier. On small data sets the two differ little in time overhead, but on large data sets MGRD is much faster, especially on the codna data set.
[Figure BDA0002301186510000062: table image, running time comparison]
The sizes of the data sets are shown in the following table:
[Figure BDA0002301186510000071: table image, data set sizes]
finally, it is noted that the above-mentioned preferred embodiments illustrate rather than limit the invention, and that, although the invention has been described in detail with reference to the above-mentioned preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the scope of the invention as defined by the appended claims.

Claims (4)

1. A label noise detection method based on multi-granularity relative density is characterized by comprising the following steps:
dividing a data set D into K clusters, wherein the data of each cluster comprises a positive class sample set Ci+ and a negative class sample set Ci-;
calculating an improved relative density for each sample;
changing the K value, and repeating the two steps above;
averaging all the improved relative densities of each sample;
comparing the mean with a noise threshold, the sample being label noise if the mean is greater than the noise threshold.
2. The label noise detection method based on multi-granularity relative density according to claim 1, characterized in that: the calculating the improved relative density comprises the steps of:
calculating the centroids of the positive class sample set Ci+ and the negative class sample set Ci- respectively, the positive class centroid being denoted c+ and the negative class centroid being denoted c-;
calculating the distances from each sample to its homogeneous centroid and its heterogeneous centroid;
taking the ratio of the two distances obtained above as the improved relative density.
3. The label noise detection method based on multi-granularity relative density according to claim 2, characterized in that: the improved relative density is calculated by the following method:
D is composed of a positive class data set D+ and a negative class data set D-;
for any point o+ ∈ D+, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{+}) = \frac{\mathrm{distance}(o^{+}, c^{+})}{\mathrm{distance}(o^{+}, c^{-})}$$
for any point o- ∈ D-, the improved relative density is expressed as:
$$\mathrm{ImprovedRelativeDensity}(o^{-}) = \frac{\mathrm{distance}(o^{-}, c^{-})}{\mathrm{distance}(o^{-}, c^{+})}$$
distance(o+, c+) and distance(o+, c-) respectively represent the distances from o+ to its homogeneous centroid and its heterogeneous centroid, and distance(o-, c-) and distance(o-, c+) respectively represent the distances from o- to its homogeneous centroid and its heterogeneous centroid.
4. The label noise detection method based on multi-granularity relative density according to any one of claims 1 to 3, characterized in that: the KMeans algorithm is used to divide the data set D into K clusters.
CN201911222298.XA 2019-12-03 2019-12-03 Label noise detection method based on multi-granularity relative density Pending CN111178387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911222298.XA CN111178387A (en) 2019-12-03 2019-12-03 Label noise detection method based on multi-granularity relative density

Publications (1)

Publication Number Publication Date
CN111178387A true CN111178387A (en) 2020-05-19

Family

ID=70653780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911222298.XA Pending CN111178387A (en) 2019-12-03 2019-12-03 Label noise detection method based on multi-granularity relative density

Country Status (1)

Country Link
CN (1) CN111178387A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114638322A (en) * 2022-05-20 2022-06-17 南京大学 Full-automatic target detection system and method based on given description in open scene
CN114638322B (en) * 2022-05-20 2022-09-13 南京大学 Full-automatic target detection system and method based on given description in open scene
CN117313900A (en) * 2023-11-23 2023-12-29 全芯智造技术有限公司 Method, apparatus and medium for data processing
CN117313900B (en) * 2023-11-23 2024-03-08 全芯智造技术有限公司 Method, apparatus and medium for data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200519