CN112801113A

CN112801113A - Data denoising method based on multi-scale reliable clustering

Info

Publication number: CN112801113A
Application number: CN202110173919.0A
Authority: CN
Inventors: 王素玉; 李越
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2021-05-14

Abstract

The invention discloses a data denoising method based on multi-scale reliable clustering. To address the problem of dirty, disordered training data in deep learning, it provides a single-class denoising algorithm for preprocessing the data of a commodity identification model. For each class, the algorithm uses clustering to find the outliers that fall outside every cluster, treats them as noise data, and removes them. This yields an end-to-end automatic data cleaning process, saves the time otherwise spent cleaning data manually, and greatly improves the efficiency of data processing. Experimental results show that a network trained on the denoised data set achieves higher accuracy than one trained on the original data, which indirectly demonstrates the effectiveness of the denoising method.

Description

Data denoising method based on multi-scale reliable clustering

Technical Field

The invention belongs to the field of image processing and computer vision, and particularly relates to a data denoising method based on multi-scale reliable clustering.

Background

In recent years, DBSCAN-based clustering has been widely used in the data processing steps of pedestrian re-identification, where its purpose is to enable unsupervised learning through clustering. It can also be used for data cleaning, but the clustering effect is poor when the sample density is unevenly distributed or the inter-class distances differ greatly. DBSCAN is the most widely used density-based clustering method, and several variants address its weaknesses. The Heterology DBSCAN (HDBSCAN) algorithm overcomes the unsatisfactory clustering of the original DBSCAN under unbalanced density, strengthening its adaptability to unbalanced data by defining two parameters, longRegionQuery and ShortRegionQuery. The Enhanced DBSCAN (EDBSCAN) algorithm also targets the low clustering accuracy under non-uniform data density: it calculates the distance density between different central points and all points, and if the density is smaller than a preset threshold the central point is used for the next round of clustering, otherwise the point is assigned as a boundary point. To improve running efficiency, L-DBSCAN was proposed as a comprehensive and efficient algorithm that selects a suitable clustering method according to the characteristics of the data set, and it is suitable for clustering large-scale data sets.

Disclosure of Invention

The problem addressed by the invention is as follows. In data processing before deep learning training, the training data are dirty and disordered and contain a large amount of noise data, which can seriously reduce the accuracy of network training, while existing approaches can only screen the data by manual cleaning and cannot process them efficiently. A new clustering-based data denoising method is therefore needed, so that data can be cleaned algorithmically at large scale.

To solve these problems, the invention provides a data denoising method based on multi-scale reliable clustering. By clustering all the data, it finds the noise data located at outlier points and removes them, which saves the time of manually selecting data and achieves the purpose of cleaning the data. As shown in fig. 1, the method includes the following steps:

Step 1, data feature extraction. The data selected for the experiments are the commodity identification data of the CVPR 2020 AliProducts Challenge: Large-scale Product Recognition competition, comprising about 50,000 SKU classes and 3 million items in total. First, features are extracted from all data with a pre-trained commodity identification network model. The feature of each item is taken from the layer just before the fully connected (FC) layer and converted into a 1 × 2048 tensor; the corresponding labels are generated at the same time and saved to a data file in .npy format.

Step 2, algorithm clustering. DBSCAN is used as the clustering algorithm, with improvements made to it. On the basis of step 1, DBSCAN clustering is run once on the data within each SKU class. In the returned result, non-outlier data carry labels that are non-negative integers (0, 1, 2, ...) giving the index of the cluster each item belongs to, while outlier (noise) data carry the label -1.

Step 3, preliminary noise detection. Using the clustering result of step 2, the indices of the data labeled -1 are obtained, the storage path of each such item is looked up, and each outlier is deleted by its path, completing the data denoising process.

Step 4, deep refinement. During preliminary noise detection, if a class contains too many items, the clustering quality degrades: as much as forty percent of the data may be judged as noise, and this includes a large number of correct samples, which distorts the distribution of the training data. To correct this, the mean feature of all data in each sub-cluster obtained by clustering the class is first computed and used as a feature of a query library. Each item judged as noise is then compared with the features in the query library; if its similarity to a feature in the library exceeds eighty percent, the item is considered misjudged and is not deleted during denoising. Applying this second check to the noise data of every class greatly reduces the misjudgment rate.
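The second check of step 4 can be sketched as follows. This is a minimal illustration only: the text does not state which similarity measure is used, so cosine similarity is assumed here, and the names `refine_noise`, `mean_vector`, and `cosine_similarity`, as well as the toy two-dimensional features, are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine similarity (an assumed choice of similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def refine_noise(features, labels, threshold=0.8):
    """Return the indices that should still be deleted after the second check.

    features: one feature vector per item of a single class
    labels:   DBSCAN-style labels, -1 marking noise candidates
    """
    # Build the query library: one mean feature per sub-cluster.
    clusters = {}
    for feat, lab in zip(features, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(feat)
    query_library = [mean_vector(members) for members in clusters.values()]

    to_delete = []
    for idx, (feat, lab) in enumerate(zip(features, labels)):
        if lab != -1:
            continue
        best = max((cosine_similarity(feat, c) for c in query_library), default=0.0)
        if best < threshold:  # not similar to any sub-cluster: keep it as noise
            to_delete.append(idx)
    return to_delete

# Toy data: item 2 resembles the cluster mean, item 3 does not.
features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
labels = [0, 0, -1, -1]
remaining_noise = refine_noise(features, labels)
```

In this toy run, only the item that is dissimilar to every sub-cluster mean remains marked for deletion; the other noise candidate is rescued.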

As a further preferred mode, step 2 comprises the following specific steps:

The clustering process uses the DBSCAN algorithm. Unlike k-means clustering, DBSCAN does not require the number of clusters to be specified in advance, and it handles both convex and non-convex sample sets, so dense data sets of arbitrary shape can be clustered.

DBSCAN involves four important concepts: direct density-reachability, density-reachability, eps, and min_samples. As shown in fig. 2, direct density-reachability means that object A lies within the eps-neighborhood of object B. Density-reachability means that there is a chain of objects A, B, C in which A and B are directly density-reachable and B and C are directly density-reachable, so that A and C are density-reachable. eps is the radius of the neighborhood used to define density, and min_samples is the threshold that determines whether a point is a core point.

When clustering, the data are first traversed. Starting from some core object p, all points density-connected to p are found and added to the cluster containing p; the core objects among the points directly density-reachable from p are then found, and their density-connected points are added to the same cluster. This is repeated recursively until the cluster can no longer be expanded. Finally, the clustering algorithm returns a sequence of integers that is as long as the original data set, each no smaller than -1; the number at each position gives the cluster to which the data item at that index was assigned. Denoising is completed simply by finding the pictures whose cluster value is -1. In fig. 2, A denotes a core object, B and C denote border points, and N denotes an outlier. Point N is the noise data found by clustering.
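The cluster-expansion procedure described above can be sketched in plain Python. This is a minimal, unoptimized illustration under stated assumptions, not the implementation used in the invention: Euclidean distance is assumed, the names `region_query` and `dbscan` and the toy points are hypothetical, and a production system would more likely rely on an existing library implementation.

```python
import math
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, 2, ... for clusters and -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core object: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:  # j is also a core object
                queue.extend(j_neighbors)
    return labels

# Two dense groups and one isolated point (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
noise_indices = [i for i, lab in enumerate(labels) if lab == NOISE]
```

The returned list has one entry per input point, and the indices collected in `noise_indices` correspond to the items that preliminary noise detection would delete.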

To verify the denoising performance of the clustering algorithm, 8632 items of head data in 40 classes and 8632 items of tail data in 20 classes were selected from the AliProducts Challenge data set for denoising experiments. Each class contains ten percent noise data. The data were clustered with the DBSCAN algorithm, and noise data were found and removed by adjusting the two thresholds eps and min_samples, thereby cleaning the data. The details of the experiments are as follows.

The table shows that applying the DBSCAN clustering algorithm to the data removes about ninety percent of the noise data. However, the interaction between the two parameters eps and min_samples is complex and the algorithm has limitations, so some noise data are missed during clustering. This motivated the search for an algorithm with a better and more reliable clustering effect.

We propose a multi-scale reliable clustering method. Two metrics are first defined: the independence of a cluster and the compactness of a cluster.

Independence of clusters. A reliable cluster should be independent of other clusters and of individual samples. Intuitively, a cluster can be considered highly independent if it is far from other samples. However, because the density of the underlying feature space is uneven, the distance from the cluster centroid to samples outside the cluster cannot be used to measure cluster independence. In general, the clustering result can be adjusted by changing hyper-parameters of the clustering criterion: the criterion may be relaxed so that each cluster contains more samples, or tightened so that each cluster contains fewer samples. We use $\mathcal{I}(f_i)$ to represent the cluster set containing sample $f_i$.

When clustering samples with the DBSCAN algorithm, we propose the following metric to measure cluster independence, expressed as an intersection-over-union (IoU) score:

$$R_{indep}(f_i) = \frac{|\mathcal{I}(f_i) \cap \mathcal{I}_{loose}(f_i)|}{|\mathcal{I}(f_i) \cup \mathcal{I}_{loose}(f_i)|} \qquad (1)$$

In equation 1, $\mathcal{I}_{loose}(f_i)$ is the cluster set containing $f_i$ when the clustering criterion becomes loose. The larger $R_{indep}(f_i)$ is, the more independent the cluster $\mathcal{I}(f_i)$ is; that is, even if the clustering criterion is loosened, no new samples are absorbed into the new cluster. As can be seen from the upper half of fig. 3, if a cluster has poor independence, samples that do not belong to the category may be mistakenly merged into it after the clustering criterion is relaxed, so that the shape of the cluster changes and the clustering performance suffers.
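The independence score can be illustrated with a short Python sketch. It assumes that the same samples have been clustered twice, once under the current criterion and once under a looser one (for DBSCAN, for instance, a larger eps), and that each run yields one label per sample with -1 marking noise. The names `cluster_members`, `iou`, and `independence`, the treatment of a noise sample as a singleton set, and the toy labels are all assumptions made for illustration.

```python
def cluster_members(labels, i):
    """Set of sample indices clustered together with sample i.

    A noise sample (label -1) is treated as a singleton set, which is an
    assumption of this sketch.
    """
    if labels[i] == -1:
        return {i}
    return {j for j, lab in enumerate(labels) if lab == labels[i]}

def iou(a, b):
    """Intersection-over-union of two non-empty index sets."""
    return len(a & b) / len(a | b)

def independence(labels, loose_labels, i):
    """IoU between the cluster of sample i under the current and looser criteria."""
    return iou(cluster_members(labels, i), cluster_members(loose_labels, i))

# Toy labels: loosening the criterion leaves cluster 0 unchanged but lets
# cluster 1 absorb the former noise sample at index 5.
labels = [0, 0, 0, 1, 1, -1]
loose_labels = [0, 0, 0, 1, 1, 1]
score_stable = independence(labels, loose_labels, 0)    # cluster unchanged
score_absorbing = independence(labels, loose_labels, 3)  # cluster grew from 2 to 3
```

A score of 1 means the cluster did not change when the criterion was relaxed; a lower score means the cluster absorbed additional samples.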

Compactness of the cluster. A reliable cluster should also be compact, i.e. samples within the same cluster should have small inter-sample distances. In the extreme case, when a cluster is as compact as possible, the distance between all samples within the cluster is zero, and even if the clustering criterion is tightened its samples are not split into different clusters. Based on this assumption, we define the following metric to measure compactness within a cluster:

$$R_{comp}(f_i) = \frac{|\mathcal{I}(f_i) \cap \mathcal{I}_{tight}(f_i)|}{|\mathcal{I}(f_i) \cup \mathcal{I}_{tight}(f_i)|} \qquad (2)$$

Here $\mathcal{I}(f_i)$ is the cluster set containing sample $f_i$ under the current criterion, and $\mathcal{I}_{tight}(f_i)$ is the cluster set containing $f_i$ when the clustering criterion becomes strict. The larger $R_{comp}(f_i)$ is, the smaller the inter-sample distances around $f_i$ within $\mathcal{I}(f_i)$ after clustering. As can be seen in the lower part of fig. 3, when a stricter criterion is adopted, clusters with larger distances between samples contain fewer points, so that a cluster originally belonging to one class is divided into two or more classes; such a cluster is said to have poor compactness.
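A small worked example may help to read the compactness score. The notation is assumed here for illustration: $C$ is the set of samples clustered together with a given sample under the current criterion, and $C_{tight}$ is the set clustered with the same sample under the stricter criterion, taken to be a subset of $C$. The numbers are hypothetical.

```latex
% Hypothetical numbers, for illustration only.
% Case 1: |C| = 10 and the stricter criterion keeps 9 of those samples together.
\frac{|C \cap C_{tight}|}{|C \cup C_{tight}|} = \frac{9}{10} = 0.9
% Case 2: the stricter criterion splits C into two halves of 5 samples each.
\frac{|C \cap C_{tight}|}{|C \cup C_{tight}|} = \frac{5}{10} = 0.5
```

The first cluster barely changes when the criterion is tightened and would count as compact; the second breaks apart and would count as poorly compact.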

Using these indices of cluster reliability, the independence and compactness scores of every data point in a cluster can be calculated, so that each item is judged more finely and the noise data missed by the preliminary detection are found.
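One possible way to turn the two scores into a per-item decision is sketched below. The text does not specify how the scores are thresholded, so the rule used here (keep an item only if both scores reach a threshold) and the threshold values are assumptions, as are the names `members`, `iou`, and `reliable_mask` and the toy labels. The three label lists are assumed to come from clustering the same samples under the current, a stricter, and a looser criterion.

```python
def members(labels, i):
    """Indices clustered with sample i; a noise sample (-1) stands alone."""
    if labels[i] == -1:
        return {i}
    return {j for j, lab in enumerate(labels) if lab == labels[i]}

def iou(a, b):
    """Intersection-over-union of two non-empty index sets."""
    return len(a & b) / len(a | b)

def reliable_mask(labels, tight_labels, loose_labels, alpha, beta):
    """True for items kept as clean, False for items treated as noise.

    alpha: minimum independence score (assumed threshold)
    beta:  minimum compactness score (assumed threshold)
    """
    mask = []
    for i, lab in enumerate(labels):
        if lab == -1:
            mask.append(False)  # already an outlier at the current scale
            continue
        current = members(labels, i)
        r_indep = iou(current, members(loose_labels, i))
        r_comp = iou(current, members(tight_labels, i))
        mask.append(r_indep >= alpha and r_comp >= beta)
    return mask

# Toy labels at three scales. Item 3 is clustered at the current scale but
# falls out of the cluster once the criterion is tightened.
labels = [0, 0, 0, 0, -1]
tight_labels = [0, 0, 0, -1, -1]
loose_labels = [0, 0, 0, 0, 0]
mask = reliable_mask(labels, tight_labels, loose_labels, alpha=0.7, beta=0.7)
```

In this toy run the item that only joins the cluster at the current scale is flagged in addition to the original outlier, which is the kind of missed noise the finer judgment is meant to catch.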

To verify the denoising performance of the multi-scale reliable clustering algorithm, a comparison experiment was set up on the same data set. The experimental results show that all noise data can be removed from the head data, and that the denoising of the tail data also performs well.

The comparison experiment shows that the method based on multi-scale reliable clustering performs much better than the other clustering methods, so the improved DBSCAN clustering algorithm was finally selected for denoising.

The core technology of this patent includes:

(1) A new clustering idea is adopted: the accuracy of data clustering is improved by the multi-scale reliable clustering method.

(2) An end-to-end automatic data cleaning model is constructed, which cleans data automatically and can remove ninety percent of the noise data.

(3) Training the model after preprocessing the data with the data cleaning method improves the model's recognition accuracy by one percent.

Drawings

FIG. 1 is a flow chart of the data denoising method based on multi-scale reliable clustering of the present invention.

FIG. 2 is a schematic diagram of the principle of the DBSCAN clustering algorithm of the present invention.

FIG. 3 is a schematic diagram of the principle of the multi-scale reliable clustering algorithm of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples.

A data set of fifty thousand classes was downloaded from the CVPR 2020 AliProducts Challenge: Large-scale Product Recognition competition. Twenty head classes and twenty tail classes were randomly selected, features were extracted from the 8632 items in these 40 classes with our recognition model, and the labels and paths of the data were saved in .npy format.

After the features are obtained, the data are clustered with the multi-scale reliable clustering algorithm. The stability of each cluster is judged from how the independence and compactness of each item fluctuate after clustering, so that more noise data are filtered out and very clean training data are obtained.

Without changing the training procedure, the training data of the commodity recognition model are denoised as a whole with this method and, according to the chart, the recognition accuracy of the model improves by one percent.

Claims

1. A data denoising method based on multi-scale reliable clustering, characterized by comprising the following four steps:

step 1, data feature extraction: extracting features from all data with a pre-trained commodity identification network model, taking the feature of each item from the layer just before the fully connected (FC) layer, converting it into a 1 × 2048 tensor, generating the corresponding labels at the same time, and saving the labels to a data file in .npy format;

step 2, algorithm clustering: using the DBSCAN algorithm as the clustering algorithm and running DBSCAN clustering once on the data within each SKU class, wherein in the returned result the label of non-outlier data is a non-negative integer (0, 1, 2, ...) representing the index of the cluster to which the item belongs, and the label of outlier noise data is -1;

step 3, preliminary noise detection: obtaining the indices of the data labeled -1 from the clustering of step 2, then obtaining the storage path of each such item, and deleting each outlier according to its path, thereby completing the data denoising process;

step 4, deep refinement: on the basis of the preliminary noise detection, judging each item again more precisely in order to correct noise data that were wrongly classified; first computing the mean feature of all data in each sub-cluster obtained by clustering the class and using it as a feature of a query library, then comparing each item judged as noise with the features in the query library; if the similarity between the item and a feature in the query library exceeds eighty percent, judging the item to be normal data and not deleting it during denoising; otherwise judging it to be noise data and deleting it.

2. The method for denoising data based on multi-scale reliable clustering according to claim 1, wherein:

in step 2, during clustering, the data are first traversed; starting from some core object p, all points density-connected to p are found and added to the cluster containing p; the core objects among the points directly density-reachable from p are then found, and their density-connected points are added to the same cluster; this is repeated recursively until the cluster can no longer be expanded; finally, the clustering algorithm returns a sequence of integers that is as long as the original data set, each no smaller than -1, the number at each position representing the cluster to which the data item at that index was assigned, and the denoising process is completed simply by finding the pictures whose cluster value is -1.