CN112801113A - Data denoising method based on multi-scale reliable clustering - Google Patents

Info

Publication number
CN112801113A
Authority
CN
China
Prior art keywords
data
clustering
algorithm
noise
denoising
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110173919.0A
Other languages
Chinese (zh)
Inventor
王素玉
李越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110173919.0A
Publication of CN112801113A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data denoising method based on multi-scale reliable clustering. To address the dirty, noisy training data encountered in deep learning, it provides a per-class denoising algorithm for preprocessing the data of a commodity identification model. For each class, the algorithm clusters the data, identifies the outliers that fall outside every cluster (the noise data), and removes them. This yields an end-to-end automatic data cleaning process that eliminates the time spent on manual cleaning and greatly improves data-processing efficiency. Experimental results show that a network trained on the denoised data set reaches higher accuracy than one trained on the original data, which indirectly demonstrates the effectiveness of the denoising method.

Description

Data denoising method based on multi-scale reliable clustering
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a data denoising method based on multi-scale reliable clustering.
Background
In recent years, DBSCAN-based clustering has been widely used in the data-processing stage of pedestrian re-identification, where clustering serves as a form of unsupervised learning; it can also be used for data cleaning. However, clustering quality suffers from problems such as uneven sample density and large differences in inter-class distance. DBSCAN is the most widely used density-based clustering method. The Heterology DBSCAN (HDBSCAN) algorithm addresses the original DBSCAN's poor results on data of unbalanced density: it defines two parameters, longRegionQuery and ShortRegionQuery, to adapt better to unbalanced data. The Enhanced DBSCAN (EDBSCAN) algorithm also targets low clustering accuracy under non-uniform density: it computes the distance density between each central point and all other points; if the density is smaller than a preset threshold, the central point is used for the next round of clustering, otherwise the point is marked as a boundary point. To improve running efficiency, L-DBSCAN was proposed as a comprehensive, efficient algorithm that selects a suitable clustering method according to the characteristics of the data set, which makes it suitable for clustering large-scale data sets.
Disclosure of Invention
The invention aims to solve the following problem: in the data processing that precedes deep learning training, the training data are dirty and contain a large amount of noise, which can seriously reduce the accuracy of the trained network, while existing practice screens the data only by manual cleaning and cannot process it efficiently. A new clustering-based data denoising method is therefore needed, so that data can be cleaned algorithmically at large scale.
To solve this problem, the invention provides a data denoising method based on multi-scale reliable clustering. It clusters all the data, identifies the noise data that appear as outliers, and removes them, which saves the time otherwise spent selecting data by hand and achieves the goal of cleaning the data. As shown in fig. 1, the data denoising method based on multi-scale reliable clustering includes the following steps:
Step 1, data feature extraction. The data used in the experiments are the commodity identification data of the CVPR 2020 AliProducts Challenge: Large-scale Product Recognition competition, comprising 50,000 SKU classes and about 3 million images. First, features are extracted from all data with a pre-trained commodity identification network model. The feature of each sample is taken from the layer immediately before the fully connected (FC) layer and stored as a 1 × 2048 tensor; the corresponding labels are generated at the same time, and both are saved to data files in .npy format.
Step 2, algorithmic clustering. The DBSCAN algorithm is used as the clustering algorithm and is improved upon. On the basis of step 1, DBSCAN is run once on the data of each SKU class. In the returned result, each non-outlier sample carries a non-negative integer label (0, 1, 2, ...) giving the number of its cluster, while outliers, i.e. noise data, carry the label -1.
Step 3, preliminary noise detection. From the clustering result of step 2, the indices of the samples labeled -1 are obtained; the storage path of each such sample is then looked up and the sample is deleted by path, which completes the denoising pass.
Step 4, deep refinement. During preliminary noise detection, if a class contains too many samples, clustering quality degrades: as much as forty percent of the data may be judged as noise, and this includes many correct samples, which distorts the distribution of the training data. Therefore, the mean feature of all samples in each sub-cluster obtained for the class is first computed and stored as a query library. Each sample flagged as noise is then compared with the features in the query library; if its similarity to a library feature reaches eighty percent or more, the sample is considered misjudged and is not deleted during denoising. This secondary check on the noise data of each class substantially reduces the misjudgment rate.
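The preliminary detection and refinement in steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not name the similarity measure, so cosine similarity is assumed here, and the function names `cosine` and `refine_noise` as well as the toy 2-dimensional features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def refine_noise(features, labels, threshold=0.8):
    """Split samples labelled -1 into (rescued, deleted) index lists."""
    # Query library: mean feature of every sub-cluster of the class.
    members = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            members.setdefault(lab, []).append(features[idx])
    library = {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in members.items()
    }
    rescued, deleted = [], []
    for idx, lab in enumerate(labels):
        if lab != -1:
            continue  # only samples flagged as noise are re-checked
        best = max((cosine(features[idx], c) for c in library.values()),
                   default=0.0)
        (rescued if best >= threshold else deleted).append(idx)
    return rescued, deleted

# Toy data: three clustered samples and two samples flagged as noise.
features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.8, 0.3], [0.0, 1.0]]
labels = [0, 0, 0, -1, -1]
rescued, deleted = refine_noise(features, labels)
print(rescued, deleted)  # [3] [4]
```

In this toy run, sample 3 points in almost the same direction as the cluster mean and is kept, while sample 4 is nearly orthogonal to it and stays marked for deletion.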
As a further preferred mode, step 2 comprises the following specific steps:
In the clustering process, the DBSCAN algorithm is adopted. Unlike k-means, DBSCAN does not require the number of clusters to be specified in advance, and it handles both convex and non-convex sample sets, so dense data sets of arbitrary shape can be clustered. DBSCAN rests on two concepts, directly density-reachable and density-reachable, and two parameters, eps and min_samples. As shown in fig. 2, object A is directly density-reachable from object B when A lies within the eps-neighborhood of B and B is a core point. Density-reachable means that a chain of objects A, B, C exists in which A and B are directly density-reachable and B and C are directly density-reachable, so that A and C are density-reachable. eps is the neighborhood radius used when density is defined, and min_samples is the threshold number of neighbors that makes a point a core point.
During clustering, the data are traversed. Starting from some core object p, all points density-connected to p are found and added to the cluster containing p; the core objects among the points directly density-reachable from p are then located, and their density-connected points are added to the same cluster. This proceeds recursively until the cluster can no longer be expanded. The clustering algorithm finally returns a sequence of integers as long as the original data set, with values from -1 upward; the number at each position is the cluster to which the sample at that index was assigned. Denoising is complete once the images whose cluster value is -1 have been found. In fig. 2, A denotes a core object, B and C denote boundary points, and N denotes an outlier; point N is the noise data found by clustering.
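The cluster-expansion procedure described above can be sketched in pure Python. This is a simplified illustration rather than the code used in the experiments: the names `region_query` and `dbscan`, the sample points, and the parameter values are assumptions, the neighbor count includes the point itself, and the expansion uses an explicit stack instead of recursion.

```python
import math

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster_id  # i is a core object: start a new cluster
        stack = list(neighbours)
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_samples:
                stack.extend(reachable)  # j is also a core object: keep expanding
        cluster_id += 1
    return labels

# Two dense groups of four points each and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_samples=3)
noise_indices = [i for i, lab in enumerate(labels) if lab == NOISE]
print(labels)         # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(noise_indices)  # [8]
```

The list `noise_indices` corresponds to the "data index with a label of -1" that the later deletion step works from.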
To verify the denoising performance of the clustering algorithm, 8,632 images from 40 classes (20 head classes and 20 tail classes) of the AliProducts Challenge data set were selected for the denoising experiments, with ten percent noise data in each class. The data were clustered with the DBSCAN algorithm, and the noise data were found and removed by adjusting the two thresholds eps and min_samples, thereby cleaning the data. The detailed experimental results follow.
[Table image: results of the DBSCAN denoising experiment]
The table shows that DBSCAN removes about ninety percent of the noise data. However, the two parameters eps and min_samples interact in a complex way, which limits the algorithm: part of the noise is missed during clustering. A better and more reliable clustering algorithm was therefore investigated.
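When the injected noise is known, as in this experiment, the share of noise removed can be measured directly. The sketch below shows one way to do so; the function name `denoising_report`, the metric names, and the sample indices are hypothetical and do not reproduce the figures in the table.

```python
def denoising_report(true_noise, predicted_noise, total):
    """Compare the indices flagged as noise with the indices known to be noise."""
    true_noise = set(true_noise)
    predicted_noise = set(predicted_noise)
    clean_total = total - len(true_noise)
    removed = len(true_noise & predicted_noise)   # noise correctly removed
    lost = len(predicted_noise - true_noise)      # clean samples wrongly removed
    return {
        "noise_removed": removed / len(true_noise) if true_noise else 0.0,
        "clean_lost": lost / clean_total if clean_total else 0.0,
    }

# Hypothetical class of 100 samples in which indices 0-9 are injected noise.
report = denoising_report(
    true_noise=range(10),
    predicted_noise=[0, 1, 2, 3, 4, 5, 6, 7, 8, 42],
    total=100,
)
print(report["noise_removed"])  # 0.9
```

Tracking `clean_lost` alongside `noise_removed` makes the trade-off between eps and min_samples visible: a setting that removes more noise may also discard more correct samples.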
We propose a multi-scale reliable clustering approach. Two criteria are defined first: the independence of clusters and the compactness of clusters.
Independence of clusters. A reliable cluster should be independent of other clusters and of individual samples. Intuitively, a cluster is highly independent if it lies far from other samples. However, because the density of the underlying space is uneven, the distance between the cluster centroid and out-of-cluster samples cannot be used to measure independence. In general, the clustering result can be adjusted by changing the hyper-parameters of the clustering criterion: the criterion may be relaxed so that each cluster contains more samples, or tightened so that each cluster contains fewer. We use $\mathcal{I}(f_i)$ to denote the cluster containing sample $f_i$. When clustering samples with the DBSCAN algorithm, we propose the following metric of cluster independence, expressed as an intersection-over-union (IoU) score:
$$R_{\mathrm{indep}}(f_i) = \frac{|\mathcal{I}(f_i) \cap \mathcal{I}_{\mathrm{loose}}(f_i)|}{|\mathcal{I}(f_i) \cup \mathcal{I}_{\mathrm{loose}}(f_i)|} \in [0, 1] \qquad (1)$$
In Equation 1, $\mathcal{I}_{\mathrm{loose}}(f_i)$ is the cluster containing $f_i$ when the clustering criterion becomes looser. The larger $R_{\mathrm{indep}}(f_i)$ is, the more independent the cluster of $f_i$ is; that is, relaxing the clustering criterion brings no new samples into the cluster. As the upper half of fig. 3 shows, if a cluster has poor independence, samples that do not belong to the class may be wrongly absorbed into it once the criterion is relaxed, which changes the shape of the cluster and degrades clustering performance.
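The independence score can be computed from two label lists: one produced with the normal clustering criterion and one with a looser criterion (for DBSCAN, for example, a larger eps). The sketch below assumes that reading; the helper names `cluster_of`, `iou`, and `independence` and the label lists are illustrative, and treating a noise sample as a singleton cluster is a choice made here, not something the text specifies.

```python
def cluster_of(labels, i):
    """Indices that share sample i's cluster; a noise sample (-1) stands alone."""
    if labels[i] == -1:
        return {i}
    return {j for j, lab in enumerate(labels) if lab == labels[i]}

def iou(a, b):
    """Intersection over union of two index sets."""
    return len(a & b) / len(a | b)

def independence(labels, loose_labels, i):
    """IoU between sample i's cluster and its cluster under a looser criterion."""
    return iou(cluster_of(labels, i), cluster_of(loose_labels, i))

# Labels under the normal criterion, and under a looser one that absorbs
# the former outlier (index 5) into cluster 1.
labels = [0, 0, 0, 1, 1, -1]
loose_labels = [0, 0, 0, 1, 1, 1]

score_stable = independence(labels, loose_labels, 0)
score_grown = independence(labels, loose_labels, 3)
print(score_stable)  # 1.0
print(score_grown)
```

Cluster 0 is unchanged by the looser criterion, so its samples score 1.0; cluster 1 grows from two members to three, so its samples score 2/3 and the cluster is judged less independent.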
Compactness of clusters. A reliable cluster should also be compact, i.e. samples within the same cluster should lie close to one another. In the extreme case, when a cluster is as compact as possible, the distance between all samples within it is zero, and even if the clustering criterion is tightened its samples are not split into different clusters. Based on this assumption, we define the following metric of compactness within a cluster:
$$R_{\mathrm{comp}}(f_i) = \frac{|\mathcal{I}(f_i) \cap \mathcal{I}_{\mathrm{tight}}(f_i)|}{|\mathcal{I}(f_i) \cup \mathcal{I}_{\mathrm{tight}}(f_i)|} \in [0, 1] \qquad (2)$$
Here $\mathcal{I}(f_i)$ is the cluster containing sample $f_i$, and $\mathcal{I}_{\mathrm{tight}}(f_i)$ is the cluster containing $f_i$ when the clustering criterion becomes stricter. The larger $R_{\mathrm{comp}}(f_i)$ is, the smaller the distances between $f_i$ and the surrounding samples after clustering. As the lower part of fig. 3 shows, when a stricter criterion is adopted, a cluster whose samples lie far apart retains fewer points, so that samples originally belonging to one class are divided into two or more clusters; such a cluster is said to have poor compactness.
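A worked illustration of the compactness score may help. The symbols $\mathcal{I}(f_i)$, $\mathcal{I}_{\mathrm{tight}}(f_i)$ and $R_{\mathrm{comp}}$ and the sample counts are assumed here for illustration and are not taken from the experiments. Suppose the cluster containing a sample $f_i$ has 10 members, and under a tightened criterion it splits into a group of 6 that still contains $f_i$ and a group of 4. The intersection of the two cluster sets then has 6 members and their union has 10:

```latex
\[
R_{\mathrm{comp}}(f_i)
  = \frac{|\mathcal{I}(f_i) \cap \mathcal{I}_{\mathrm{tight}}(f_i)|}
         {|\mathcal{I}(f_i) \cup \mathcal{I}_{\mathrm{tight}}(f_i)|}
  = \frac{6}{10} = 0.6
\]
```

A cluster that does not split under the tightened criterion would score 10/10 = 1, the most compact case.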
According to these cluster-reliability indices, the independence and compactness scores of every data point in a cluster can be computed, so that each sample is judged at a finer granularity and noise data missed in the earlier pass are found.
To verify the denoising performance of the multi-scale reliable clustering algorithm, a comparison experiment was run on the same data set. The results below show that all noise data are removed from the head classes, and that denoising of the tail classes also performs well.
[Table image: results of the multi-scale reliable clustering denoising experiment]
The comparison shows that the multi-scale reliable clustering method performs much better than the other clustering methods; the improved DBSCAN clustering algorithm was therefore selected for denoising.
The core technology of this patent includes:
(1) A new clustering idea is adopted: a multi-scale reliable clustering method improves the accuracy of data clustering.
(2) An end-to-end automatic data cleaning model is constructed, which cleans data automatically and can remove ninety percent of the noise data.
(3) Training the model on data preprocessed with this cleaning method improves its recognition accuracy by one percent.
Drawings
FIG. 1 is a flow chart of the data denoising method based on multi-scale reliable clustering of the present invention.
FIG. 2 is a schematic diagram of the principle of the DBSCAN clustering algorithm used in the present invention.
FIG. 3 is a schematic diagram of the principle of the multi-scale reliable clustering algorithm of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
A data set of fifty thousand classes is downloaded from the CVPR 2020 AliProducts Challenge: Large-scale Product Recognition competition. Twenty head classes and twenty tail classes are randomly selected, features are extracted from the 8,632 images of these 40 classes with our recognition model, and the labels and paths of the data are saved in .npy format.
After the features are obtained, the data are clustered with the multi-scale reliable clustering algorithm. The stability of each cluster is judged from how much the independence and compactness of each sample fluctuate after clustering, so that more noise data are filtered out and very clean training data are obtained.
Without changing the training procedure, the whole training set of the commodity recognition model is denoised with this method; as the table shows, the recognition accuracy of the model improves by one percent.
[Table image: recognition accuracy before and after denoising]

Claims (2)

1. A data denoising algorithm based on multi-scale reliable clustering, characterized by comprising the following four steps:
step 1, data feature extraction: extracting the features of all data with a pre-trained commodity identification network model, taking the feature of each sample from the layer before the fully connected (FC) layer, converting it into a 1 × 2048 tensor, generating the corresponding labels at the same time, and saving them to a data file in .npy format;
step 2, algorithmic clustering: using the DBSCAN algorithm as the clustering algorithm, running DBSCAN once on the data of each SKU class, wherein in the returned result the label of each non-outlier sample is a non-negative integer representing the number of its cluster, and the label of each outlier, i.e. noise sample, is -1;
step 3, preliminary noise detection: obtaining the indices of the samples labeled -1 from the clustering of step 2, then obtaining the storage path of each such sample, and deleting each outlier sample according to its path, thereby completing the data denoising process;
step 4, deep refinement: on the basis of the preliminary noise detection, judging each sample again more precisely in order to correct wrongly flagged noise data; first averaging the features of all samples in each sub-cluster obtained for the class to form the features of a query library, then comparing each sample flagged as noise with the features in the query library, judging the sample to be normal data and not deleting it during denoising if its similarity to a feature in the query library reaches eighty percent or more; otherwise judging it to be noise data and deleting it.
2. The data denoising method based on multi-scale reliable clustering according to claim 1, wherein:
in step 2, during clustering, the data are first traversed; starting from a core object p, all points density-connected to the core object are found and added to the cluster containing p; the core objects among the points directly density-reachable from p are then found and their density-connected points are also added to the cluster containing p, the operation proceeding recursively until the cluster can no longer be expanded; finally, the result returned by the clustering algorithm is a sequence of integers as long as the original data set, with values from -1 upward, the number at each position representing the cluster to which the sample at that index was assigned, and the denoising process is completed once the images whose cluster value is -1 have been found.
CN202110173919.0A 2021-02-09 2021-02-09 Data denoising method based on multi-scale reliable clustering Pending CN112801113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110173919.0A CN112801113A (en) 2021-02-09 2021-02-09 Data denoising method based on multi-scale reliable clustering

Publications (1)

Publication Number Publication Date
CN112801113A true CN112801113A (en) 2021-05-14

Family

ID=75814852

Country Status (1)

Country Link
CN (1) CN112801113A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705605A (en) * 2021-07-20 2021-11-26 中国人民解放军海军大连舰艇学院 Automatic cleaning method for abnormal values of multi-beam sounding data with partial manual intervention
CN114661968A (en) * 2022-05-26 2022-06-24 卡奥斯工业智能研究院(青岛)有限公司 Product data processing method, device and storage medium
CN113705605B (en) * 2021-07-20 2024-05-31 中国人民解放军海军大连舰艇学院 Automatic cleaning method for abnormal values of multi-beam sounding data through partial manual intervention

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240219A1 (en) * 2017-02-22 2018-08-23 Siemens Healthcare Gmbh Denoising medical images by learning sparse image representations with a deep unfolding approach
CN110232420A (en) * 2019-06-21 2019-09-13 安阳工学院 A kind of clustering method of data
CN111367901A (en) * 2020-02-27 2020-07-03 智慧航海(青岛)科技有限公司 Ship data denoising method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination