CN108268876A - A kind of detection method and device of the approximately duplicate record based on cluster - Google Patents

A kind of detection method and device of the approximately duplicate record based on cluster

Info

Publication number
CN108268876A
Authority
CN
China
Prior art keywords
cluster
canopy
duplicate record
point
approximately duplicate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611257674.5A
Other languages
Chinese (zh)
Inventor
简宋全
李青海
侯大勇
邹立斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Fine Point Data Polytron Technologies Inc
Original Assignee
Guangdong Fine Point Data Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Fine Point Data Polytron Technologies Inc filed Critical Guangdong Fine Point Data Polytron Technologies Inc
Priority to CN201611257674.5A priority Critical patent/CN108268876A/en
Publication of CN108268876A publication Critical patent/CN108268876A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a clustering-based method and device for detecting approximately duplicate records. The method comprises: step S1, first performing a "coarse" clustering of the approximately duplicate records using Canopy clustering; step S2, performing K-means-based clustering on the points within each Canopy to obtain the approximately duplicate records; step S3, cleaning the approximately duplicate records. Compared with the prior art, the invention clusters approximately duplicate records with the Canopy and K-means clustering methods, which ensures higher detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing the amount of computation and the memory requirement. The threshold in the K-means algorithm is determined by the number of Canopies, which reduces the blindness of threshold selection.

Description

A kind of detection method and device of the approximately duplicate record based on cluster
Technical field
The present invention relates to the technical field of data quality monitoring, and in particular to a clustering-based method and device for detecting approximately duplicate records.
Background art
With the rapid development of information technology, data has increasingly become one of the most important resources for realizing business value. However, as data volumes keep growing, data quality problems follow: missing, erroneous, and inconsistent data hinder an enterprise's use of its data and in turn cause a crisis of trust. Cleaning this "dirty data" is therefore particularly important, and the core of data cleaning is the cleaning of approximately duplicate records. Approximately duplicate records arise when one real-world entity is represented by several different records. Their causes include misspellings, differing abbreviations, and free-form text strings. A commonly used detection method is the sorted-neighborhood method: the records in the database are first sorted, neighboring records are then compared pairwise, and the distance between records is computed to determine whether they are approximately duplicate records.
This detection method requires sorting the entire database, which is computationally expensive, inefficient, and demanding in memory. Moreover, the data being sorted contains errors, so approximately duplicate records are not guaranteed to end up in adjacent positions after sorting, and some approximately duplicate records may therefore go undetected.
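For illustration, the sorted-neighborhood baseline described above can be sketched as follows. This is a minimal sketch, not the method of any particular system: the sort key (the lower-cased record text), the `window` size, and the `threshold` value are all assumptions, and `difflib.SequenceMatcher` from the Python standard library stands in for the record distance, which the text does not specify.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, window=3, threshold=0.85):
    """Return index pairs of records judged approximately duplicate.

    Records are sorted by their lower-cased text, and each record is
    compared only with the next `window - 1` records in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            a, b = records[i].lower(), records[j].lower()
            # Similarity in [0, 1]; an assumed stand-in for a record distance.
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: record 3 has lost its first character.
records = ["John Smith", "Jon Smith", "Mary Jones", "ohn Smith", "Mary Jonse"]
print(sorted_neighborhood(records))
```

In this sample, "ohn Smith" sorts to the far end of the list, away from "John Smith", so the two are never compared within the window. This is the weakness noted above: an error in the sort key can separate records that are in fact near-duplicates.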
In view of the above drawbacks, the inventors arrived at the present invention after prolonged research and practice.
Summary of the invention
To solve the above technical drawbacks, the technical solution adopted by the present invention is to provide a clustering-based method for detecting approximately duplicate records, the method comprising the following steps:
Step S1: first perform a "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing;
Step S2: perform K-means-based clustering on the points within each Canopy to obtain the approximately duplicate records;
Step S3: clean the approximately duplicate records.
Preferably, step S1 specifically comprises:
Step S11: put all points in the database into a center-point list as candidate center points;
Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance to the center point is less than or equal to threshold T1 into a Canopy, and delete from the center-point list the points whose distance to the center point is less than or equal to threshold T2;
Step S13: check whether the center-point list is empty; if it is empty, end the operation; otherwise, repeat step S12.
Preferably, step S2 specifically comprises:
Step S21: compute the pairwise distances d_ij between the n data points, denoted D = {d_ij : 1 ≤ i, j ≤ n};
Step S22: construct n classes, each containing exactly one data point;
Step S23: find the two closest classes by comparison; if their distance is less than or equal to threshold k, merge the two classes and go to step S24; otherwise go to step S25;
Step S24: compute the distances between the new class and the existing classes using the maximum-distance method; if the number of classes equals 1, go to step S25; otherwise return to step S23;
Step S25: output the clusters that contain two or more records.
Preferably, the threshold k is determined by the number of Canopies obtained in step S1.
A clustering-based device for detecting approximately duplicate records, comprising:
a Canopy clustering module, for performing "coarse" clustering of the approximately duplicate records in the database;
a K-means clustering module, for performing K-means clustering on the created Canopies;
an approximately-duplicate-record processing module, for removing approximately duplicate records.
Compared with the prior art, the beneficial effects of the present invention are as follows. The present invention provides a clustering-based method and device for detecting approximately duplicate records that clusters approximately duplicate records with the Canopy and K-means clustering methods, which ensures higher detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing the amount of computation and the memory requirement. The threshold in the K-means algorithm is determined by the number of Canopies, which reduces the blindness of threshold selection.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.
Fig. 1 is a flow diagram of the clustering-based method for detecting approximately duplicate records according to the present invention;
Fig. 2 is a schematic diagram of creating Canopies;
Fig. 3 is a flow diagram of step S1;
Fig. 4 is a flow diagram of step S2;
Fig. 5 is a schematic diagram of the clustering-based device for detecting approximately duplicate records according to the present invention.
Detailed description of the embodiments
The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.
Fig. 1 is a flow diagram of the clustering-based method for detecting approximately duplicate records provided by the present invention. The method comprises the following steps:
Step S1: first perform a "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing. That is, an approximate distance calculation is used to create a number of overlapping subsets, called Canopies, such that no object is left outside every Canopy.
Specifically, when facing massive data, a center-point set is first set up. A very cheap approximate calculation method is used to find all data points in the region around a center point, forming a Canopy; the region around the next center point is then found and likewise forms a Canopy, and so on until all data points in the center-point set have been traversed. To ensure that every point ends up in some Canopy, the range around each center point should be set as wide as possible; two distance thresholds T1 and T2 (T1 ≥ T2) are used here.
Fig. 2 is a schematic diagram of creating Canopies, in which each circle represents one Canopy. First, point A is a randomly selected center point. A cheap, approximate distance calculation is chosen to compute the distances between record points. The points whose distance to A is less than or equal to T1 are put into one Canopy; in the figure, 10 points are put into the first Canopy. The points whose distance to A is less than or equal to T2 are deleted from the list; in Fig. 2 these are the 4 points near A. The other Canopies are created in the same way. As Fig. 2 shows, if the points in the database are relatively dense, the Canopies created may overlap. There are five clusters in Fig. 2, each drawn with a different type of point. For each cluster there is at least one Canopy that contains it completely, and the accurate distance calculation is applied only to data points within the same Canopy. Because a Canopy holds far fewer points than the whole data set, the amount of computation drops substantially. In addition, the Canopies created by the present invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing the amount of computation and the memory requirement.
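The text distinguishes a cheap approximate distance, used to build Canopies, from an accurate distance, used only inside a Canopy, but it names neither. The sketch below shows one possible pairing for text records, purely as an assumption: a Jaccard distance over character bigrams as the cheap measure, and the Levenshtein edit distance as the accurate one. The function names are hypothetical.

```python
def bigrams(text):
    """Set of adjacent character pairs in the lower-cased text."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def cheap_distance(a, b):
    """Jaccard distance on character bigrams; needs only set operations."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 0.0
    return 1.0 - len(ga & gb) / len(ga | gb)


def exact_distance(a, b):
    """Levenshtein edit distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


print(cheap_distance("John Smith", "Jon Smith"))
print(exact_distance("John Smith", "Jon Smith"))
```

The design point is only that the cheap measure should be much less costly than the accurate one while still placing true near-duplicates close together, so that the accurate measure runs on a small fraction of all record pairs.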
Fig. 3 is a flow diagram of step S1. Step S1 specifically comprises:
Step S11: put all points in the database into a center-point list as candidate center points.
Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance to the center point is less than or equal to threshold T1 into a Canopy, and delete from the center-point list the points whose distance to the center point is less than or equal to threshold T2. If point A has ever been within threshold T2 of some Canopy, point A must be deleted from the center-point list: point A is then considered close enough to that Canopy and can no longer serve as the center of another Canopy.
Step S13: check whether the center-point list is empty; if it is empty, end the operation; otherwise, repeat step S12.
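Steps S11 to S13 can be sketched as follows. This is a minimal illustration under stated assumptions: points are referred to by index, `cheap_distance` is any caller-supplied inexpensive distance function, and the `seed` parameter is added only to make the random choice of centers reproducible. None of these names come from the source.

```python
import random


def build_canopies(points, cheap_distance, t1, t2, seed=0):
    """Group points into possibly overlapping canopies (steps S11 to S13).

    t1 is the loose threshold for canopy membership and t2 the tight
    threshold for removal from the candidate-center list (t1 >= t2).
    Returns a list of (center_index, member_indices) pairs.
    """
    if t1 < t2:
        raise ValueError("t1 must be greater than or equal to t2")
    rng = random.Random(seed)
    candidates = list(range(len(points)))      # S11: every point is a candidate center
    canopies = []
    while candidates:                          # S13: stop once the list is empty
        center = rng.choice(candidates)        # S12: pick a random center A
        # Points within t1 of A join the canopy, even if they already
        # belong to another canopy; this is what allows overlap.
        members = [i for i in range(len(points))
                   if cheap_distance(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within t2 of A (including A itself) can no longer be centers.
        candidates = [i for i in candidates
                      if cheap_distance(points[center], points[i]) > t2]
    return canopies


# Hypothetical one-dimensional data with absolute difference as the distance.
points = [0.0, 0.5, 1.0, 5.0, 5.4, 9.0]
for center, members in build_canopies(points, lambda a, b: abs(a - b), t1=1.5, t2=1.0):
    print(center, members)
```

Because a point leaves the candidate list only when it lies within T2 of a chosen center, and T2 ≤ T1, every point ends up in at least one canopy, which matches the guarantee stated for step S1.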
Step S2: perform K-means-based clustering on the points within each Canopy to obtain the approximately duplicate records.
The present invention uses the k-means clustering method, in which the clustering operation ends when the distance between the two closest clusters exceeds threshold k or the number of clusters is 1. The threshold k is determined by the number of Canopies obtained in step S1, which reduces the blindness of threshold selection.
The K-means algorithm is a partition-based clustering algorithm whose basic task is to divide a data set into several disjoint clusters so that similarity within a cluster is high and similarity between clusters is low. No similarity is computed between objects that do not share a Canopy; the similarity between an object and the objects of another Canopy is taken to be the distance from that point to the center of that Canopy. Suppose Canopy_i contains n points (n ≥ 2), and d_i, d_j (1 ≤ i, j ≤ n) are any two points in Canopy_i. The specific clustering steps are as follows:
Fig. 4 is a flow diagram of step S2. Step S2 specifically comprises:
Step S21: compute the pairwise distances d_ij between the n data points, denoted D = {d_ij : 1 ≤ i, j ≤ n};
Step S22: construct n classes, each containing exactly one data point;
Step S23: find the two closest classes by comparison; if their distance is less than or equal to threshold k, merge the two classes and go to step S24; otherwise go to step S25;
Step S24: compute the distances between the new class and the existing classes using the maximum-distance method; if the number of classes equals 1, go to step S25; otherwise return to step S23;
Step S25: output the clusters that contain two or more records.
Through the above steps, the approximately duplicate records in the database are grouped into classes, ensuring that approximately duplicate records fall in the same cluster.
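Steps S21 to S25 can be sketched as below. Note that, as written, the steps (start from singleton classes, repeatedly merge the two closest classes, measure class distance by the maximum distance, stop at a distance threshold) correspond to agglomerative clustering with complete linkage, rather than to the textbook K-means procedure of iteratively reassigning points to k centroids. The sketch follows the steps as written. The function name is hypothetical, and how k is derived from the number of Canopies is not specified in the text, so k is simply passed in.

```python
def threshold_clusters(points, distance, k):
    """Merge the closest classes until their distance exceeds k (S21 to S25).

    Returns the classes that contain at least two points, as sorted
    lists of point indices.
    """
    n = len(points)
    # S21: pairwise distance matrix D = {d_ij}.
    d = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]
    # S22: n classes, one point each.
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Maximum-distance method: the farthest pair defines class distance.
        return max(d[i][j] for i in a for j in b)

    while len(clusters) > 1:                   # S24: stop when one class is left
        # S23: find the two closest classes (distances are recomputed here
        # for simplicity instead of being updated incrementally).
        best, pa, pb = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best > k:
            break                              # closest pair is too far apart
        clusters[pa] = clusters[pa] + clusters[pb]
        del clusters[pb]                       # pb > pa, so pa stays valid
    # S25: only classes with two or more records are reported.
    return [sorted(c) for c in clusters if len(c) >= 2]


# Hypothetical one-dimensional data with absolute difference as the distance.
points = [1.0, 1.2, 1.1, 4.0, 4.3, 9.0]
print(threshold_clusters(points, lambda a, b: abs(a - b), k=0.5))
```

In the intended pipeline this function would be applied to the points of one Canopy at a time, so the quadratic distance matrix stays small.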
Step S3: clean the approximately duplicate records.
After the Canopy clustering and K-means clustering described above, the data records within the same cluster can be regarded as approximately duplicate records. Only the one data record nearest to the center point is retained; the other data records are regarded as approximately duplicate records and deleted, which completes the cleaning of approximately duplicate records.
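The cleaning step can be sketched as follows. The text does not define how the "center point" of a cluster is computed, so this sketch assumes the medoid, i.e. the member with the smallest total distance to the other members, which works for any record type that has a distance function. The function name and this choice of center are assumptions for illustration only.

```python
def clean_duplicates(records, clusters, distance):
    """Keep one representative per duplicate cluster and drop the rest.

    `clusters` is a list of index lists, each holding records judged to
    be approximate duplicates of one another.
    """
    drop = set()
    for cluster in clusters:
        # Assumption: the cluster "center" is approximated by its medoid,
        # the member with the smallest total distance to all members.
        # Ties are resolved in favor of the first member listed.
        keep = min(
            cluster,
            key=lambda i: sum(distance(records[i], records[j]) for j in cluster),
        )
        drop.update(i for i in cluster if i != keep)
    return [r for i, r in enumerate(records) if i not in drop]


# Hypothetical data and clusters, e.g. as produced by step S2.
records = [1.0, 1.2, 1.1, 4.0, 4.3, 9.0]
clusters = [[0, 1, 2], [3, 4]]
print(clean_duplicates(records, clusters, lambda a, b: abs(a - b)))
```

Records that belong to no cluster, such as the last one in this sample, are left untouched.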
Fig. 5 is a schematic diagram of the clustering-based device for detecting approximately duplicate records according to the present invention. The device comprises a Canopy clustering module, a K-means clustering module, and an approximately-duplicate-record processing module. The Canopy clustering module performs "coarse" clustering of the approximately duplicate records in the database; the K-means clustering module performs K-means clustering on the created Canopies; the approximately-duplicate-record processing module removes approximately duplicate records, improving data quality.
The present invention provides a clustering-based method and device for detecting approximately duplicate records that clusters approximately duplicate records with the Canopy and K-means clustering methods, which ensures higher detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing the amount of computation and the memory requirement. The threshold in the K-means algorithm is determined by the number of Canopies, which reduces the blindness of threshold selection.
Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A clustering-based method for detecting approximately duplicate records, characterized in that the method comprises the following steps:
Step S1: first perform a "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing;
Step S2: perform K-means-based clustering on the points within each Canopy to obtain the approximately duplicate records;
Step S3: clean the approximately duplicate records.
2. The clustering-based method for detecting approximately duplicate records according to claim 1, characterized in that step S1 specifically comprises:
Step S11: put all points in the database into a center-point list as candidate center points;
Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance to the center point is less than or equal to threshold T1 into a Canopy, and delete from the center-point list the points whose distance to the center point is less than or equal to threshold T2;
Step S13: check whether the center-point list is empty; if it is empty, the operation ends; otherwise, repeat step S12.
3. The clustering-based method for detecting approximately duplicate records according to claim 1, characterized in that step S2 specifically comprises:
Step S21: compute the pairwise distances d_ij between the n data points, denoted D = {d_ij : 1 ≤ i, j ≤ n};
Step S22: construct n classes, each containing exactly one data point;
Step S23: find the two closest classes by comparison; if their distance is less than or equal to threshold k, merge the two classes and go to step S24; otherwise go to step S25;
Step S24: compute the distances between the new class and the existing classes using the maximum-distance method; if the number of classes equals 1, go to step S25; otherwise return to step S23;
Step S25: output the clusters that contain two or more records.
4. The clustering-based method for detecting approximately duplicate records according to claim 3, characterized in that the threshold k is determined by the number of Canopies obtained in step S1.
5. A clustering-based device for detecting approximately duplicate records, characterized in that it comprises:
a Canopy clustering module, for performing "coarse" clustering of the approximately duplicate records in the database;
a K-means clustering module, for performing K-means clustering on the created Canopies;
an approximately-duplicate-record processing module, for removing approximately duplicate records.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611257674.5A CN108268876A (en) 2016-12-30 2016-12-30 A kind of detection method and device of the approximately duplicate record based on cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611257674.5A CN108268876A (en) 2016-12-30 2016-12-30 A kind of detection method and device of the approximately duplicate record based on cluster

Publications (1)

Publication Number Publication Date
CN108268876A true CN108268876A (en) 2018-07-10

Family

ID=62754649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611257674.5A Pending CN108268876A (en) 2016-12-30 2016-12-30 A kind of detection method and device of the approximately duplicate record based on cluster

Country Status (1)

Country Link
CN (1) CN108268876A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986296A (en) * 2010-10-28 2011-03-16 浙江大学 Noise data cleaning method based on semantic ontology
CN102135995A (en) * 2011-03-17 2011-07-27 新太科技股份有限公司 Extract transform and load (ETL) data cleaning design method
CN102982489A (en) * 2012-11-23 2013-03-20 广东电网公司电力科学研究院 Power customer online grouping method based on mass measurement data
CN103336771A (en) * 2013-04-02 2013-10-02 江苏大学 Data similarity detection method based on sliding window
CN103793504A (en) * 2014-01-24 2014-05-14 北京理工大学 Cluster initial point selection method based on user preference and project properties
CN104298858A (en) * 2014-09-19 2015-01-21 南京邮电大学 Method for partitioning map in RoboCup rescue platform based on cluster and convex hull
CN104699796A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Data cleaning method based on data warehouse
CN104850624A (en) * 2015-05-20 2015-08-19 华东师范大学 Similarity evaluation method of approximately duplicate records

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109615017A (en) * 2018-12-21 2019-04-12 大连海事大学 Consider the Stack Overflow replication problem detection method of more reference factors
CN109615017B (en) * 2018-12-21 2021-06-29 大连海事大学 Stack Overflow repeated problem detection method considering multiple reference factors
CN110232398A (en) * 2019-04-24 2019-09-13 广东交通职业技术学院 A kind of road network sub-area division and its appraisal procedure based on Canopy+Kmeans cluster
CN115829143A (en) * 2022-12-15 2023-03-21 广东慧航天唯科技有限公司 Water environment treatment prediction system and method based on time-space data cleaning technology

Similar Documents

Publication Publication Date Title
CN110245981B (en) Crowd type identification method based on mobile phone signaling data
CN108415975B (en) BDCH-DBSCAN-based taxi passenger carrying hot spot identification method
CN108764984A (en) A kind of power consumer portrait construction method and system based on big data
CN106202569A (en) A kind of cleaning method based on big data quantity
CN106530132A (en) Power load clustering method and device
Witayangkurn et al. Anomalous event detection on large-scale gps data from mobile phones using hidden markov model and cloud platform
CN106709035A (en) Preprocessing system for electric power multi-dimensional panoramic data
CN109800431B (en) Event information keyword extracting and monitoring method and system and storage and processing device
CN104598632B (en) Focus incident detection method and device
CN107273912A (en) A kind of Active Learning Method based on three decision theories
CN106055621A (en) Log retrieval method and device
CN108268876A (en) A kind of detection method and device of the approximately duplicate record based on cluster
CN112800115B (en) Data processing method and data processing device
CN106530685B (en) A kind of traffic data Forecasting Approach for Short-term and device
CN103336771A (en) Data similarity detection method based on sliding window
CN109446243A (en) A method of it is abnormal based on big data analysis detection photovoltaic power station power generation
CN106528705A (en) Repeated record detection method and system based on RBF neural network
CN109257383A (en) A kind of BGP method for detecting abnormality and system
CN104182539A (en) Abnormal information batch processing method and system
Yadamjav et al. Querying recurrent convoys over trajectory data
Sundarakumar et al. A heuristic approach to improve the data processing in big data using enhanced Salp Swarm algorithm (ESSA) and MK-means algorithm
CN107133335A (en) A kind of repetition record detection method based on participle and index technology
CN108021484A (en) The extension method and its system of disk life expectancy value in cloud service system
CN109522934A (en) A kind of power consumer clustering method based on clustering algorithm
CN108763283A (en) A kind of unbalanced dataset oversampler method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180710