CN108268876A

CN108268876A - Clustering-based method and device for detecting approximately duplicate records

Info

Publication number: CN108268876A
Application number: CN201611257674.5A
Authority: CN
Inventors: 简宋全; 李青海; 侯大勇; 邹立斌
Original assignee: Guangdong Fine Point Data Polytron Technologies Inc
Current assignee: Guangdong Fine Point Data Polytron Technologies Inc
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2018-07-10

Abstract

The present invention provides a clustering-based method and device for detecting approximately duplicate records. The method comprises: step S1, first performing "coarse" clustering of the approximately duplicate records using Canopy clustering; step S2, performing K-means-based clustering on the points in each Canopy to obtain the approximately duplicate records; and step S3, cleaning the approximately duplicate records. Compared with the prior art, the invention clusters approximately duplicate records with the Canopy clustering method and the K-means clustering method, which ensures relatively high detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing computation and memory requirements. The threshold in the K-means algorithm of the invention is determined by the number of Canopies, which reduces the blindness of threshold selection.

Description

A clustering-based method and device for detecting approximately duplicate records

Technical field

The present invention relates to the technical field of data quality monitoring, and in particular to a clustering-based method and device for detecting approximately duplicate records.

Background art

With the rapid development of information technology, data has become one of the most important resources for realizing business value. However, as data volumes keep growing, data quality problems follow: missing, erroneous, and inconsistent data hinder enterprises' use of their data and in turn cause a crisis of trust. Cleaning such "dirty data" is therefore particularly important, and the core of data cleaning is cleaning approximately duplicate records. Approximately duplicate records arise when one real-world entity is represented by several different records. Causes include misspellings, differing abbreviations, and free-form text strings. A commonly used detection method is the sorted-neighbourhood method: the records in the database are first sorted, neighbouring records are then compared pairwise, and the distance between records is computed to determine whether they are approximately duplicate records.

The above method must sort the entire database, which is computationally expensive, inefficient, and memory-intensive. In addition, the data being sorted contains errors, so approximately duplicate records are not guaranteed to end up in adjacent positions after sorting, and some approximately duplicate records therefore go undetected.
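As a rough illustration of the sorted-neighbourhood approach and of the defect just described, the following Python sketch sorts records by a key and compares only records that fall inside a sliding window. The function names, the window size, the 0.85 similarity cutoff, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the source does not specify them.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighbourhood(records, window=2, threshold=0.85):
    """Return sorted index pairs judged approximately duplicate.

    Records are sorted by their lower-cased text; each record is compared
    only with the `window - 1` records that follow it in sorted order.
    """
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical toy data: records 0, 1 and 4 describe the same person.
records = ["Kathy Jones", "Cathy Jones", "Carl Young", "Dan Lee", "Kathy Jonas"]
pairs = sorted_neighbourhood(records, window=2)
```

In this toy data set "Kathy Jones" and "Cathy Jones" are highly similar, but the differing first letter places them far apart in the sort order, so a window of 2 never compares them and the pair goes undetected.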

In view of the above defects, the inventors arrived at the present invention after long research and practice.

Summary of the invention

To solve the above technical defects, the technical solution adopted by the present invention is to provide a clustering-based method for detecting approximately duplicate records, the method comprising the following steps:

Step S1: first perform "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing;

Step S2: perform K-means-based clustering on the points in each Canopy to obtain the approximately duplicate records;

Step S3: clean the approximately duplicate records.

Preferably, step S1 specifically comprises:

Step S11: put all points in the database into a center point list as candidate center points;

Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance from the center point is less than or equal to threshold T₁ into a Canopy, and delete from the center point list the points whose distance from the center point is less than or equal to threshold T₂;

Step S13: check whether the center point list is empty; if it is empty, end the operation; otherwise, repeat step S12.

Preferably, step S2 specifically comprises:

Step S21: compute the pairwise distances d_ij between the n data points, denoted D = {d_ij : 1 ≤ i, j ≤ n};

Step S22: construct n classes, each containing exactly one data point;

Step S23: find the two closest classes by comparison; if their distance is less than or equal to threshold k, merge the two classes and go to step S24; otherwise go to step S25;

Step S24: compute the distance between the new class and each existing class using the maximum distance method; if the number of classes equals 1, go to step S25; otherwise return to step S23;

Step S25: output the clusters containing two or more records.
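Steps S21 to S25 can be sketched in Python as follows. Although the source labels this stage "K-means-based", the steps as listed (start from singleton classes, repeatedly merge the two closest classes while their distance stays within a threshold, measure class distance by the maximum distance method) read much like what is usually called complete-linkage agglomerative clustering with a distance cutoff, and the sketch follows the listed steps literally. The function name `threshold_cluster` and the toy data are hypothetical, and because the source does not say how k is derived from the number of Canopies, k is simply passed in as a parameter.

```python
from itertools import combinations


def threshold_cluster(points, dist, k):
    """Cluster `points` following steps S21-S25; returns lists of indices."""
    n = len(points)
    # S21: pairwise distances d_ij, stored once per unordered pair.
    d = {(i, j): dist(points[i], points[j]) for i, j in combinations(range(n), 2)}
    # S22: n classes, one data point each.
    clusters = [[i] for i in range(n)]

    def class_distance(a, b):
        # Maximum distance method: the largest member-to-member distance.
        return max(d[min(i, j), max(i, j)] for i in a for j in b)

    # S23/S24: merge the two closest classes until their distance
    # exceeds k or only one class is left.
    while len(clusters) > 1:
        (x, y), best = min(
            (((x, y), class_distance(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > k:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y

    # S25: output only clusters holding two or more records.
    return [sorted(c) for c in clusters if len(c) >= 2]


# Hypothetical one-dimensional data with absolute difference as distance.
result = threshold_cluster([0, 1, 2, 50, 51, 90], lambda a, b: abs(a - b), k=5)
```

Recomputing class distances from the stored pairwise table on every pass keeps the sketch short; within a single Canopy the number of points is assumed to be small enough for this to be acceptable.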

Preferably, the threshold k is determined by the number of Canopies obtained in step S1.

A clustering-based device for detecting approximately duplicate records, comprising:

a Canopy clustering module for performing "coarse" clustering of the approximately duplicate records in the database;

a K-means clustering module for performing K-means clustering on the created Canopies;

an approximately duplicate record processing module for removing approximately duplicate records.

Compared with the prior art, the beneficial effects of the present invention are as follows. The present invention provides a clustering-based method and device for detecting approximately duplicate records that cluster approximately duplicate records with the Canopy clustering method and the K-means clustering method, which ensures relatively high detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing computation and memory requirements. The threshold in the K-means algorithm of the invention is determined by the number of Canopies, which reduces the blindness of threshold selection.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below.

Fig. 1 is a flow diagram of the clustering-based method for detecting approximately duplicate records according to the present invention;

Fig. 2 is a schematic diagram of creating Canopies;

Fig. 3 is a flow diagram of step S1;

Fig. 4 is a flow diagram of step S2;

Fig. 5 is a schematic diagram of the clustering-based device for detecting approximately duplicate records according to the present invention.

Specific embodiment

The above and other technical features and advantages of the present invention are described in more detail below with reference to the drawings.

Fig. 1 shows a flow diagram of the clustering-based method for detecting approximately duplicate records provided by the present invention. The method comprises the following steps:

Step S1: first perform "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing. That is, an approximate distance calculation is used to create a number of overlapping subsets, called Canopies, such that no object is left outside every Canopy.

Specifically, when facing massive data, a center point set is first established. A very low-cost approximate calculation is used to find all data points in the region around a center point, forming one Canopy; the region around the next center point is then found, likewise forming a Canopy, and so on until all data points in the center point set have been traversed. To ensure that every point falls within some Canopy, the range around each center point should be set as wide as possible; two distance thresholds T₁ and T₂ (T₁ ≥ T₂) are used here.

Fig. 2 is a schematic diagram of creating Canopies, in which each circle represents one Canopy. Point A is first selected at random as a center point, and a low-cost approximate distance calculation is used to compute the distances between record points. The points whose distance from A is less than or equal to T₁ are put into a Canopy; in the figure, 10 points are put into the first Canopy. The points whose distance from A is less than or equal to T₂ are deleted from the list; in Fig. 2 these are the 4 points nearest to A. The other Canopies are created in the same way. As Fig. 2 shows, if the database points are relatively dense, the created Canopies may overlap. Fig. 2 contains five clusters in total, each drawn with a different type of point. For each cluster there is at least one Canopy that contains it completely, and the accurate distance calculation is applied only to data points in the same Canopy. Because a Canopy contains far fewer points than the entire data set, the amount of computation is greatly reduced. In addition, the Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing computation and memory requirements.
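The source does not name the low-cost approximate distance or the accurate distance. As one possible pairing for text records, the Python sketch below uses a character-bigram Jaccard distance as the cheap measure and Levenshtein edit distance as the accurate one. Both choices, and the function names, are assumptions for illustration only.

```python
def bigrams(s):
    """Set of lower-cased character bigrams of a string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def cheap_distance(a, b):
    """Jaccard distance on bigram sets: set operations only, no alignment."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


near = cheap_distance("John Smith", "Jon Smith")    # small: many shared bigrams
far = cheap_distance("John Smith", "Alice Brown")   # no shared bigrams
```

Under this pairing the cheap measure would decide Canopy membership across the whole data set, while the quadratic-time edit distance would be evaluated only for pairs of records that share a Canopy.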

Fig. 3 is a flow diagram of step S1. Step S1 specifically comprises:

Step S11: put all points in the database into a center point list as candidate center points.

Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance from the center point is less than or equal to threshold T₁ into a Canopy, and delete from the center point list the points whose distance from the center point is less than or equal to threshold T₂. If point A has ever been within distance T₂ of some Canopy, point A must be deleted from the center point list: point A is then considered close enough to that Canopy and can no longer serve as the center of another Canopy.
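The Canopy creation procedure of step S1 can be sketched in Python as below. The function name `build_canopies`, the fixed random seed, and the toy data are hypothetical. The sketch also assumes that Canopy members are drawn from all points in the database (so that Canopies may overlap), while only the candidate center list shrinks; the source's wording is consistent with this reading but does not state it explicitly.

```python
import random


def build_canopies(points, cheap_dist, t1, t2, seed=0):
    """Return a list of (center_index, member_indices) pairs."""
    if t1 < t2:
        raise ValueError("expected T1 >= T2")
    rng = random.Random(seed)
    # S11: every point starts as a candidate center.
    candidates = list(range(len(points)))
    canopies = []
    # S13: repeat until the candidate center list is empty.
    while candidates:
        # S12: pick a random center A from the remaining candidates.
        center = rng.choice(candidates)
        # Points within T1 of A join this Canopy (overlap is allowed).
        members = [i for i in range(len(points))
                   if cheap_dist(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within T2 of A (including A itself) can no longer
        # become centers of other Canopies.
        candidates = [i for i in candidates
                      if cheap_dist(points[center], points[i]) > t2]
    return canopies


# Hypothetical one-dimensional data with absolute difference as distance.
points = [0, 1, 2, 10, 11, 30]
canopies = build_canopies(points, lambda a, b: abs(a - b), t1=3, t2=1.5)
```

Because each chosen center removes at least itself from the candidate list, the loop terminates, and every point ends up in at least one Canopy: a point is either chosen as a center or removed because it lies within T₂ (and hence within T₁) of one.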

Step S2: perform K-means-based clustering on the points in each Canopy to obtain the approximately duplicate records.

The present invention uses the k-means clustering method; that is, if the distance between the two closest clusters is greater than threshold k, or the number of clusters is 1, the clustering operation terminates. The threshold k is determined by the number of Canopies from step S1, which reduces the blindness of threshold selection.

The K-means algorithm is a partition-based clustering algorithm whose basic task is to divide a data set into several disjoint clusters such that similarity within a cluster is high and similarity between clusters is low. No similarity is computed between objects that do not belong to the same Canopy; the similarity between an object and the objects of another Canopy is taken to be the distance from that point to the center of that Canopy. Suppose Canopy_i contains n points (n ≥ 2), and d_i, d_j (1 ≤ i, j ≤ n) are any two points in Canopy_i. The specific clustering steps are as follows:

Fig. 4 is a flow diagram of step S2. Step S2 specifically comprises:

Step S25: output the clusters containing two or more records.

Through the above steps, the approximately duplicate records in the database are grouped into classes, ensuring that approximately duplicate records fall in the same cluster.

Step S3: clean the approximately duplicate records.

After the Canopy clustering and K-means clustering described above, the data records in the same cluster can be regarded as approximately duplicate records. Only the one data record nearest to the center point is retained; the other data records are regarded as approximately duplicate records and deleted, which completes the cleaning of approximately duplicate records.
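The cleaning step can be sketched in Python as follows. The source says to keep the record nearest to the center point but does not define that center for a final cluster; this sketch assumes the cluster medoid, i.e. the member with the smallest total distance to the other members, which is an interpretation rather than the source's stated rule. The function name `clean_duplicates` and the toy data are hypothetical.

```python
def clean_duplicates(records, clusters, dist):
    """Keep one representative per cluster and drop the other members.

    `clusters` is a list of lists of record indices, e.g. the output of
    the clustering stage. Records outside every cluster are kept as-is.
    """
    to_delete = set()
    for cluster in clusters:
        # Assumed notion of "nearest to the center": the medoid.
        # Ties are resolved in favour of the earliest index in the cluster.
        keep = min(cluster,
                   key=lambda i: sum(dist(records[i], records[j]) for j in cluster))
        to_delete.update(i for i in cluster if i != keep)
    return [r for i, r in enumerate(records) if i not in to_delete]


# Hypothetical one-dimensional data with absolute difference as distance.
records = [0, 1, 2, 50, 51, 90]
cleaned = clean_duplicates(records, [[0, 1, 2], [3, 4]], lambda a, b: abs(a - b))
```

In the first cluster the middle record has the smallest total distance to its neighbours and is kept; the record with no duplicates (the last one) is untouched.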

Fig. 5 is a schematic diagram of the clustering-based device for detecting approximately duplicate records according to the present invention. The device comprises a Canopy clustering module, a K-means clustering module, and an approximately duplicate record processing module. The Canopy clustering module performs "coarse" clustering of the approximately duplicate records in the database; the K-means clustering module performs K-means clustering on the created Canopies; and the approximately duplicate record processing module removes approximately duplicate records, improving data quality.

The present invention provides a clustering-based method and device for detecting approximately duplicate records that cluster approximately duplicate records with the Canopy clustering method and the K-means clustering method, which ensures relatively high detection accuracy and improves detection efficiency. The Canopies created by the invention are not too large and overlap little, which greatly reduces the number of objects whose similarity must subsequently be computed, thereby reducing computation and memory requirements. The threshold in the K-means algorithm of the invention is determined by the number of Canopies, which reduces the blindness of threshold selection.

Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A clustering-based method for detecting approximately duplicate records, characterized in that the method comprises the following steps:

Step S1: first perform "coarse" clustering of the approximately duplicate records in the database using Canopy clustering, i.e. data preprocessing;

Step S3: clean the approximately duplicate records.

2. The clustering-based method for detecting approximately duplicate records according to claim 1, characterized in that step S1 specifically comprises:

Step S12: randomly select a point from the list as the center point, denoted point A; put the points whose distance from the center point is less than or equal to threshold T₁ into a Canopy, and delete from the center point list the points whose distance from the center point is less than or equal to threshold T₂;

Step S13: check whether the center point list is empty; if it is empty, the operation ends; otherwise, repeat step S12.

3. The clustering-based method for detecting approximately duplicate records according to claim 1, characterized in that step S2 specifically comprises:

Step S23: find the two closest classes by comparison; if their distance is less than or equal to threshold k, merge the two classes and go to step S24; otherwise go to step S25;

Step S24: compute the distance between the new class and each existing class using the maximum distance method; if the number of classes equals 1, go to step S25; otherwise return to step S23;

Step S25: output the clusters containing two or more records.

4. The clustering-based method for detecting approximately duplicate records according to claim 3, characterized in that the threshold k is determined by the number of Canopies obtained in step S1.

5. A clustering-based device for detecting approximately duplicate records, characterized in that it comprises: