GB2378789A - Removal of duplicates from large data sets - Google Patents

Removal of duplicates from large data sets

Info

Publication number
GB2378789A
GB2378789A GB0210164A
Authority
GB
United Kingdom
Prior art keywords
subsets
data set
processors
records
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0210164A
Other versions
GB0210164D0 (en)
Inventor
Sara Bearder
Jens Rasch
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DATACTICS Ltd
Original Assignee
DATACTICS Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DATACTICS Ltd filed Critical DATACTICS Ltd
Publication of GB0210164D0 publication Critical patent/GB0210164D0/en
Publication of GB2378789A publication Critical patent/GB2378789A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Data sets S and T are compared to remove records which are, within predefined limits, duplicate records. The data set S is divided into μ subsets and the data set T into ν subsets, at least one of μ and ν being greater than unity. The subsets of the first data set S are then compared with the subsets of the second data set T by use of parallel processors such that at least some of the comparisons are carried out simultaneously.

Description

"Removal of Duplicates from Large Data Sets"

This invention relates to locating duplicate records in large data sets.
As used herein, "data set" means a number of records each of which has a number of defined fields, and "duplicate" means a record which is, within predefined limits, the same as another record.
Background to the invention

In recent years there has been an explosion in the size and availability of databases. There is also an increasing desire to squeeze as much information as possible from data sets, particularly for client profiling, marketing and business analysis. There is no great technical difficulty in patching together data from different data sets. However, the problem that then arises is duplication. Due to human nature, when people fill in forms they can allow quite a large breadth of variation from entry to entry, let alone problems caused by deliberate variation or by address or name changes.
There are two important problems caused by such duplication. Firstly, multiple entries for the same individual may enable fraud. Secondly, it destroys the ability to keep good quality data and then marketing may result in duplicate letters being sent, which is detrimental to company image and leads to extra mailing costs.
It is known to remove duplicates from, or "deduplicate", a data set. It is relatively simple to establish a set of rules defining the degree of variation which will allow two records to be considered as distinct, or as duplicates. The difficulty is then that each record must be compared with every other record if a complete deduplication is to be achieved, and this is extremely costly in processor memory and time.
It has been proposed to effect a comparison which is more economical of processor time but which is not a complete comparison. For example, US 5,497,486 describes a process in which records are first sorted or clustered in accordance with a selected key, and the sorted or clustered records are then scanned with a relatively small window. Such methods have the disadvantages that accuracy depends on the relevance and accuracy of the chosen sorting key, and that the comparison is not total.
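The window-scan idea can be sketched in a few lines. This is our own minimal reconstruction of the general approach described above, not the patented method itself; the sort key, window width and exact-equality comparison rule are illustrative assumptions.

```python
# Minimal sketch (our construction) of a sorted-window deduplication scan:
# sort the records by a chosen key, then compare each record only with the
# next w - 1 records in sorted order, rather than with every other record.

def window_duplicates(records, key, w=3):
    """Return pairs judged duplicates; only neighbours inside the window are compared."""
    ordered = sorted(records, key=key)
    found = []
    for i, x in enumerate(ordered):
        for y in ordered[i + 1:i + w]:
            if x == y:                     # stand-in for a fuzzy comparison rule
                found.append((x, y))
    return found

names = ["Cruise", "Bardot", "Cruise", "Fonda", "Cruise"]
pairs = window_duplicates(names, key=lambda r: r, w=3)
assert len(pairs) == 3   # the three Cruise entries sort together, giving 3 pairs
```

With exact equality the sort places duplicates adjacently, so nothing is missed here; with a poorly chosen key and fuzzy matching, near-duplicates sorted far apart escape the window, which is the accuracy limitation noted above.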
Summary of the invention

The present invention provides a method of comparing two data sets, each of which contains a number of records, to detect duplicate records, the method comprising: dividing the first data set into μ subsets; dividing the second data set into ν subsets, at least one of μ and ν being greater than 1; and comparing each subset of the first set with each subset of the second set, at least some of said comparisons being carried out simultaneously in parallel.
Preferably, all of said comparisons are carried out simultaneously in parallel.
The subsets of the first data set preferably contain approximately equal numbers of records, and the subsets of the second data set preferably contain approximately equal numbers of records.
In a preferred form of the invention, said comparisons are carried out by parallel processors numbering π = μν processors, and each processor α, where 0 ≤ α ≤ π − 1, obtains the subsets s_i with i = ⌊α/ν⌋ and t_j with j = α mod ν.
Said two data sets will typically consist of two different sets. However, the invention may also be applied to a single data set, by comparing it with itself.
From another aspect, the invention provides a data processing system for use in removing duplicate records from first and second data sets, each data set comprising a plurality of records, the system comprising: computing means operable to divide the first data set into μ subsets and to divide the second data set into ν subsets, at least one of μ and ν being greater than 1; and a plurality of processors, each of which is operable to compare a subset of the first data set with a subset of the second data set to determine whether these are within a predefined definition of similarity; said processors operating in parallel whereby at least some of said comparisons are carried out simultaneously.
Preferably, said processors number π = μν, and each processor α, where 0 ≤ α ≤ π − 1, is operable to obtain and compare the subsets s_i with i = ⌊α/ν⌋ and t_j with j = α mod ν.
Description of preferred examples

Examples of the present invention will now be described.
In the following we shall deal with two data sets labelled S and T, where the task is to find duplicates between S and T, that is, records which appear in both S and T. Furthermore, a record can appear more than once in S and in T, but must appear at least once in each data set to qualify as a duplicate.
In addition, the data sets S and T do not have to be distinct: they can be the same data set, in which case one would be looking for multiple records in the same data set. In the following we shall refer to two data sets S and T and imply that S and T may be identical or non-identical, except where otherwise stated.
We shall assume for the following analysis that the time taken to compare one record with another is the same for all records. This is a good approximation of real comparison algorithms.
In order to estimate the time required for the deduplication, consider the case of comparing the data set S consisting of m records with the data set T consisting of n records, as in Table 1.
TABLE 1

S                    T
Mr. G. Grant         Mr. S. Stallone
Mrs. B. Bardot       Mr. T. Cruise
Mr. H. Bogart        Mrs. N. Kidman
:                    Mr. B. Willis
:                    Mrs. S. Weaver
Mr. J. Wayne         :
                     Mrs. J. Fonda
(m records)          (n records)

In this case we have to compare all m records of S with all n records of T. Since we have assumed that the time for each comparison is the same, the time t required to do a complete comparison of all records is proportional to the product of the numbers of records, i.e. t ∝ mn. This means on the one hand that the time required to check for a single new entry in a given database increases linearly with the size of that database. On the other hand, it means that the time for the deduplication of an existing data set (i.e. where S and T are the same), where every entry has to be compared with every other entry, is proportional to n²; i.e. doubling the size of the database results in a quadrupling of the time required. This is a limit that even a program with a very fast record comparison algorithm cannot overcome, and the process will take prohibitively long for large data sets. In addition, a large main computer memory (RAM) will be required to work on large data sets.
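As a quick illustration of this scaling argument (a sketch of ours, not part of the patent), the comparison count is simply the product mn, so doubling n in the self-deduplication case quadruples the work:

```python
def total_comparisons(m, n):
    """Comparisons needed for a complete check of S (m records) against T (n records),
    assuming every record of S is compared with every record of T."""
    return m * n

assert total_comparisons(4, 10) == 40   # the worked example later in the text
# Self-deduplication (S = T): doubling n quadruples the work, t ~ n^2.
assert total_comparisons(2000, 2000) == 4 * total_comparisons(1000, 1000)
```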
The fact that the time for deduplication is of the order of n2 cannot be overcome on a single processor machine. In order to overcome this limitation and speed up the time required, the deduplication process needs to be parallelised. This means that for given data sets S and T the deduplication is executed on some number of processors simultaneously, leading to a dramatic reduction in the time required. A further advantage is that the data sets can be split up and spread over several computers so that all the available main memory of the computers involved can be harnessed.
General example of preferred algorithm

There will now be described a novel algorithm for the deduplication of two data sets on multiple processor machines or computers.
In this section of the present description it is assumed that the processors are suitably interconnected such that it is possible for a given processor to send information of any type to any other participating processor. Nothing is assumed about the physical installation of these processors. They could, for example, be individual computers connected by electronic or fibre-optic link via intranets or the internet, or they could be designed as highly integrated machine sets.
Additionally, the following is assumed about the software layer which facilitates the communication and data exchange between the processors:

(1) Each participating processor has a unique identifier.
(2) The software is capable of transferring the data sets or parts thereof to individual processors.
Typical communication software layers that satisfy these conditions include, for example, the Message Passing Interface (MPI).
Let us assume that there are π processors participating in the deduplication process and that π > 1. In the present embodiment of the invention, the data set S is split into μ ≥ 1 distinct subsets, where the ith subset s_i (i = 0, 1, ..., μ − 1) contains m_i records; see Table 2.
TABLE 2

s_0 (m_0 records), s_1 (m_1 records), ..., s_{μ−1} (m_{μ−1} records)

and where we have defined

m_0 + m_1 + ... + m_{μ−1} = m.

Equally, the data set T is split up into ν ≥ 1 distinct subsets t_j, where the jth subset (j = 0, 1, ..., ν − 1) contains n_j records. We equally require that

n_0 + n_1 + ... + n_{ν−1} = n,

since otherwise records would be omitted from the duplication check.
The algorithm is most efficient if the data set S is split in such a way that the subsets all contain the same, or almost the same, number of records. The same applies to the data set T.
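A near-equal split can be produced directly. The following helper is our own sketch (the patent does not prescribe a splitting routine); it distributes any remainder one record at a time so that subset sizes differ by at most one:

```python
def split_even(records, k):
    """Split `records` into k subsets whose sizes differ by at most one."""
    q, r = divmod(len(records), k)
    subsets, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)   # the first r subsets take one extra record
        subsets.append(records[start:start + size])
        start += size
    return subsets

parts = split_even(list(range(10)), 3)
assert [len(p) for p in parts] == [4, 3, 3]   # sizes differ by at most one
assert sum(parts, []) == list(range(10))      # no record lost or duplicated
```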
Furthermore, it is desirable to have the condition π = μν. If we have π > μν then not all processors will participate and computer resources are wasted. If on the other hand π < μν then the algorithm described below would miss out on checking some subsets of the data sets for duplicates, unless a complicated remapping takes place (which is possible but generally also leads to a waste of computer resources).
As stated above, it is assumed here that the π processors are labelled from 0 to π − 1. If this is not the case, then the computers can be relabelled accordingly, which can be done because the communication software provides unique identifiers.
We now require that the processor α, where 0 ≤ α ≤ π − 1, obtains the subsets

s_i with i = α/ν (2)
t_j with j = α mod ν (3)

In the calculation of i = α/ν in equation (2) we retain only the integer value of the fraction. The following Table 3 illustrates the mapping of the processors.
Table 3

          t_0        t_1          ...  t_j          ...  t_{ν−1}
s_0       0          1            ...  j            ...  ν − 1
s_1       ν          ν + 1        ...               ...  2ν − 1
s_2       2ν         2ν + 1       ...               ...  3ν − 1
:
s_i       iν                      ...  iν + j       ...  iν + ν − 1
:
s_{μ−1}   (μ−1)ν     (μ−1)ν + 1   ...  (μ−1)ν + j   ...  μν − 1
Each processor will then check for duplicates between its respective subsets. It is clear from the way the data sets are split up and distributed across the processors that overall every record in S will be compared with every record in T, just as it would be in the single processor case. The single processor case is included in this algorithm as the special case μ = ν = 1. However, we made the condition above that π = μν > 1.
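Equations (2) and (3), and the coverage argument, can be checked directly; the function name below is ours:

```python
def subsets_for_processor(alpha, nu):
    """Processor alpha obtains s_i and t_j with i = alpha // nu (the integer part
    of alpha/nu, equation (2)) and j = alpha mod nu (equation (3))."""
    return alpha // nu, alpha % nu

# Every pair (s_i, t_j) is handled by exactly one processor, so every record of S
# is compared with every record of T, just as in the single-processor case.
mu, nu = 3, 4
assigned = {subsets_for_processor(a, nu) for a in range(mu * nu)}
assert assigned == {(i, j) for i in range(mu) for j in range(nu)}
assert len(assigned) == mu * nu   # no (s_i, t_j) pair is handled twice
```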
Furthermore, the foregoing does not depend on S and T being distinct. If S = T, the algorithm can equally be applied, although the comparison algorithm could usefully be modified such that each record would not be compared with itself, to avoid trivial duplicates.
The algorithm used to compare records is not described herein. Such algorithms are well known per se, and a comparison algorithm suitable for the nature of the data and the application for which the data is required can readily be selected.
The general example as thus far described has a number of advantages.
Since each processor receives only two subsets of the data sets, the storage requirement is greatly reduced. No processor has to hold the entirety of the two data sets, which could be beyond its main memory capacity.
The second advantage is speed. Rather than the overall time t for the deduplication process being proportional to mn, it can now theoretically be reduced to a time proportional to mn/π, assuming the data sets are split into equal subsets and neglecting set-up times, for example communicating data sets, input and output, etc.
Another advantage is that in principle there is no requirement for communication between processors if the system is set up such that the processors only read in local copies of their respective subsets.
Any duplicates that are found on a given processor can then also be outputted locally. This adds to the speed advantage, since no processor time is wasted in waiting for data communications to be completed.
Worked examples

Consider the following two data sets S and T, where the first column is the record number and is not considered part of the data set.

S                   T
1  Mr. G. Grant     1  Mr. S. Stallone
2  Mrs. B. Bardot   2  Mr. T. Cruise
3  Mr. H. Bogart    3  Mrs. N. Kidman
4  Mr. T. Cruise    4  Mr. B. Willis
                    5  Mrs. S. Weaver
                    6  Mr. T. Cruise
                    7  Mrs. J. Fonda
                    8  Mrs. B. Bardot
                    9  Mrs. J. Fonda
                    10 Mr. J. Wayne

A deduplication on a single processor would have to compare 4 records in S against 10 records in T, amounting to 40 comparisons in total. It would thereby find that record number 2 in S is identical to record number 8 in T, and record number 4 in S is identical to records 2 and 6 in T. Note that we only identify records that occur simultaneously in S and T as duplicates, i.e. records 7 and 9 in T are not considered duplicates; they would of course be considered duplicates if T were deduplicated against itself.
Worked example 1: two processors

According to the algorithm above we have the freedom to split either the data set S or the data set T into two parts. Either case would lead to the same speed of the algorithm, so a possible criterion for choosing would be the sizes of the data sets, as mentioned above. We have therefore split the data set T into two parts t_0 and t_1:
t_0:                     t_1:
1  Mr. S. Stallone       6  Mr. T. Cruise
2  Mr. T. Cruise         7  Mrs. J. Fonda
3  Mrs. N. Kidman        8  Mrs. B. Bardot
4  Mr. B. Willis         9  Mrs. J. Fonda
5  Mrs. S. Weaver        10 Mr. J. Wayne

The data set S is not split, so s_0 contains all four records (Mr. G. Grant, Mrs. B. Bardot, Mr. H. Bogart and Mr. T. Cruise). According to the prescription of the algorithm above we have that m_0 = 4, n_0 = n_1 = 5, and for the processor labels we have μ = 1, ν = 2, giving π = μν = 2. The data sets would then be distributed to the two processors such that processor 0 would get the subsets s_0 and t_0 and processor 1 would get the subsets s_0 and t_1, as shown in the following table:
        t_0   t_1
s_0     0     1

s_0:
1  Mr. G. Grant
2  Mrs. B. Bardot
3  Mr. H. Bogart
4  Mr. T. Cruise

Processors 0 and 1 now have to perform 20 comparisons each (4 records in S against 5 records in T), amounting to 40 comparisons in total; however, this is now done in half the time. Furthermore, processor 0 will find the duplicate record 4 in S and 2 in T, whereas processor 1 will find the duplicate record 4 in S and 6 in T as well as record 2 in S and 8 in T. Although each processor only finds a subset of the duplicates, overall all duplicates are found.
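Worked example 1 can be replayed in a few lines. This sequential simulation is our own sketch; it stands in for the two processors and uses exact string equality as the comparison rule, with surnames standing in for full records:

```python
# S is kept whole (mu = 1); T is split into t0 and t1 (nu = 2).
# Record numbers match the worked example in the text.
S  = {1: "Grant", 2: "Bardot", 3: "Bogart", 4: "Cruise"}
t0 = {1: "Stallone", 2: "Cruise", 3: "Kidman", 4: "Willis", 5: "Weaver"}
t1 = {6: "Cruise", 7: "Fonda", 8: "Bardot", 9: "Fonda", 10: "Wayne"}

def duplicates(s, t):
    """One processor's work: compare its S-subset against its T-subset,
    returning (record number in S, record number in T) pairs."""
    return [(i, j) for i, x in s.items() for j, y in t.items() if x == y]

assert duplicates(S, t0) == [(4, 2)]            # found by processor 0
assert duplicates(S, t1) == [(2, 8), (4, 6)]    # found by processor 1
assert len(S) * (len(t0) + len(t1)) == 40       # still 40 comparisons in total
```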
Worked example 2: six processors

In the case of six processors we can arrange the processors in a grid of 2 × 3, i.e. we divide the data set S into 2 subsets and T into 3 subsets, as follows:
s_0:                     t_0:
1  Mr. G. Grant          1  Mr. S. Stallone
2  Mrs. B. Bardot        2  Mr. T. Cruise
                         3  Mrs. N. Kidman
s_1:
3  Mr. H. Bogart         t_1:
4  Mr. T. Cruise         4  Mr. B. Willis
                         5  Mrs. S. Weaver
                         6  Mr. T. Cruise

                         t_2:
                         7  Mrs. J. Fonda
                         8  Mrs. B. Bardot
                         9  Mrs. J. Fonda
                         10 Mr. J. Wayne

Again according to the prescription of the algorithm above we have that m_0 = m_1 = 2, n_0 = n_1 = 3, n_2 = 4, and for the processor labels we have μ = 2, ν = 3, giving π = μν = 6. We now get the following distribution of subsets across the processors:

        t_0   t_1   t_2
s_0     0     1     2
s_1     3     4     5
Processors 0, 1, 3 and 4 will each perform 2 × 3 = 6 comparisons, whereas processors 2 and 5 will each perform 2 × 4 = 8 comparisons, i.e. they have to do slightly more work. That means that processors 0, 1, 3 and 4 will do their work in less than 1/6, and processors 2 and 5 in about 1/5, of the single processor time.

Duplicates will now be found by processor 2 (finding record 2 in S and 8 in T) as well as processors 3 and 4 (finding record 4 in S and records 2 and 6 in T respectively). Again we see that all duplicates will be found.
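The load figures quoted above follow directly from the subset sizes and the processor mapping; a short check (our sketch):

```python
# Subset sizes from worked example 2: m = [2, 2] for S, n = [3, 3, 4] for T.
# Processor a (0..5) handles s_{a // 3} against t_{a % 3}.
m, n = [2, 2], [3, 3, 4]
work = [m[a // 3] * n[a % 3] for a in range(6)]   # comparisons per processor
assert work == [6, 6, 8, 6, 6, 8]   # processors 2 and 5 do slightly more work
assert sum(work) == 40              # total equals the single-processor count
```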
Other algorithms

There are of course other ways of parallelising the deduplication. The first that would come to mind would be to divide S and T each into π subsets, given that there are π processors, with processor α holding the subsets s_α and t_α. The problem that then arises is a large overhead of additional bookkeeping and inter-processor communications. The following table, using the previous example with π = 4, illustrates this:

        t_0   t_1   t_2   t_3
s_0     0
s_1           1
s_2                 2
s_3                       3
Since processor α only holds the subsets s_α and t_α, only the "diagonal" can be processed. After this has been done, processor α would need to perform a communication with another processor to get hold of another subset t_j, where j ≠ α, to continue the deduplication process. At least π − 1 such communications for all processors are necessary so that overall all records in S will have been compared with all records in T. Additional information needs to be stored and possibly communicated so that each processor knows which processor holds which subset of S and T, and furthermore which subsets have already been compared and which are next to be communicated. In addition to the time that a processor wastes by waiting for a data communication to finish, there is a further waste of time. As can be seen in the example above, processors 2 and 3 actually have more work to do than processors 0 and 1. This means that the latter will have finished their comparisons before the former, and would have to wait if they needed to communicate with processors 2 and 3 in order to get additional data subsets. Such situations are very likely to occur in real situations.
In contrast to all this, the preferred algorithm theoretically needs no communication at all, thereby saving computer resources and closely achieving the theoretically possible speed-up factor of π.
Implementation

The foregoing examples are not dependent on any particular hardware implementation.
It is not specified above how the processors get the subsets. This depends on the hardware and communications software used. One possible arrangement is for all processors to have access to the same data sets provided on global or local storage such that each processor can individually read in its respective subsets directly.
Alternatively, on some hardware it might only be possible, or it might be faster, that one processor reads in the subsets and then communicates them to the individual processors as specified above.
The same applies to the results of the deduplication process. Duplicate records could either be outputted individually by each processor, or they could be communicated to one processor which is responsible for the data input and output.
Figs. 1 and 2 illustrate two such possible implementations.
In Fig. 1, a central processor 10 receives the data sets S and T and divides these into the subsets s_0 to s_{μ−1} and t_0 to t_{ν−1}. The subsets are passed to a global memory 12, in which each subset is stored in a location corresponding to its identity. The comparison is carried out by a series of processors 14 equal in number to μν, therefore using the preferred algorithm as described above. Each of the processors is programmed to fetch one subset s_i and one subset t_j from a pair of predefined memory locations, effect the comparison, and output any duplicates at 16. The communication is effected via an intranet system 18.
The system illustrated in Fig. 2 operates on an alternative algorithm as discussed above. A control processor 20 receives the data sets S and T and divides these into the subsets s_0 to s_{μ−1} and t_0 to t_{ν−1}. There is a lesser number of processors in this example, equal in number to ν. The control processor 20 transmits to each processor a respective pair of subsets; this could comprise, for example, the same subset s_0 to all processors and a different subset t_j to each processor. The processors effect the comparison and transmit any duplicates to the control processor 20. This process must then be repeated for each of the remaining subsets s_1, s_2, etc.
Modifications may be made to the foregoing within the scope of the present invention.

Claims (7)

  1. A method of comparing two data sets, each of which contains a number of records, to detect duplicate records, the method comprising: dividing the first data set into μ subsets; dividing the second data set into ν subsets, at least one of μ and ν being greater than 1; and comparing each subset of the first set with each subset of the second set, at least some of said comparisons being carried out simultaneously in parallel.
  2. A method according to claim 1, in which all of said comparisons are carried out simultaneously in parallel.
  3. A method according to claim 1 or claim 2, in which the subsets of the first data set contain approximately equal numbers of records, and the subsets of the second data set contain approximately equal numbers of records.
  4. A method according to claim 2, in which said comparisons are carried out by parallel processors numbering π = μν processors, and each processor α, where 0 ≤ α ≤ π − 1, obtains the subsets s_i with i = α/ν (retaining only the integer value) and t_j with j = α mod ν.
  5. A method according to any preceding claim, in which the first and second data sets are identical, whereby duplicates are removed from a single data set.
  6. A data processing system for use in removing duplicate records from first and second data sets, each data set comprising a plurality of records, the system comprising: computing means operable to divide the first data set into μ subsets and to divide the second data set into ν subsets, at least one of μ and ν being greater than 1; and a plurality of processors, each of which is operable to compare a subset of the first data set with a subset of the second data set to determine whether these are within a predefined definition of similarity; said processors operating in parallel whereby at least some of said comparisons are carried out simultaneously.
  7. A data processing system according to claim 6, in which said processors number π = μν, and each processor α, where 0 ≤ α ≤ π − 1, is operable to obtain and compare the subsets s_i with i = α/ν (retaining only the integer value) and t_j with j = α mod ν.
GB0210164A 2001-05-12 2002-05-03 Removal of duplicates from large data sets Withdrawn GB2378789A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB0111648.2A GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets

Publications (2)

Publication Number Publication Date
GB0210164D0 GB0210164D0 (en) 2002-06-12
GB2378789A true GB2378789A (en) 2003-02-19

Family

ID=9914527

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB0111648.2A Ceased GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets
GB0210164A Withdrawn GB2378789A (en) 2001-05-12 2002-05-03 Removal of duplicates from large data sets

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB0111648.2A Ceased GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets

Country Status (1)

Country Link
GB (2) GB0111648D0 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809721B2 (en) 2007-11-16 2010-10-05 Iac Search & Media, Inc. Ranking of objects using semantic and nonsemantic features in a system and method for conducting a search
US7921108B2 (en) 2007-11-16 2011-04-05 Iac Search & Media, Inc. User interface and method in a local search system with automatic expansion
US8090714B2 (en) 2007-11-16 2012-01-03 Iac Search & Media, Inc. User interface and method in a local search system with location identification in a request
US8145703B2 (en) 2007-11-16 2012-03-27 Iac Search & Media, Inc. User interface and method in a local search system with related search results
US8180771B2 (en) 2008-07-18 2012-05-15 Iac Search & Media, Inc. Search activity eraser
US8352540B2 (en) 2008-03-06 2013-01-08 International Business Machines Corporation Distinguishing data streams to enhance data storage efficiency
US8560506B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication
US8732155B2 (en) 2007-11-16 2014-05-20 Iac Search & Media, Inc. Categorization in a system and method for conducting a search

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5497486A (en) * 1994-03-15 1996-03-05 Salvatore J. Stolfo Method of merging large databases in parallel

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5497486A (en) * 1994-03-15 1996-03-05 Salvatore J. Stolfo Method of merging large databases in parallel

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809721B2 (en) 2007-11-16 2010-10-05 Iac Search & Media, Inc. Ranking of objects using semantic and nonsemantic features in a system and method for conducting a search
US7921108B2 (en) 2007-11-16 2011-04-05 Iac Search & Media, Inc. User interface and method in a local search system with automatic expansion
US8090714B2 (en) 2007-11-16 2012-01-03 Iac Search & Media, Inc. User interface and method in a local search system with location identification in a request
US8145703B2 (en) 2007-11-16 2012-03-27 Iac Search & Media, Inc. User interface and method in a local search system with related search results
US8732155B2 (en) 2007-11-16 2014-05-20 Iac Search & Media, Inc. Categorization in a system and method for conducting a search
US8352540B2 (en) 2008-03-06 2013-01-08 International Business Machines Corporation Distinguishing data streams to enhance data storage efficiency
US8180771B2 (en) 2008-07-18 2012-05-15 Iac Search & Media, Inc. Search activity eraser
US8560506B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication
US8560505B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication

Also Published As

Publication number Publication date
GB0111648D0 (en) 2001-07-04
GB0210164D0 (en) 2002-06-12

Similar Documents

Publication Publication Date Title
US10885012B2 (en) System and method for large-scale data processing using an application-independent framework
US5307485A (en) Method and apparatus for merging sorted lists in a multiprocessor shared memory system
US6505187B1 (en) Computing multiple order-based functions in a parallel processing database system
Shapiro Join processing in database systems with large main memories
US7389310B1 (en) Supercomputing environment for duplicate detection on web-scale data
US5497486A (en) Method of merging large databases in parallel
US6339777B1 (en) Method and system for handling foreign key update in an object-oriented database environment
Vats et al. Performance evaluation of K-means clustering on Hadoop infrastructure
US20090077078A1 (en) Methods and systems for merging data sets
Kolb et al. Don't match twice: redundancy-free similarity computation with MapReduce
CA2357937A1 (en) Database diagnostic system and method
Kolb et al. Learning-based entity resolution with MapReduce
Dehne et al. Efficient external memory algorithms by simulating coarse-grained parallel algorithms
US6295539B1 (en) Dynamic determination of optimal process for enforcing constraints
Lorie et al. A low communication sort algorithm for a parallel database machine
US20070112865A1 (en) Enforcing constraints from a parent table to a child table
GB2378789A (en) Removal of duplicates from large data sets
Lakshmi et al. Limiting factors of join performance on parallel processors
US20030195869A1 (en) Method and system for query processing by combining indexes of multilevel granularity or composition
Shaw A relational database machine architecture
CN114860722A (en) Data fragmentation method, device, equipment and medium based on artificial intelligence
Marinov A bloom filter application for processing big datasets through MapReduce framework
Sehili et al. Multi-party privacy preserving record linkage in dynamic metric space
CN114945902A (en) Shuffle reduction task with reduced I/O overhead
Keller et al. The one-to-one match operator of the Volcano query processing system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)