GB2378789A - Removal of duplicates from large data sets - Google Patents

Removal of duplicates from large data sets

Info

Publication number
GB2378789A
GB2378789A GB0210164A
Authority
GB
United Kingdom
Prior art keywords
subsets
data set
processors
records
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0210164A
Other versions
GB0210164D0 (en)
Inventor
Sara Bearder
Jens Rasch
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DATACTICS Ltd
Original Assignee
DATACTICS Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DATACTICS Ltd filed Critical DATACTICS Ltd
Publication of GB0210164D0 publication Critical patent/GB0210164D0/en
Publication of GB2378789A publication Critical patent/GB2378789A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Data sets S and T are compared to remove records which are, within predefined limits, duplicate records. The data set S is divided into μ subsets and the data set T into ν subsets, at least one of μ and ν being greater than unity. The subsets of the first data set S are then compared with the subsets of the second data set T by use of parallel processors such that at least some of the comparisons are carried out simultaneously.

Description

"Removal of Duplicates from Large Data Sets"

This invention relates to locating duplicate records in large data sets.
As used herein, "data set" means a number of records each of which has a number of defined fields, and "duplicate" means a record which is, within predefined limits, the same as another record.
Background to the invention

In recent years there has been an explosion in the size and availability of databases. There is also an increasing desire to squeeze as much information as possible from data sets, particularly for client profiling, marketing and business analysis. There is no great technical difficulty in patching together data from different data sets. However, the problem that then arises is duplication. Due to human nature, when people fill in forms they can allow quite a large breadth of variation from entry to entry, let alone problems caused by deliberate variation or by address or name changes.
There are two important problems caused by such duplication. Firstly, multiple entries for the same individual may enable fraud. Secondly, it destroys the ability to keep good quality data and then marketing may result in duplicate letters being sent, which is detrimental to company image and leads to extra mailing costs.
It is known to remove duplicates from, or "deduplicate", a data set. It is relatively simple to establish a set of rules defining the degree of variation which will allow two records to be considered as distinct, or as duplicates. The difficulty is then that each record must be compared with every other record if a complete deduplication is to be achieved, and this is extremely costly in processor memory and time.
It has been proposed to effect a comparison which is more economical of processor time but which is not a complete comparison. For example, US 5,497,486 describes a process in which records are first sorted or clustered in accordance with a selected key, and the sorted or clustered records are then scanned with a relatively small window. Such methods have the disadvantages that accuracy depends on the relevance and accuracy of the chosen sorting key, and that the comparison is not total.
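The window-scan idea can be sketched in a few lines. This is our own minimal reconstruction of the general approach described above, not the patented method itself; the sort key, window width and exact-equality comparison rule are illustrative assumptions.

```python
# Minimal sketch (our construction) of a sorted-window deduplication scan:
# sort the records by a chosen key, then compare each record only with the
# next w - 1 records in sorted order, rather than with every other record.

def window_duplicates(records, key, w=3):
    """Return pairs judged duplicates; only neighbours inside the window are compared."""
    ordered = sorted(records, key=key)
    found = []
    for i, x in enumerate(ordered):
        for y in ordered[i + 1:i + w]:
            if x == y:                     # stand-in for a fuzzy comparison rule
                found.append((x, y))
    return found

names = ["Cruise", "Bardot", "Cruise", "Fonda", "Cruise"]
pairs = window_duplicates(names, key=lambda r: r, w=3)
assert len(pairs) == 3   # the three Cruise entries sort together, giving 3 pairs
```

With exact equality the sort places duplicates adjacently, so nothing is missed here; with a poorly chosen key and fuzzy matching, near-duplicates sorted far apart escape the window, which is the accuracy limitation noted above.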
Summary of the invention

The present invention provides a method of comparing two data sets, each of which contains a number of records, to detect duplicate records, the method comprising: dividing the first data set into μ subsets; dividing the second data set into ν subsets, at least one of μ and ν being greater than 1; and comparing each subset of the first set with each subset of the second set, at least some of said comparisons being carried out simultaneously in parallel.
Preferably, all of said comparisons are carried out simultaneously in parallel.
The subsets of the first data set preferably contain approximately equal numbers of records, and the subsets of the second data set preferably contain approximately equal numbers of records.
In a preferred form of the invention, said comparisons are carried out by parallel processors numbering π = μν processors, and each processor α, where 0 ≤ α ≤ π − 1, obtains the subsets s_i with i = ⌊α/ν⌋ and t_j with j = α mod ν.
Said two data sets will typically consist of two different sets. However, the invention may also be applied to a single data set, by comparing it with itself.
From another aspect, the invention provides a data processing system for use in removing duplicate records from first and second data sets, each data set comprising a plurality of records, the system comprising: computing means operable to divide the first data set into μ subsets and to divide the second data set into ν subsets, at least one of μ and ν being greater than 1; and a plurality of processors, each of which is operable to compare a subset of the first data set with a subset of the second data set to determine whether these are within a predefined definition of similarity; said processors operating in parallel whereby at least some of said comparisons are carried out simultaneously.
Preferably, said processors number π = μν, and each processor α, where 0 ≤ α ≤ π − 1, is operable to obtain and compare the subsets s_i with i = ⌊α/ν⌋ and t_j with j = α mod ν.
Description of preferred examples

Examples of the present invention will now be described.
In the following we shall deal with two data sets labelled S and T, where the task is to find duplicates between S and T, that is, records which appear in both S and T. Furthermore, a record can appear more than once in S and in T, but must appear at least once in each data set to qualify as a duplicate.
In addition, the data sets S and T do not have to be distinct: they can be the same data set, in which case one would be looking for multiple records in the same data set. In the following we shall refer to two data sets S and T and imply that S and T may be identical or non-identical, except where otherwise stated.
We shall assume for the following analysis that the time taken to compare one record with another is the same for all records. This is a good approximation of real comparison algorithms.
In order to estimate the time required for the deduplication, consider the case of comparing the data set S consisting of m records with the data set T consisting of n records, as in Table 1.
TABLE 1

S                    T
Mr. G. Grant         Mr. S. Stallone
Mrs. B. Bardot       Mr. T. Cruise
Mr. H. Bogart        Mrs. N. Kidman
:                    Mr. B. Willis
:                    Mrs. S. Weaver
Mr. J. Wayne         :
                     Mrs. J. Fonda
(m records)          (n records)

In this case we have to compare all m records of S with all n records of T. Since we have assumed that the time for each comparison is the same, the time t required to do a complete comparison of all records is proportional to the product of the numbers of records, i.e. t ∝ mn. This means on the one hand that the time required to check for a single new entry in a given database increases linearly with the size of that database. On the other hand, it means that the time for the deduplication of an existing data set (i.e. where S and T are the same), where every entry has to be compared with every other entry, is proportional to n²; i.e. doubling the size of the database results in a quadrupling of the time required. This is a limit that even a program with a very fast record comparison algorithm cannot overcome, and the process will take prohibitively long for large data sets. In addition, a large main computer memory (RAM) will be required to work on large data sets.
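As a quick illustration of this scaling argument (a sketch of ours, not part of the patent), the comparison count is simply the product mn, so doubling n in the self-deduplication case quadruples the work:

```python
def total_comparisons(m, n):
    """Comparisons needed for a complete check of S (m records) against T (n records),
    assuming every record of S is compared with every record of T."""
    return m * n

assert total_comparisons(4, 10) == 40   # the worked example later in the text
# Self-deduplication (S = T): doubling n quadruples the work, t ~ n^2.
assert total_comparisons(2000, 2000) == 4 * total_comparisons(1000, 1000)
```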
The fact that the time for deduplication is of the order of n2 cannot be overcome on a single processor machine. In order to overcome this limitation and speed up the time required, the deduplication process needs to be parallelised. This means that for given data sets S and T the deduplication is executed on some number of processors simultaneously, leading to a dramatic reduction in the time required. A further advantage is that the data sets can be split up and spread over several computers so that all the available main memory of the computers involved can be harnessed.
General example of preferred algorithm

There will now be described a novel algorithm for the deduplication of two data sets on multiple processor machines or computers.
In this section of the present description it is assumed that the processors are suitably interconnected such that it is possible for a given processor to send information of any type to any other participating processor. Nothing is assumed about the physical installation of these processors. They could, for example, be individual computers connected by electronic or fibre-optic link via intranets or the internet, or they could be designed as highly integrated machine sets.
Additionally, the following is assumed about the software layer which facilitates the communication and data exchange between the processors:

(1) Each participating processor has a unique identifier.
(2) The software is capable of transferring the data sets or parts thereof to individual processors.
Typical communication software layers that satisfy these conditions include, for example, the Message Passing Interface (MPI).
Let us assume that there are π processors participating in the deduplication process and that π > 1. In the present embodiment of the invention, the data set S is split into μ ≥ 1 distinct subsets, where the ith subset s_i (i = 0, 1, ..., μ − 1) contains m_i records; see Table 2.
TABLE 2

s_0 (m_0 records), s_1 (m_1 records), ..., s_{μ−1} (m_{μ−1} records)

and where we have defined

m_0 + m_1 + ... + m_{μ−1} = m.

Equally, the data set T is split up into ν ≥ 1 distinct subsets t_j, where the jth subset (j = 0, 1, ..., ν − 1) contains n_j records. We equally require that

n_0 + n_1 + ... + n_{ν−1} = n,

since otherwise records would be omitted from the duplication check.
The algorithm is most efficient if the data set S is split in such a way that the subsets all contain the same, or almost the same, number of records. The same applies to the data set T.
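A near-equal split can be produced directly. The following helper is our own sketch (the patent does not prescribe a splitting routine); it distributes any remainder one record at a time so that subset sizes differ by at most one:

```python
def split_even(records, k):
    """Split `records` into k subsets whose sizes differ by at most one."""
    q, r = divmod(len(records), k)
    subsets, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)   # the first r subsets take one extra record
        subsets.append(records[start:start + size])
        start += size
    return subsets

parts = split_even(list(range(10)), 3)
assert [len(p) for p in parts] == [4, 3, 3]   # sizes differ by at most one
assert sum(parts, []) == list(range(10))      # no record lost or duplicated
```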
Furthermore, it is desirable to have the condition π = μν. If we have π > μν then not all processors will participate and computer resources are wasted. If on the other hand π < μν then the algorithm described below would miss out on checking some subsets of the data sets for duplicates, unless a complicated remapping takes place (which is possible but generally also leads to a waste of computer resources).
As stated above, it is assumed here that the π processors are labelled from 0 to π − 1. If this is not the case, then the computers can be relabelled accordingly, which can be done because the communication software provides unique identifiers.
We now require that the processor α, where 0 ≤ α ≤ π − 1, obtains the subsets

s_i with i = α/ν (2)
t_j with j = α mod ν (3)

In the calculation of i = α/ν in equation (2) we retain only the integer value of the fraction. The following Table 3 illustrates the mapping of the processors.
Table 3

          t_0        t_1          ...  t_j          ...  t_{ν−1}
s_0       0          1            ...  j            ...  ν − 1
s_1       ν          ν + 1        ...               ...  2ν − 1
s_2       2ν         2ν + 1       ...               ...  3ν − 1
:
s_i       iν                      ...  iν + j       ...  iν + ν − 1
:
s_{μ−1}   (μ−1)ν     (μ−1)ν + 1   ...  (μ−1)ν + j   ...  μν − 1
Each processor will then check for duplicates between its respective subsets. It is clear from the way the data sets are split up and distributed across the processors that overall every record in S will be compared with every record in T, just as it would be in the single processor case. The single processor case is included in this algorithm as the special case μ = ν = 1. However, we made the condition above that π = μν > 1.
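Equations (2) and (3), and the coverage argument, can be checked directly; the function name below is ours:

```python
def subsets_for_processor(alpha, nu):
    """Processor alpha obtains s_i and t_j with i = alpha // nu (the integer part
    of alpha/nu, equation (2)) and j = alpha mod nu (equation (3))."""
    return alpha // nu, alpha % nu

# Every pair (s_i, t_j) is handled by exactly one processor, so every record of S
# is compared with every record of T, just as in the single-processor case.
mu, nu = 3, 4
assigned = {subsets_for_processor(a, nu) for a in range(mu * nu)}
assert assigned == {(i, j) for i in range(mu) for j in range(nu)}
assert len(assigned) == mu * nu   # no (s_i, t_j) pair is handled twice
```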
Furthermore, the foregoing does not depend on S and T being distinct. If S = T, the algorithm can equally be applied, although the comparison algorithm could usefully be modified such that each record would not be compared with itself, to avoid trivial duplicates.
The algorithm used to compare records is not described herein. Such algorithms are well known per se, and a comparison algorithm suitable for the nature of the data and the application for which the data is required can readily be selected.
The general example as thus far described has a number of advantages.
Since each processor receives only two subsets of the data sets, the storage requirement is greatly reduced. No processor has to hold the entirety of the two data sets, which could be beyond its main memory capacity.
The second advantage is speed. Rather than the overall time t for the deduplication process being proportional to mn, it can now theoretically be reduced to a time proportional to mn/π, assuming the data sets are split into equal subsets and neglecting set-up times, for example communicating data sets, input and output, etc.
Another advantage is that in principle there is no requirement for communication between processors if the system is set up such that the processors only read in local copies of their respective subsets.
Any duplicates that are found on a given processor can then also be outputted locally. This adds to the speed advantage, since no processor time is wasted in waiting for data communications to be completed.
Worked examples

Consider the following two data sets S and T, where the first column is the record number and is not considered part of the data set.

S                   T
1  Mr. G. Grant     1  Mr. S. Stallone
2  Mrs. B. Bardot   2  Mr. T. Cruise
3  Mr. H. Bogart    3  Mrs. N. Kidman
4  Mr. T. Cruise    4  Mr. B. Willis
                    5  Mrs. S. Weaver
                    6  Mr. T. Cruise
                    7  Mrs. J. Fonda
                    8  Mrs. B. Bardot
                    9  Mrs. J. Fonda
                    10 Mr. J. Wayne

A deduplication on a single processor would have to compare 4 records in S against 10 records in T, amounting to 40 comparisons in total. It would thereby find that record number 2 in S is identical to record number 8 in T, and record number 4 in S is identical to records 2 and 6 in T. Note that we only identify records that occur simultaneously in S and T as duplicates, i.e. records 7 and 9 in T are not considered duplicates; they would of course be considered duplicates if T were deduplicated against itself.
Worked example 1: two processors

According to the algorithm above we have the freedom to split either the data set S or the data set T into two parts. Either case would lead to the same speed of the algorithm, so a possible criterion for choosing would be the sizes of the data sets, as mentioned above. We have therefore split the data set T into two parts t_0 and t_1:
t_0:                     t_1:
1  Mr. S. Stallone       6  Mr. T. Cruise
2  Mr. T. Cruise         7  Mrs. J. Fonda
3  Mrs. N. Kidman        8  Mrs. B. Bardot
4  Mr. B. Willis         9  Mrs. J. Fonda
5  Mrs. S. Weaver        10 Mr. J. Wayne

The data set S is not split, so s_0 contains all four records (Mr. G. Grant, Mrs. B. Bardot, Mr. H. Bogart and Mr. T. Cruise). According to the prescription of the algorithm above we have that m_0 = 4, n_0 = n_1 = 5, and for the processor labels we have μ = 1, ν = 2, giving π = μν = 2. The data sets would then be distributed to the two processors such that processor 0 would get the subsets s_0 and t_0 and processor 1 would get the subsets s_0 and t_1, as shown in the following table:
        t_0   t_1
s_0     0     1

s_0:
1  Mr. G. Grant
2  Mrs. B. Bardot
3  Mr. H. Bogart
4  Mr. T. Cruise

Processors 0 and 1 now have to perform 20 comparisons each (4 records in S against 5 records in T), amounting to 40 comparisons in total; however, this is now done in half the time. Furthermore, processor 0 will find the duplicate record 4 in S and 2 in T, whereas processor 1 will find the duplicate record 4 in S and 6 in T as well as record 2 in S and 8 in T. Although each processor only finds a subset of the duplicates, overall all duplicates are found.
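Worked example 1 can be replayed in a few lines. This sequential simulation is our own sketch; it stands in for the two processors and uses exact string equality as the comparison rule, with surnames standing in for full records:

```python
# S is kept whole (mu = 1); T is split into t0 and t1 (nu = 2).
# Record numbers match the worked example in the text.
S  = {1: "Grant", 2: "Bardot", 3: "Bogart", 4: "Cruise"}
t0 = {1: "Stallone", 2: "Cruise", 3: "Kidman", 4: "Willis", 5: "Weaver"}
t1 = {6: "Cruise", 7: "Fonda", 8: "Bardot", 9: "Fonda", 10: "Wayne"}

def duplicates(s, t):
    """One processor's work: compare its S-subset against its T-subset,
    returning (record number in S, record number in T) pairs."""
    return [(i, j) for i, x in s.items() for j, y in t.items() if x == y]

assert duplicates(S, t0) == [(4, 2)]            # found by processor 0
assert duplicates(S, t1) == [(2, 8), (4, 6)]    # found by processor 1
assert len(S) * (len(t0) + len(t1)) == 40       # still 40 comparisons in total
```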
Worked example 2: six processors

In the case of six processors we can arrange the processors in a grid of 2 × 3, i.e. we divide the data set S into 2 subsets and T into 3 subsets, as follows:
s_0:                     t_0:
1  Mr. G. Grant          1  Mr. S. Stallone
2  Mrs. B. Bardot        2  Mr. T. Cruise
                         3  Mrs. N. Kidman
s_1:
3  Mr. H. Bogart         t_1:
4  Mr. T. Cruise         4  Mr. B. Willis
                         5  Mrs. S. Weaver
                         6  Mr. T. Cruise

                         t_2:
                         7  Mrs. J. Fonda
                         8  Mrs. B. Bardot
                         9  Mrs. J. Fonda
                         10 Mr. J. Wayne

Again according to the prescription of the algorithm above we have that m_0 = m_1 = 2, n_0 = n_1 = 3, n_2 = 4, and for the processor labels we have μ = 2, ν = 3, giving π = μν = 6. We now get the following distribution of subsets across the processors:

        t_0   t_1   t_2
s_0     0     1     2
s_1     3     4     5
Processors 0, 1, 3 and 4 will each perform 2 × 3 = 6 comparisons, whereas processors 2 and 5 will each perform 2 × 4 = 8 comparisons, i.e. they have to do slightly more work. That means that processors 0, 1, 3 and 4 will do their work in less than 1/6, and processors 2 and 5 in about 1/5, of the single processor time.

Duplicates will now be found by processor 2 (finding record 2 in S and 8 in T) as well as processors 3 and 4 (finding record 4 in S and records 2 and 6 in T respectively). Again we see that all duplicates will be found.
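The load figures quoted above follow directly from the subset sizes and the processor mapping; a short check (our sketch):

```python
# Subset sizes from worked example 2: m = [2, 2] for S, n = [3, 3, 4] for T.
# Processor a (0..5) handles s_{a // 3} against t_{a % 3}.
m, n = [2, 2], [3, 3, 4]
work = [m[a // 3] * n[a % 3] for a in range(6)]   # comparisons per processor
assert work == [6, 6, 8, 6, 6, 8]   # processors 2 and 5 do slightly more work
assert sum(work) == 40              # total equals the single-processor count
```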
Other algorithms

There are of course other ways of parallelising the deduplication. The first that would come to mind would be to divide S and T each into π subsets, given that there are π processors, with processor α holding the subsets s_α and t_α. The problem that then arises is a large overhead of additional bookkeeping and inter-processor communications. The following table, using the previous example with π = 4, illustrates this:

        t_0   t_1   t_2   t_3
s_0     0
s_1           1
s_2                 2
s_3                       3
Since processor α only holds the subsets s_α and t_α, only the "diagonal" can be processed. After this has been done, processor α would need to perform a communication with another processor to get hold of another subset t_j, where j ≠ α, to continue the deduplication process. At least π − 1 such communications for all processors are necessary so that overall all records in S will have been compared with all records in T. Additional information needs to be stored and possibly communicated so that each processor knows which processor holds which subset of S and T, and furthermore which subsets have already been compared and which are next to be communicated. In addition to the time that a processor wastes by waiting for a data communication to finish, there is a further waste of time. As can be seen in the example above, processors 2 and 3 actually have more work to do than processors 0 and 1. This means that the latter will have finished their comparisons before the former, and would have to wait if they needed to communicate with processors 2 and 3 in order to get additional data subsets. Such situations are very likely to occur in real situations.
In contrast to all this, the preferred algorithm theoretically needs no communication at all, thereby saving computer resources and closely achieving the theoretically possible speed-up factor of π.
Implementation

The foregoing examples are not dependent on any particular hardware implementation.
It is not specified above how the processors get the subsets. This depends on the hardware and communications software used. One possible arrangement is for all processors to have access to the same data sets provided on global or local storage such that each processor can individually read in its respective subsets directly.
Alternatively, on some hardware it might only be possible, or it might be faster, that one processor reads in the subsets and then communicates them to the individual processors as specified above.
The same applies to the results of the deduplication process. Duplicate records could either be outputted individually by each processor, or they could be communicated to one processor which is responsible for the data input and output.
Figs. 1 and 2 illustrate two such possible implementations.
In Fig. 1, a central processor 10 receives the data sets S and T and divides these into the subsets s_0 to s_{μ−1} and t_0 to t_{ν−1}. The subsets are passed to a global memory 12, in which each subset is stored in a location corresponding to its identity. The comparison is carried out by a series of processors 14 equal in number to μν, therefore using the preferred algorithm as described above. Each of the processors is programmed to fetch one subset s_i and one subset t_j from a pair of predefined memory locations, effect the comparison, and output any duplicates at 16. The communication is effected via an intranet system 18.
The system illustrated in Fig. 2 operates on an alternative algorithm as discussed above. A control processor 20 receives the data sets S and T and divides these into the subsets s_0 to s_{μ−1} and t_0 to t_{ν−1}. There is a lesser number of processors in this example, equal in number to ν. The control processor 20 transmits to each processor a respective pair of subsets; this could comprise, for example, the same subset s_0 to all processors and a different subset t_j to each processor. The processors effect the comparison and transmit any duplicates to the control processor 20. This process must then be repeated for each of the remaining subsets s_1, s_2, etc.
Modifications may be made to the foregoing within the scope of the present invention.

Claims (7)

  1. A method of comparing two data sets, each of which contains a number of records, to detect duplicate records, the method comprising: dividing the first data set into μ subsets; dividing the second data set into ν subsets, at least one of μ and ν being greater than 1; and comparing each subset of the first set with each subset of the second set, at least some of said comparisons being carried out simultaneously in parallel.
  2. A method according to claim 1, in which all of said comparisons are carried out simultaneously in parallel.
  3. A method according to claim 1 or claim 2, in which the subsets of the first data set contain approximately equal numbers of records, and the subsets of the second data set contain approximately equal numbers of records.
  4. A method according to claim 2, in which said comparisons are carried out by parallel processors numbering π = μν processors, and each processor α, where 0 ≤ α ≤ π − 1, obtains the subsets s_i with i = α/ν (retaining only the integer value) and t_j with j = α mod ν.
  5. A method according to any preceding claim, in which the first and second data sets are identical, whereby duplicates are removed from a single data set.
  6. A data processing system for use in removing duplicate records from first and second data sets, each data set comprising a plurality of records, the system comprising: computing means operable to divide the first data set into μ subsets and to divide the second data set into ν subsets, at least one of μ and ν being greater than 1; and a plurality of processors, each of which is operable to compare a subset of the first data set with a subset of the second data set to determine whether these are within a predefined definition of similarity; said processors operating in parallel whereby at least some of said comparisons are carried out simultaneously.
  7. A data processing system according to claim 6, in which said processors number π = μν, and each processor α, where 0 ≤ α ≤ π − 1, is operable to obtain and compare the subsets s_i with i = α/ν (retaining only the integer value) and t_j with j = α mod ν.
GB0210164A 2001-05-12 2002-05-03 Removal of duplicates from large data sets Withdrawn GB2378789A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB0111648.2A GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets

Publications (2)

Publication Number Publication Date
GB0210164D0 GB0210164D0 (en) 2002-06-12
GB2378789A true GB2378789A (en) 2003-02-19

Family

ID=9914527

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB0111648.2A Ceased GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets
GB0210164A Withdrawn GB2378789A (en) 2001-05-12 2002-05-03 Removal of duplicates from large data sets

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB0111648.2A Ceased GB0111648D0 (en) 2001-05-12 2001-05-12 Removal of duplicates from large data sets

Country Status (1)

Country Link
GB (2) GB0111648D0 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809721B2 (en) 2007-11-16 2010-10-05 Iac Search & Media, Inc. Ranking of objects using semantic and nonsemantic features in a system and method for conducting a search
US7921108B2 (en) 2007-11-16 2011-04-05 Iac Search & Media, Inc. User interface and method in a local search system with automatic expansion
US8090714B2 (en) 2007-11-16 2012-01-03 Iac Search & Media, Inc. User interface and method in a local search system with location identification in a request
US8145703B2 (en) 2007-11-16 2012-03-27 Iac Search & Media, Inc. User interface and method in a local search system with related search results
US8180771B2 (en) 2008-07-18 2012-05-15 Iac Search & Media, Inc. Search activity eraser
US8352540B2 (en) 2008-03-06 2013-01-08 International Business Machines Corporation Distinguishing data streams to enhance data storage efficiency
US8560506B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication
US8732155B2 (en) 2007-11-16 2014-05-20 Iac Search & Media, Inc. Categorization in a system and method for conducting a search

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5497486A (en) * 1994-03-15 1996-03-05 Salvatore J. Stolfo Method of merging large databases in parallel

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5497486A (en) * 1994-03-15 1996-03-05 Salvatore J. Stolfo Method of merging large databases in parallel

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809721B2 (en) 2007-11-16 2010-10-05 Iac Search & Media, Inc. Ranking of objects using semantic and nonsemantic features in a system and method for conducting a search
US7921108B2 (en) 2007-11-16 2011-04-05 Iac Search & Media, Inc. User interface and method in a local search system with automatic expansion
US8090714B2 (en) 2007-11-16 2012-01-03 Iac Search & Media, Inc. User interface and method in a local search system with location identification in a request
US8145703B2 (en) 2007-11-16 2012-03-27 Iac Search & Media, Inc. User interface and method in a local search system with related search results
US8732155B2 (en) 2007-11-16 2014-05-20 Iac Search & Media, Inc. Categorization in a system and method for conducting a search
US8352540B2 (en) 2008-03-06 2013-01-08 International Business Machines Corporation Distinguishing data streams to enhance data storage efficiency
US8180771B2 (en) 2008-07-18 2012-05-15 Iac Search & Media, Inc. Search activity eraser
US8560506B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication
US8560505B2 (en) 2011-12-07 2013-10-15 International Business Machines Corporation Automatic selection of blocking column for de-duplication

Also Published As

Publication number Publication date
GB0111648D0 (en) 2001-07-04
GB0210164D0 (en) 2002-06-12

Similar Documents

Publication Publication Date Title
US10885012B2 (en) System and method for large-scale data processing using an application-independent framework
US5307485A (en) Method and apparatus for merging sorted lists in a multiprocessor shared memory system
US6505187B1 (en) Computing multiple order-based functions in a parallel processing database system
Shapiro Join processing in database systems with large main memories
US7389310B1 (en) Supercomputing environment for duplicate detection on web-scale data
US5497486A (en) Method of merging large databases in parallel
US6339777B1 (en) Method and system for handling foreign key update in an object-oriented database environment
Vats et al. Performance evaluation of K-means clustering on Hadoop infrastructure
US20090077078A1 (en) Methods and systems for merging data sets
Kolb et al. Don't match twice: redundancy-free similarity computation with MapReduce
CA2357937A1 (en) Database diagnostic system and method
Kolb et al. Learning-based entity resolution with MapReduce
Dehne et al. Efficient external memory algorithms by simulating coarse-grained parallel algorithms
US6295539B1 (en) Dynamic determination of optimal process for enforcing constraints
Lorie et al. A low communication sort algorithm for a parallel database machine
US20070112865A1 (en) Enforcing constraints from a parent table to a child table
GB2378789A (en) Removal of duplicates from large data sets
Lakshmi et al. Limiting factors of join performance on parallel processors
US20030195869A1 (en) Method and system for query processing by combining indexes of multilevel granularity or composition
Shaw A relational database machine architecture
CN114860722A (en) Data fragmentation method, device, equipment and medium based on artificial intelligence
Marinov A bloom filter application for processing big datasets through MapReduce framework
Sehili et al. Multi-party privacy preserving record linkage in dynamic metric space
CN114945902A (en) Shuffle reduction task with reduced I/O overhead
Keller et al. The one-to-one match operator of the Volcano query processing system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)