CN1973286A - Optimizing database access for record linkage by tiling the space of record pairs - Google Patents
- Publication number
- CN1973286A (application number CN200580006829A)
- Authority
- CN
- China
- Prior art keywords
- quadrant
- database
- section
- record
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A system and method for optimizing database access for record linkage by tiling the space of record pairs are provided, the system including a processor, a segmentation and pairing unit in signal communication with the processor for segmenting database data into data segments and pairing the data segments into data quadrants, and a duplicate detection unit in signal communication with the processor for detecting duplicates for each quadrant; and the method including receiving database data, segmenting the database data into data segments, pairing the data segments into data quadrants, and detecting duplicates for each quadrant.
Description
Cross-Reference to Related Application
This application claims the benefit of U.S. Provisional Application Serial No. 60/550,454 (Attorney Docket No. 2004P03682US), filed on March 5, 2004 and entitled "Optimizing Database Access for Record Linkage by Tiling the Space of Record Pairs", which is incorporated herein by reference in its entirety.
Background
Record linkage in a database is the problem of finding pairs or sets of records that represent the same entity. For a large database that does not fit entirely in random access memory, comparing all possible record pairs involves repeated database reads to bring the records to be compared into memory. This can be a time-consuming and inefficient operation.
In previously considered techniques, each database read loads into memory the records that are to be compared with one another, such as the records that share the same blocking-key value. These approaches have several drawbacks. One drawback is that the number of blocks is large, and therefore the number of database reads required is large. Another drawback is that block sizes can vary over a wide range. For small blocks, the approach wastes memory resources. For blocks that are too large, it causes memory overflow errors.
Therefore, it is desirable to optimize database access for record linkage.
Summary
These and other drawbacks and disadvantages of the prior art are addressed by an exemplary system and method for optimizing database access for record linkage by tiling the space of record pairs.
An exemplary system for optimizing database access for record linkage by tiling the space of record pairs includes a processor, a segmentation and pairing unit in signal communication with the processor for segmenting database data into data segments and pairing the data segments into data quadrants, and a duplicate detection unit in signal communication with the processor for detecting duplicates for each quadrant.
An exemplary method for optimizing database access for record linkage by tiling the space of record pairs includes receiving database data, segmenting the database data into data segments, pairing the data segments into data quadrants, and detecting duplicates for each quadrant.
These and other aspects, features and advantages of the present disclosure will become apparent from the following description of exemplary embodiments, which is to be read in connection with the accompanying drawings.
Brief Description of the Drawings
The present disclosure describes a system and method for optimizing database access for record linkage by tiling the space of record pairs in accordance with the following exemplary figures, in which:
FIG. 1 shows a schematic diagram of a system for optimizing database access for record linkage by tiling the space of record pairs in accordance with an illustrative embodiment of the present disclosure; and
FIG. 2 shows a flow diagram of a method for optimizing database access for record linkage by tiling the space of record pairs in accordance with an illustrative embodiment of the present disclosure.
Detailed Description
A tiling technique is provided for minimizing database reads in record linkage; the technique optimizes database access for record linkage by tiling the space of record pairs. The tiling technique divides the record linkage or duplicate detection problem for a large database into a number of record linkage problems for smaller databases, each of which can be loaded into memory entirely. The technique minimizes the number of database reads and narrows the range of block sizes, so as to maximize the effective use of memory resources and avoid memory overflow errors.
Exemplary embodiments of the present disclosure minimize the number of database reads while ensuring that any pair of records will be available in memory together at some point in time. In addition, these embodiments stabilize and maximize the number of records read into memory at each read.
As shown in FIG. 1, a system for optimizing database access for record linkage by tiling the space of record pairs, according to an illustrative embodiment of the present disclosure, is indicated generally by the reference numeral 100. The system 100 includes at least one processor or central processing unit (CPU) 102 in signal communication with a system bus 104. A read-only memory (ROM) 106, a random access memory (RAM) 108, a display adapter 110, an I/O adapter 112, a user interface adapter 114 and a communications adapter 128 are also in signal communication with the system bus 104. A display unit 116 is in signal communication with the system bus 104 via the display adapter 110. A disk storage unit 118, such as a magnetic or optical disk storage unit, is in signal communication with the system bus 104 via the I/O adapter 112. A mouse 120, a keyboard 122 and an eye tracking device 124 are in signal communication with the system bus 104 via the user interface adapter 114.
A segmentation and pairing unit 170 and a duplicate detection unit 180 are also included in the system 100 and are in signal communication with the CPU 102 and the system bus 104. While the segmentation and pairing unit 170 and the duplicate detection unit 180 are illustrated as coupled to the at least one processor or CPU 102, these components are preferably embodied in computer program code stored in at least one of the memories 106, 108 and 118, wherein the computer program code is executed by the CPU 102.
Turning to FIG. 2, a method for optimizing database access for record linkage by tiling the space of record pairs, according to an illustrative embodiment of the present disclosure, is indicated generally by the reference numeral 200. The method 200 includes a start block 210 that passes control to an input block 212. The input block 212 receives database data and passes control to a function block 214. The function block 214 segments the database data and passes control to a function block 216. The function block 216, in turn, pairs the segments into quadrants and passes control to a function block 218. The function block 218 detects duplicates for each quadrant and passes control to an end block 220.
In operation, the technique allows a large record linkage task or job to be divided into a number of smaller tasks, or quadrants. Each quadrant fits entirely in the RAM of a processing unit. Therefore, the quadrants can be processed sequentially on one CPU or in parallel on multiple independent CPUs.
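Because the quadrants are independent jobs, the same per-quadrant routine can be driven sequentially or by a pool of workers. The following is a minimal sketch under stated assumptions: `count_cross_matches` is a hypothetical stand-in for the real duplicate detection step, segments are small in-memory lists numbered from 0, and a thread pool is used only to illustrate the dispatch pattern (it is not the patent's implementation).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_cross_matches(seg_a, seg_b):
    """Hypothetical per-quadrant job: count values of seg_b that also occur in seg_a."""
    values = set(seg_a)
    return sum(1 for record in seg_b if record in values)


def run_quadrants(segments, max_workers=1):
    """Run one job per quadrant; max_workers=1 is sequential, larger values run jobs concurrently."""
    jobs = list(combinations(range(len(segments)), 2))  # 0-based segment numbers
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda q: count_cross_matches(segments[q[0]], segments[q[1]]), jobs
        )
        return dict(zip(jobs, results))
```

Since no job reads or writes another job's data, the result does not depend on how many workers are used.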
The large database is divided into a number, s, of disjoint and substantially equal segments. The number of records in a segment is determined from two parameters, (1) the memory capacity and (2) the record size, such that two segments can be loaded into memory together. The partitioning criterion is chosen so that reading a segment into memory is as efficient as possible. For example, segments may be defined by ranges of record IDs.
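The sizing rule above can be sketched as follows. This is an illustration only: the function name `plan_segments`, the fixed record size in bytes, and the assumption that record IDs run contiguously from 0 are not taken from the patent.

```python
import math


def plan_segments(n_records, record_bytes, memory_bytes):
    """Return half-open record-ID ranges [lo, hi) such that any two segments fit in memory.

    Assumes fixed-size records with contiguous IDs 0 .. n_records - 1.
    """
    # Two segments are resident at once, so each may use half of the memory budget.
    max_per_segment = (memory_bytes // 2) // record_bytes
    if max_per_segment < 1:
        raise ValueError("memory budget is too small to hold two records")
    s = math.ceil(n_records / max_per_segment)  # fewest segments that respect the budget
    per_segment = math.ceil(n_records / s)      # even out the segment sizes
    return [(lo, min(lo + per_segment, n_records))
            for lo in range(0, n_records, per_segment)]
```

For 1,000 records of 100 bytes and a 50,000-byte budget, each segment may hold at most 250 records, which gives four ranges of 250 IDs each.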
From the s segments, s(s-1)/2 segment pairs are formed. Each pair, called a quadrant, is formed by combining segment number i with segment number j, where i<j.
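As a small illustration, the quadrants for s segments can be enumerated with the standard library. The helper name `quadrants` is hypothetical; segment numbers are 1-based to match the text.

```python
from itertools import combinations


def quadrants(s):
    """List every quadrant (i, j) with 1 <= i < j <= s."""
    return list(combinations(range(1, s + 1), 2))
```

For s = 4 this yields six quadrants, matching s(s-1)/2 = 4*3/2.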
Duplicate detection is performed for each quadrant. In short, the duplicate detection job on a database of N records is divided into s(s-1)/2 duplicate detection jobs, each on a database of 2N/s records. Taken on its own, each of those jobs requires two database reads, one per segment. However, the order in which the jobs are processed can be arranged so that consecutive quadrants share a segment, in which case the total number of database reads for all s(s-1)/2 quadrants equals s(s-1)/2 + 1. One example of such an order is: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s). This is the minimum number of database reads that guarantees that any pair of the N records will be found in memory at the same time.
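The read count can be checked with a short simulation. The sketch below assumes that the example order continues as a serpentine (odd rows ascending in j, even rows descending), which matches the rows shown but is an extrapolation for later rows. It also assumes that memory holds exactly two segments and that one read loads one segment. The function names are hypothetical.

```python
def serpentine_order(s):
    """Quadrant order in which each quadrant shares a segment with its predecessor."""
    order = []
    for i in range(1, s):
        row = [(i, j) for j in range(i + 1, s + 1)]
        if i % 2 == 0:
            row.reverse()  # even rows run from (i, s) down to (i, i + 1)
        order.extend(row)
    return order


def count_segment_reads(order):
    """Count segment loads when memory holds exactly the two segments of the current quadrant."""
    resident = set()
    reads = 0
    for quadrant in order:
        needed = set(quadrant)
        reads += len(needed - resident)  # load only the segments not already in memory
        resident = needed
    return reads
```

For s = 4 the serpentine order needs 7 reads, whereas plain lexicographic order needs 8, because the step from (1,4) to (2,3) replaces both resident segments.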
The quadrants are processed slightly differently from one another. For quadrant (1,2), all pairs of records are considered. Within each quadrant, not every pair under consideration is actually compared: a pair must satisfy a certain condition before it is compared. That is, two records are compared only if they have the same blocking key. Here, a blocking key is a set of pre-specified indices, and the blocking-key value of a record is the string of characters at those specified positions. For quadrant (1,i) with i>2, a pair of two records is considered if (1) one record is in segment 1 and the other is in segment i, or (2) both records are in segment i. For quadrant (i,j) with j>i>1, a pair of two records is considered only if one record is in segment i and the other is in segment j. This variation ensures that any pair among the N records is considered in exactly one quadrant.
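The blocking-key test and the per-quadrant scoping rules can be written as two small predicates. This is a sketch under assumptions: records are modeled as plain strings, key indices are 0-based character positions, segment numbers are 1-based, and the names `blocking_key` and `pair_in_scope` are hypothetical.

```python
from itertools import combinations


def blocking_key(record, indices):
    """Blocking-key value: the characters of `record` at the pre-specified positions."""
    return "".join(record[i] for i in indices)


def pair_in_scope(quadrant, seg_a, seg_b):
    """Decide whether a pair with records in segments seg_a and seg_b is considered in `quadrant`."""
    i, j = quadrant
    cross = {seg_a, seg_b} == {i, j}  # one record in each segment of the quadrant
    same_segment = seg_a == seg_b
    if (i, j) == (1, 2):
        return cross or (same_segment and seg_a in (1, 2))  # all pairs
    if i == 1:
        return cross or (same_segment and seg_a == j)       # cross pairs plus pairs inside segment j
    return cross                                            # cross pairs only
```

A pair that is in scope is then compared only when `blocking_key` returns the same value for both records. For example, with indices [0, 1, 2], "SMITH JOHN" and "SMITH JON" share the key "SMI".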
Thus, this optimization technique achieves good performance by (1) minimizing the number of database reads, (2) making maximal use of the available memory capacity, and (3) ensuring that no record pair is compared twice.
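The third property can be checked on a toy example by enumerating the record pairs that the quadrant rules put under consideration and comparing them with the full set of pairs. The sketch assumes segments are in-memory lists of record IDs in ascending order; `tiled_pairs` is a hypothetical name, and the blocking-key filter is left out so that coverage itself is visible.

```python
from itertools import combinations


def tiled_pairs(segments):
    """Yield each record pair considered under the per-quadrant rules (segments are 1-based)."""
    s = len(segments)
    for i, j in combinations(range(1, s + 1), 2):
        seg_i, seg_j = segments[i - 1], segments[j - 1]
        # Cross-segment pairs are considered in every quadrant.
        for a in seg_i:
            for b in seg_j:
                yield (a, b)
        # Pairs inside segment j are considered in quadrant (1, j).
        if i == 1:
            yield from combinations(seg_j, 2)
        # Pairs inside segment 1 are considered only in quadrant (1, 2).
        if (i, j) == (1, 2):
            yield from combinations(seg_i, 2)
```

With 11 records split into four segments, the generator produces each of the 55 possible pairs once and none twice.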
In alternate embodiments of the apparatus 100, some or all of the computer program code may be stored in registers located on the processor chip 102. In addition, various alternate configurations and implementations of the segmentation and pairing unit 170 and the duplicate detection unit 180, as well as of the other elements of the system 100, may be made.
It is to be understood that the teachings of the present disclosure may be implemented in various forms of hardware, software, firmware, special purpose processors, or combinations thereof. Most preferably, the teachings of the present disclosure are implemented as a combination of hardware and software.
Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interfaces.
The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be part of the microinstruction code, part of the application program, or any combination thereof, and may be executed by a CPU. In addition, various other peripheral units, such as an additional data storage unit and a printing unit, may be connected to the computer platform.
It is to be further understood that, because some of the constituent system components and methods depicted in the accompanying drawings are preferably implemented in software, the actual connections between the system components or the process function blocks may differ depending upon the manner in which the present disclosure is programmed. Given the teachings herein, one of ordinary skill in the pertinent art will be able to contemplate these and similar implementations or configurations of the present disclosure.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the present disclosure is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure. All such changes and modifications are intended to be included within the scope of the present disclosure as set forth in the appended claims.
Claims (20)
1. A method for optimizing database access for record linkage by tiling the space of record pairs, the method comprising:
receiving database data;
segmenting the database data into data segments;
pairing the data segments into data quadrants; and
detecting duplicates for each quadrant.
2. The method of claim 1, wherein segmenting comprises dividing a large database into a plurality of disjoint and substantially equal segments.
3. The method of claim 1, wherein each segment comprises a number of records responsive to the memory capacity and the record size, such that two segments fit within the memory capacity.
4. The method of claim 1, wherein the number of segment pairs formed from s segments is s(s-1)/2 segment pairs or quadrants.
5. The method of claim 4, wherein each pair or quadrant is formed by combining segment number i with segment number j, where i is less than j.
6. The method of claim 4, wherein detecting duplicates for a database with N records is divided into s(s-1)/2 duplicate detection jobs, each on a database with 2N/s records, such that each of those jobs takes two database reads.
7. The method of claim 6, wherein the order of processing the jobs is arranged such that the number of database reads for all s(s-1)/2 quadrants is s(s-1)/2 + 1, which is the minimum number of database reads that ensures that any pair of the N records will be found in memory at the same time.
8. The method of claim 7, wherein the order of processing the quadrant jobs is: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s).
9. The method of claim 1, wherein detecting duplicates for each quadrant comprises:
for quadrant (1,2), considering all pairs;
for quadrant (1,i), considering a pair of two records if one record is in segment 1 and the other is in segment i, or if both records are in segment i; and
for quadrant (i,j), where j>i>1, considering a pair of two records if one record is in segment i and the other is in segment j;
wherein any pair among the N records is considered in exactly one quadrant.
10. A system for optimizing database access for record linkage by tiling the space of record pairs, the system comprising:
at least one processor;
a segmentation and pairing unit in signal communication with the at least one processor for segmenting database data into data segments and pairing the data segments into data quadrants; and
a duplicate detection unit in signal communication with the at least one processor for detecting duplicates for each quadrant.
11. The system of claim 10, further comprising at least one of an input/output adapter and a communications adapter in signal communication with the processor for receiving database data.
12. The system of claim 10, wherein the segmentation and pairing unit comprises means for dividing a large database into a plurality of disjoint and substantially equal segments.
13. The system of claim 10, wherein the segmentation and pairing unit comprises means for including in each segment a number of records responsive to the memory capacity and the record size, such that two segments fit within the memory capacity.
14. The system of claim 10, wherein the segmentation and pairing unit comprises means for determining the number of segment pairs formed from s segments to be s(s-1)/2 segment pairs or quadrants.
15. The system of claim 10, wherein the segmentation and pairing unit comprises means for forming each pair or quadrant by combining segment number i with segment number j, where i is less than j.
16. The system of claim 10, wherein the duplicate detection unit comprises means for detecting duplicates in a database with N records by dividing the task into s(s-1)/2 duplicate detection jobs, each on a database with 2N/s records, such that each of those jobs takes two database reads.
17. The system of claim 16, wherein the duplicate detection unit comprises means for ordering the processing of the jobs such that the number of database reads for all s(s-1)/2 quadrants is s(s-1)/2 + 1, which is the minimum number of database reads that ensures that any pair of the N records will be found in memory at the same time.
18. The system of claim 17, wherein the duplicate detection unit comprises means for ordering the processing of the quadrant jobs as: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s).
19. The system of claim 10, wherein the duplicate detection unit comprises means for detecting duplicates for each quadrant, the means comprising:
means for considering all pairs for quadrant (1,2);
means for considering, for quadrant (1,i), a pair of two records if one record is in segment 1 and the other is in segment i, or if both records are in segment i; and
means for considering, for quadrant (i,j), where j>i>1, a pair of two records if one record is in segment i and the other is in segment j;
wherein any pair among the N records is considered in exactly one quadrant.
20. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform program steps for optimizing database access for record linkage by tiling the space of record pairs, the program steps comprising:
receiving database data;
segmenting the database data into data segments;
pairing the data segments into data quadrants; and
detecting duplicates for each quadrant.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US55045404P | 2004-03-05 | 2004-03-05 | |
US60/550,454 | 2004-03-05 | ||
US11/067,992 | 2005-02-28 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1973286A (en) | 2007-05-30 |
CN100543738C (en) | 2009-09-23 |
Family
ID=38113177
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2005800068291A (CN100543738C, Expired - Fee Related) | 2004-03-05 | 2005-03-02 | Optimizing database access for record linkage by tiling the space of record pairs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100543738C (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112817767A (en) * | 2021-02-24 | 2021-05-18 | 上海交通大学 | Method and system for realizing optimization of graph computation working set under separated combined architecture |
CN113748388A (en) * | 2019-03-01 | 2021-12-03 | 西门子股份公司 | Method and apparatus for computer-aided optimization of tool occupancy of library space |
Also Published As
Publication number | Publication date |
---|---|
CN100543738C (en) | 2009-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6952794B2 (en) | Method, system and apparatus for scanning newly added disk drives and automatically updating RAID configuration and rebuilding RAID data | |
CN1760875B (en) | Transparent migration of files among various types of storage volumes based on file access properties | |
CN100409240C (en) | System and method for efficient file content searching within a file system | |
CN1113291C (en) | Automatic configuration generation | |
US8271991B2 (en) | Method of analyzing performance in a storage system | |
US7353496B2 (en) | Storage controller software development support system and software development support method | |
CN102236699B (en) | For quick superscale process is to normalized | |
US20140325148A1 (en) | Data storage devices which supply host with data processing latency information, and related data processing methods | |
US20070005556A1 (en) | Probabilistic techniques for detecting duplicate tuples | |
CN100590596C (en) | Multi-node computer system and method for monitoring capability | |
US11599463B2 (en) | Servicing queries during data ingress | |
US7403936B2 (en) | Optimizing database access for record linkage by tiling the space of record pairs | |
US20080168226A1 (en) | Correction method for reading data of disk array system | |
CN101196889A (en) | Document placing method and device of optimizing memory system | |
US20240118939A1 (en) | Utilizing key value-based record distribution data to perform parallelized segment generation in a database system | |
CN100543738C (en) | Optimizing database access for record linkage by tiling the space of record pairs | |
CN1679009B (en) | Method and apparatus to permit external access to internal configuration register | |
Kotz | Expanding the potential for disk-directed I/O | |
US7162665B2 (en) | Information processing system, method for outputting log data, and computer-readable medium storing a computer software program for the same | |
CN107632779A (en) | Data processing method and device, server | |
JPH08129461A (en) | Auxiliary storage device | |
US11983424B2 (en) | Read disturb information isolation system | |
US11763898B2 (en) | Value-voltage-distirubution-intersection-based read disturb information determination system | |
Patki et al. | Computer Hardware Devices in Efficient E-Servicing: Case Study of Disk Scheduling by Soft Computing | |
Agrawal et al. | Comparative Analysis of RocksDB, LMDB, and MongoDB: A Performance Evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090923; Termination date: 20120302 |