CN1973286A - Optimizing database access for record linkage by tiling the space of record pairs - Google Patents

Optimizing database access for record linkage by tiling the space of record pairs

Info

Publication number
CN1973286A
Authority
CN
China
Prior art keywords
quadrant
database
section
record
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200580006829
Other languages
Chinese (zh)
Other versions
CN100543738C (en)
Inventor
P·H·蒋
S·桑迪尔亚
W·A·兰迪
R·B·劳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens Medical Solutions USA Inc
Original Assignee
Siemens Medical Solutions USA Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Medical Solutions USA Inc
Publication of CN1973286A
Application granted
Publication of CN100543738C
Expired - Fee Related
Anticipated expiration

Abstract

A system and method for optimizing database access for record linkage by tiling the space of record pairs are provided, the system including a processor, a segmentation and pairing unit in signal communication with the processor for segmenting database data into data segments and pairing the data segments into data quadrants, and a duplicate detection unit in signal communication with the processor for detecting duplicates for each quadrant; and the method including receiving database data, segmenting the database data into data segments, pairing the data segments into data quadrants, and detecting duplicates for each quadrant.

Description

Optimizing database access for record linkage by tiling the space of record pairs
Cross-reference to related applications
This application claims the benefit of U.S. Provisional Application Serial No. 60/550,454 (Attorney Docket No. 2004P03682US), filed on March 5, 2004 and entitled "Optimizing Database Access for Record Linkage by Tiling the Space of Record Pairs", which is incorporated herein by reference in its entirety.
Background
Record linkage in a database is the problem of finding pairs or sets of records that represent the same entity. For a large database that does not fit entirely in random access memory, comparing all possible record pairs involves repeated database reads to bring the data records that need to be compared into memory. This can be a time-consuming and inefficient operation.
In previously considered techniques, each database read loads into memory those records that are to be compared, such as those records that share the same blocking key value. These approaches have several drawbacks. One drawback is that the number of blocks is large, so the number of database reads required is also large. Another drawback is that block sizes can vary over a wide range. For small blocks, this approach wastes memory resources. For blocks that are too large, it causes memory overflow errors.
It is therefore desirable to optimize database access for record linkage.
Summary of the invention
These and other drawbacks and disadvantages of the prior art are addressed by an exemplary system and method for optimizing database access for record linkage by tiling the space of record pairs.
An exemplary system for optimizing database access for record linkage by tiling the space of record pairs includes a processor, a segmentation and pairing unit in signal communication with the processor for segmenting database data into data segments and pairing the data segments into data quadrants, and a duplicate detection unit in signal communication with the processor for detecting duplicates for each quadrant.
An exemplary method for optimizing database access for record linkage by tiling the space of record pairs includes receiving database data, segmenting the database data into data segments, pairing the data segments into data quadrants, and detecting duplicates for each quadrant.
These and other aspects, features and advantages of the present disclosure will become apparent from the following description of exemplary embodiments, which is to be read in connection with the accompanying drawings.
Brief description of the drawings
The present disclosure teaches a system and method for optimizing database access for record linkage by tiling the space of record pairs in accordance with the following exemplary figures, in which:
Fig. 1 shows a schematic diagram of a system for optimizing database access for record linkage by tiling the space of record pairs, in accordance with an illustrative embodiment of the present disclosure; and
Fig. 2 shows a flowchart of a method for optimizing database access for record linkage by tiling the space of record pairs, in accordance with an illustrative embodiment of the present disclosure.
Detailed description
A tiling technique is provided for minimizing database reads during record linkage. The technique optimizes database access for record linkage by tiling the space of record pairs. It divides the record linkage or duplicate detection problem for a large database into a number of record linkage problems for smaller databases, each of which can be loaded entirely into memory. The technique minimizes the number of database reads and narrows the range of block sizes, so that memory resources are used as effectively as possible and memory overflow errors are avoided.
Exemplary embodiments of the present disclosure minimize the number of database reads while ensuring that any pair of records is available in memory together at some point in time. In addition, these embodiments stabilize and maximize the number of records brought into memory by each read.
As shown in Fig. 1, a system for optimizing database access for record linkage by tiling the space of record pairs, according to an illustrative embodiment of the present disclosure, is indicated generally by the reference numeral 100. The system 100 includes at least one processor or central processing unit (CPU) 102 in signal communication with a system bus 104. A read-only memory (ROM) 106, a random access memory (RAM) 108, a display adapter 110, an I/O adapter 112, a user interface adapter 114 and a communications adapter 128 are also in signal communication with the system bus 104. A display unit 116 is in signal communication with the system bus 104 via the display adapter 110. A disk storage unit 118, such as a magnetic or optical disk storage unit, is in signal communication with the system bus 104 via the I/O adapter 112. A mouse 120, a keyboard 122 and an eye tracking device 124 are in signal communication with the system bus 104 via the user interface adapter 114.
A segmentation and pairing unit 170 and a duplicate detection unit 180 are also included in the system 100 and are in signal communication with the CPU 102 and the system bus 104. Although the segmentation and pairing unit 170 and the duplicate detection unit 180 are illustrated as coupled to the at least one processor or CPU 102, these components are preferably embodied in computer program code stored in at least one of the memories 106, 108 and 118, wherein the computer program code is executed by the CPU 102.
Turning to Fig. 2, a method for optimizing database access for record linkage by tiling the space of record pairs, according to an illustrative embodiment of the present disclosure, is indicated generally by the reference numeral 200. The method 200 includes a start block 210 that passes control to an input block 212. The input block 212 receives database data and passes control to a function block 214. The function block 214 segments the database data and passes control to a function block 216. The function block 216, in turn, pairs the segments into quadrants and passes control to a function block 218. The function block 218 detects duplicates for each quadrant and passes control to an end block 220.
In operation, the technique allows a large record linkage task or job to be divided into a number of smaller jobs, or quadrants. Each quadrant fits entirely into the RAM of a processing unit. The quadrants can therefore be processed sequentially on one CPU or in parallel on multiple separate CPUs.
The large database is divided into a number, s, of disjoint and substantially equal segments. The number of records in a segment is determined from two parameters, (1) the memory capacity and (2) the size of a record, such that two segments can be loaded into memory together. The partitioning criterion is chosen so that reading a segment into memory is as efficient as possible. For example, a segment may be defined by a range of record IDs.
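The sizing rule for segments can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `plan_segments`, the fixed per-record size in bytes, and the use of half-open index ranges over records sorted by ID are all assumptions made for the example.

```python
import math


def plan_segments(num_records, record_size_bytes, memory_bytes):
    """Split num_records into s near-equal, disjoint index ranges.

    Hypothetical helper: ranges are half-open [start, end) positions in
    a table assumed to be ordered by record ID. Two segments together
    never exceed memory_bytes.
    """
    if record_size_bytes <= 0 or memory_bytes < 2 * record_size_bytes:
        raise ValueError("memory must hold at least two records")
    # Largest segment such that two segments fit in memory at once.
    max_per_segment = memory_bytes // (2 * record_size_bytes)
    # At least two segments are needed to form one segment pair.
    s = max(2, math.ceil(num_records / max_per_segment))
    base, extra = divmod(num_records, s)
    bounds, start = [], 0
    for k in range(s):
        size = base + (1 if k < extra else 0)  # spread the remainder
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    for lo, hi in plan_segments(1001, 100, 50_000):
        print(lo, hi)
```

Spreading the remainder over the first few segments keeps the segments "substantially equal", so every read brings in close to the same number of records.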
From the s segments, s(s-1)/2 segment pairs are formed. Each pair, called a quadrant, is formed by combining segment number i with segment number j, where i < j.
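As a small illustration of the pairing step, the quadrants can be enumerated with the standard library. The function name `quadrants` and the 1-based segment numbering are choices made for this sketch.

```python
from itertools import combinations


def quadrants(s):
    """Return all segment pairs (i, j) with 1 <= i < j <= s."""
    return list(combinations(range(1, s + 1), 2))


if __name__ == "__main__":
    qs = quadrants(5)
    print(len(qs))  # equals 5 * 4 / 2
    print(qs)
```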
Duplicate detection is performed for each quadrant. In short, the duplicate detection job on a database of N records is divided into s(s-1)/2 duplicate detection jobs, each on a database of 2N/s records. Taken on its own, each of those jobs requires two database reads. However, when consecutive quadrants share a segment, only one new segment has to be read for each quadrant after the first. The order in which the jobs are processed can therefore be arranged so that the total number of database reads for all s(s-1)/2 quadrants equals s(s-1)/2 + 1. An example of such an order is: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s). This is the minimum number of database reads that ensures any pair of the N records is found in memory simultaneously.
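The read count for a given processing order can be checked with a short simulation. This is a sketch under stated assumptions: memory holds exactly two segments, one database read loads one segment, and the names `snake_order` and `count_reads` are hypothetical. The order generated here alternates the direction of j from one value of i to the next, which matches the example order for quadrants of segments.

```python
def snake_order(s):
    """Order quadrants so that consecutive quadrants share a segment."""
    order = []
    for i in range(1, s):
        js = range(i + 1, s + 1)
        if i % 2 == 0:
            js = reversed(js)  # even i: walk j downward from s
        order.extend((i, j) for j in js)
    return order


def count_reads(order):
    """Count segment loads when memory holds two segments at a time."""
    in_memory = set()
    reads = 0
    for quadrant in order:
        needed = set(quadrant)
        reads += len(needed - in_memory)  # load only missing segments
        in_memory = needed
    return reads


if __name__ == "__main__":
    s = 6
    order = snake_order(s)
    print(order)
    print(count_reads(order), s * (s - 1) // 2 + 1)
```

For comparison, processing the quadrants in plain lexicographic order, for example (1,2) (1,3) (1,4) (2,3) ..., sometimes needs two loads for one quadrant, so its total exceeds s(s-1)/2 + 1 once s is 4 or more.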
The quadrants are processed slightly differently from one another. For quadrant (1,2), all pairs are considered. For a quadrant (1,i), a pair of two records is considered if (1) one record is in segment 1 and the other is in segment i, or (2) both records are in segment i. For a quadrant (i,j) with j > i > 1, a pair of two records is considered if one record is in segment i and the other is in segment j. This variation ensures that any pair of the N records is considered in exactly one quadrant. Within each quadrant, not every considered pair is actually compared: a pair must satisfy a certain condition before it is compared. That is, two records are compared only if they have the same blocking key. Here, a blocking key is a set of pre-specified indices, and the blocking key value of a record is the string of characters at those specified positions.
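The per-quadrant rules for which pairs are considered, followed by the blocking key filter, can be sketched as follows. The sketch assumes each record is a plain string and each segment is a list of such strings; the names `blocking_key`, `candidate_pairs` and `pairs_to_compare` are hypothetical, and the actual comparison of two records is left out.

```python
from itertools import combinations, product


def blocking_key(record, positions):
    """Characters of the record at the pre-specified index positions."""
    return "".join(record[p] for p in positions)


def candidate_pairs(i, j, seg_i, seg_j):
    """Yield the record pairs considered in quadrant (i, j), i < j."""
    yield from product(seg_i, seg_j)        # one record from each segment
    if i == 1:
        yield from combinations(seg_j, 2)   # both records in segment j
        if j == 2:
            yield from combinations(seg_i, 2)  # both records in segment 1


def pairs_to_compare(i, j, seg_i, seg_j, positions):
    """Keep only considered pairs whose blocking key values match."""
    for a, b in candidate_pairs(i, j, seg_i, seg_j):
        if blocking_key(a, positions) == blocking_key(b, positions):
            yield a, b


if __name__ == "__main__":
    segments = {1: ["SMITH J", "SMYTH J"], 2: ["JONES A", "SMITH JOHN"]}
    for pair in pairs_to_compare(1, 2, segments[1], segments[2], [0, 1, 2]):
        print(pair)
```

Within-segment pairs for segment 1 are handled only in quadrant (1,2), and within-segment pairs for any other segment i are handled only in quadrant (1,i), which is why no pair is considered twice.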
Thus, the good performance of this optimization technique is achieved by (1) minimizing the number of database reads, (2) making maximal use of the available memory capacity, and (3) ensuring that no record pair is compared twice.
In alternate embodiments of the apparatus 100, some or all of the computer program code may be stored in registers located on the processor chip 102. In addition, various alternate configurations and implementations of the segmentation and pairing unit 170 and the duplicate detection unit 180, as well as of the other elements of the system 100, may be made.
It is to be understood that the teachings of the present disclosure may be implemented in various forms of hardware, software, firmware, special purpose processors, or combinations thereof. Most preferably, the teachings of the present disclosure are implemented as a combination of hardware and software.
Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interfaces.
The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be part of the microinstruction code, part of the application program, or any combination thereof, and may be executed by a CPU. In addition, various other peripheral units, such as an additional data storage unit and a printing unit, may be connected to the computer platform.
It is to be further understood that, because some of the constituent system components and methods depicted in the accompanying drawings are preferably implemented in software, the actual connections between the system components or the process function blocks may differ depending on the manner in which the present disclosure is programmed. Given the teachings herein, one of ordinary skill in the pertinent art will be able to contemplate these and similar implementations or configurations of the present disclosure.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the present disclosure is not limited to those precise embodiments, and that various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure. All such changes and modifications are intended to be included within the scope of the present disclosure as set forth in the appended claims.

Claims (20)

1. A method for optimizing database access for record linkage by tiling the space of record pairs, the method comprising:
receiving database data;
segmenting the database data into data segments;
pairing the data segments into data quadrants; and
detecting duplicates for each quadrant.
2. The method of claim 1, wherein segmenting comprises dividing a large database into a plurality of disjoint and substantially equal segments.
3. The method of claim 1, wherein each segment comprises a number of records responsive to the memory capacity and the size of a record, such that two segments fit within the memory capacity.
4. The method of claim 1, wherein the number of segment pairs formed from s segments is s(s-1)/2 segment pairs or quadrants.
5. The method of claim 4, wherein each pair or quadrant is formed by combining segment number i with segment number j, where i is less than j.
6. The method of claim 4, wherein detecting duplicates on a database of N records is divided into s(s-1)/2 duplicate detection jobs, each on a database of 2N/s records, such that each of those jobs performs two database reads.
7. The method of claim 6, wherein the order of processing the jobs is arranged such that the number of database reads for all s(s-1)/2 quadrants is s(s-1)/2 + 1, which is the minimum number of database reads that ensures any pair of the N records is found in memory simultaneously.
8. The method of claim 7, wherein the order of processing the quadrant jobs is: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s).
9. The method of claim 1, wherein detecting duplicates for each quadrant comprises:
for quadrant (1,2) of the segments, considering all pairs;
for a quadrant (1,i), considering a pair of two records if one record is in segment 1 and the other record is in segment i, or if both records are in segment i; and
for a quadrant (i,j), where j > i > 1, considering a pair of two records if one record is in segment i and the other record is in segment j;
wherein any pair of the N records is considered in exactly one quadrant.
10. A system for optimizing database access for record linkage by tiling the space of record pairs, the system comprising:
at least one processor;
a segmentation and pairing unit in signal communication with the at least one processor for segmenting database data into data segments and pairing the data segments into data quadrants; and
a duplicate detection unit in signal communication with the at least one processor for detecting duplicates for each quadrant.
11. The system of claim 10, further comprising at least one of an input/output adapter and a communications adapter in signal communication with the processor for receiving the database data.
12. The system of claim 10, wherein the segmentation and pairing unit comprises means for dividing a large database into a plurality of disjoint and substantially equal segments.
13. The system of claim 10, wherein the segmentation and pairing unit comprises means for including in each segment a number of records responsive to the memory capacity and the size of a record, such that two segments fit within the memory capacity.
14. The system of claim 10, wherein the segmentation and pairing unit comprises means for determining the number of segment pairs formed from s segments to be s(s-1)/2 segment pairs or quadrants.
15. The system of claim 10, wherein the segmentation and pairing unit comprises means for forming each pair or quadrant by combining segment number i with segment number j, where i is less than j.
16. The system of claim 10, wherein the duplicate detection unit comprises means for detecting duplicates on a database of N records by dividing that task into s(s-1)/2 duplicate detection jobs, each on a database of 2N/s records, such that each of those jobs performs two database reads.
17. The system of claim 16, wherein the duplicate detection unit comprises means for ordering the processing of the jobs such that the number of database reads for all s(s-1)/2 quadrants is s(s-1)/2 + 1, which is the minimum number of database reads that ensures any pair of the N records is found in memory simultaneously.
18. The system of claim 17, wherein the duplicate detection unit comprises means for ordering the processing of the quadrant jobs as: (1,2) (1,3) ... (1,s) (2,s) (2,s-1) ... (2,3) (3,4) (3,5) ... (s-1,s).
19. The system of claim 10, wherein the duplicate detection unit comprises means for detecting duplicates for each quadrant, the means comprising:
means for considering all pairs for quadrant (1,2) of the segments;
means for considering, for a quadrant (1,i), a pair of two records if one record is in segment 1 and the other record is in segment i, or if both records are in segment i; and
means for considering, for a quadrant (i,j), where j > i > 1, a pair of two records if one record is in segment i and the other record is in segment j;
wherein any pair of the N records is considered in exactly one quadrant.
20. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform program steps for optimizing database access for record linkage by tiling the space of record pairs, the program steps comprising:
receiving database data;
segmenting the database data into data segments;
pairing the data segments into data quadrants; and
detecting duplicates for each quadrant.
CNB2005800068291A 2004-03-05 2005-03-02 Optimizing database access for record linkage by tiling the space of record pairs Expired - Fee Related CN100543738C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US55045404P 2004-03-05 2004-03-05
US60/550,454 2004-03-05
US11/067,992 2005-02-28

Publications (2)

Publication Number Publication Date
CN1973286A (en) 2007-05-30
CN100543738C (en) 2009-09-23

Family

ID=38113177

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005800068291A Expired - Fee Related CN100543738C (en) 2004-03-05 2005-03-02 Optimizing database access for record linkage by tiling the space of record pairs

Country Status (1)

Country Link
CN (1) CN100543738C (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112817767A (en) * 2021-02-24 2021-05-18 上海交通大学 Method and system for realizing optimization of graph computation working set under separated combined architecture
CN113748388A (en) * 2019-03-01 2021-12-03 西门子股份公司 Method and apparatus for computer-aided optimization of tool occupancy of library space


Also Published As

Publication number Publication date
CN100543738C (en) 2009-09-23


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090923

Termination date: 20120302