WO1997011433A1 - Realisation d'operations d'association de grands tableaux - Google Patents

Realisation d'operations d'association de grands tableaux Download PDF

Info

Publication number
WO1997011433A1
WO1997011433A1 PCT/US1996/015221 US9615221W WO9711433A1 WO 1997011433 A1 WO1997011433 A1 WO 1997011433A1 US 9615221 W US9615221 W US 9615221W WO 9711433 A1 WO9711433 A1 WO 9711433A1
Authority
WO
WIPO (PCT)
Prior art keywords
join
records
index
input
tables
Prior art date
Application number
PCT/US1996/015221
Other languages
English (en)
Inventor
Zhe Li
Kenneth A. Ross
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/531,789 external-priority patent/US5666525A/en
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Publication of WO1997011433A1 publication Critical patent/WO1997011433A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24558Binary matching operations
    • G06F16/2456Join operations

Definitions

  • the present invention is directed towards an efficient join operation in a database system, and more specifically toward joining tables containing a large amount of data using a join index in a system with a relatively small amount of main memory.
  • a parallel-merge operation is used as part of the technique.
  • join operation used in database systems is the fundamental operation that allows information from different data tables to be combined in a selected way.
  • Tables consist of collections of data grouped by a common subject matter, such as names or ages.
  • the join operation allows a user to combine selected groups of data from multiple input data tables according to a specified condition between the records in each table to be combined. For example, a user may want join two tables, the first containing names and telephone numbers, and the second containing names and addresses, to produce a result which contains all the records that include common names between the two tables with both their telephone numbers and addresses.
  • the join operation is used during a "query", or data request operation, by the user of a database system. Join operations are typically expensive in terms of processing time making efficiency a critical component to performance of the database system.
  • tables of data will be very small relative to the amount of main memory (also called random access memory or "RAM") of the computer in the data base system.
  • main memory also called random access memory or "RAM”
  • An example of a relatively small table is one that contains data of 30 students names and their quarterly grades. Tables are normally organized by columns of related data. In this case, all the names would appear in one column entitled "Names". In this example of 30 records, the entire table can be read into a typical main memory of a computer in one operation and can be completely processed while contained in RAM.
  • Join results from joining input tables can be computed in a number of different ways.
  • the term "ad- hoc join” is used to describe the process of taking two input tables in a database and forming the join result by processing the entire standard representations of each table, without the benefit of any pre-computed special data structures such as indices. When the pre ⁇ computed special data structures are present, the join operation will perform more efficiently.
  • join index One such pre-computed access structure is called a join index.
  • the join index was introduced by Valduriez in "Join Indices", ACM Transactions on Database Systems, 12 . (2) :218-246, 1987.
  • a join index between two input tables maintains pairs of identifiers for records that would match if a particular join operation is performed. For example, one entry in a join index might be (1,3) which indicates that the record with an identity of "1" in the first table will be joined with the record with an identity of "3" in the second table when a particular join operation is performed.
  • the join index may be maintained by the database system, and updated when records are inserted or deleted in the underlying tables.
  • Valduriez proposes a technique to perform join operations using a join index in his article.
  • the described method for performing the join operation requires a large amount of main memory (see Valduriez article, page 223) .
  • the four steps of the Valduriez method for performing the join operation using a join index on two input tables designated A and B stored in a database are: (1) read in a portion of the join index (previously constructed) and selected matches between the join index entries and the records in A which will fit into available main memory; (2) internally sort the records from A and the join index entries by their table B identifiers; (3) select matches between the join index entries and the records in B outputting the resulting record; and (4) read in the next portion of the join index and the selected records of A. Repeat steps 2, 3 and 4 until all index entries in the join index are processed.
  • the Valduriez method has significant drawbacks • when the input tables are large and the processor uses a relatively small main memory.
  • An important consideration in measuring the efficiency of a join operation is the number and kind of disk accesses performed during that operation.
  • the "cost" (processing time) of I/O operations becomes increasingly important when the tables to be joined are very large.
  • a typical access of one block (typically 8000 bytes) of data stored on a disk using a conventional Fujitsu M2266 one gigabyte drive is about two milliseconds.
  • the cost of rotational latency (moving the disk in a circular motion to the specified address) for that drive is about eight milliseconds.
  • the seek time (moving the disk head up and down to the proper location on the disk) is about 9.5 milliseconds.
  • Other drives have proportionate access times.
  • a method and system for performing a join operation in a database system combining multiple tables comprised of individual records using a join index uses a parallel-merge technique as part of the overall method.
  • the method and system of the present invention performs most efficiently when the input tables are large and the main memory used by the processor is small relative to the size of the input tables.
  • Two embodiments of the technique for joining tables using a join index requires reading each input table only once and writing the output data to an output file or files only once.
  • An alternate embodiment requires additional input/output operations in order to process each input table completely independently providing the advantages of parallel processing.
  • the technique only reads data blocks containing selected records from the input tables which will be part of the join result. Minimizing the number of I/O operations by reducing the number of necessary read and write operations is very important for the efficiency of a join operation when the input tables are very large because I/O operations are very time consuming.
  • the method of the present invention has three embodiments, all using the parallel-merge technique which sorts multiple files efficiently by identifying the lowest value in all the files.
  • the first embodiment is named “slam-join” and is applicable to joining two input tables efficiently while using the parallel-merge technique.
  • the second embodiment is named “parallel-join” and applies the parallel-merge technique to joining three or more input tables at one time.
  • the third embodiment is named “multi-slam-join” which is applicable to joining three or more input tables at one time and combines the benefits of slam- join with the "jive-join” technique described in the parent application.
  • the slam-join method includes the following steps: First, a pre-existing join index and records specified in the join index from the first table are read until available main memory is full . The records are sorted by their corresponding second table join index entries and stored in first output files. The join index entries are also stored in temporary files. The next set of first table records and join index entries are read and processed until the entire selected first table is done. Buffers are then allocated for each temporary file and the join index entries in all the temporary files are parallel-merged. The identified records from the second input table are sequentially read and placed in the appropriate buffers. The records from each buffer are then written to an output file and joined together in the same order as the first output files to ensure that the records from each table will achieve the proper join result specified in the join index. Each table at the end of the join operation has a separate file or collection of associated files for its associated output which helps minimize the number of I/O processes by allowing the processor with a small main memory to completely process each input table only once.
  • the parallel-join embodiment processes each input table completely independently using the parallel-merge technique.
  • Each input table is partitioned into temporary files of data which fit into available main memory based on the join index entries.
  • the files are then sorted and parallel-merged to read in sequentially the records for the given input table.
  • the records are then written to output files in the original order of the temporary files to ensure that the records from each table will achieve the proper join result specified in the join index.
  • the multi-slam-join embodiment uses the elements of the jive-join technique described in the parent application and applies an adaptation of the slam-join method to the first and last input tables while processing the intermediate input tables using an adaptation of the jive-join technique.
  • the result is an improvement on the jive-join technique by reducing the number of allocation steps.
  • a join index is not pre-existing for a group of tables to be joined, an index can be created by using a number of different methods.
  • One way is called the nested-loop method which checks for a possible join result between each record of every table against the records in all the other tables. After a join index is created, the parallel-merge based techniques can be used.
  • the slam-join, multi-slam-join and parallel-join methods work for relational databases, object-oriented databases, and other types of database which support some type of record indexing system.
  • One system platform which supports the implementa ⁇ tion of the parallel-merge based methods includes a computer system with a central processing unit, limited main memory (RAM) , a mass storage medium which contains input tables of large amounts of data and space to store the results of the join operation, and connectors to transfer data between the central processing unit and the mass storage medium.
  • RAM limited main memory
  • Figure 1 is a flow chart of the slam-join method in accordance with the invention
  • Figure 2 is an expanded flow chart of the parallel-merge step of Fig. 1;
  • Figure 3 is a graphical representation of two input tables and a join index
  • Figure 4A is a graphical representation of part of the intermediate and output files created during the slam-join method when joining the two tables in Fig. 3;
  • Figures 4B and 4C are further graphical representations of part of the intermediate and output files created during the slam-join method when joining the two tables in Fig. 3;
  • Figures 5A-5E are graphical representations of the intermediate files created during the parallel-merge portion of the slam-join method when joining the two tables in Fig. 3.
  • Figure 5F is a graphical representation of the join result of the two tables in Fig. 3 using the join index in Fig. 3;
  • Figure 6 is a flow chart of the parallel-join method in accordance with the invention.
  • Figure 7 is a graphical representation of three input tables and a join index
  • Figures 8A, 8B and 8C are graphical representa- tions of part of the intermediate and output files created during the parallel-join method when joining the three tables in Fig. 7;
  • Figure 8D is a graphical representation of the join resort of the three tables and join index in Fig. 7 using the parallel-join method
  • FIG. 9 is a flow chart of the multi-slam-join method in accordance with the invention.
  • Figure 10 is a graphical representation of a multi-dimensional buffering scheme used in the multi- slam-join method when three tables are joined;
  • Figures IIA, IIB and IIC are graphical representa ⁇ tions of part of the intermediate and output files created during the multi-slam-join method when joining the three tables in Fig. 7;
  • Figure 11D is a graphical representation of the join result of the three tables in Fig. 7;
  • Figure 12A is a graphical representation of two tables upon which a join index has not yet been created
  • Figure 12B is a flow chart of the steps of the nested loop method which creates a join index for two tables;
  • Figure 12C is a graphical representation of the join index created by the method shown in Figure 12B when applied to the input relations shown in Figure 12A;
  • Figure 13 is a graphical representation of two input object collections in an object-oriented data base
  • Figure 14 is a graphical representation of a local selection which can be used in conjunction with the slam-join technique
  • Figure 15 is a graphical representation of the operation of a hybrid join index which can be used in conjunction with the slam-join technique.
  • Figure 16 is a schematic diagram of a computer system upon which the slam-join, multi-slam-join and parallel join method can be implemented effectively.
  • the present invention is a join technique in a database system using a join index which will efficiently process large input tables in a join operation to produce a desired join result when the processor used for the operation uses a relatively small main memory.
  • the technique minimizes necessary I/O operations and maximizes the use of the available main memory.
  • Three embodiments of the technique are described herein, all of which use a parallel-merge operation which is described in detail in Fig. 2.
  • the first embodiment of the technique will be called “slam-join.”
  • Slam-join is used for combining two tables during a single join operation.
  • the second embodiment will be called “parallel-join” and is used for combining three or more tables with a single join operation.
  • FIG. 1 is a flow chart showing the steps performed in the slam-join technique for joining two tables of data.
  • the technique is preferably performed in a main memory which is relatively smaller than a secondary memory or memories containing the two input tables in order to maximize its advantages over other techniques.
  • main memory is that portion of memory, usually random access memory (RAM) , which is not allocated to another specific function.
  • Step 101 reads into available main memory a portion of the join index and a portion of records from the first input table (designated Rj in this description) identified in the join index portion which was just read.
  • the portion of records and join index entries read into available main memory will be as much as will fit into the memory.
  • the size of the records from R[ identified in the join index could be two, three, fifty or even one thousand times or more times the size of available main memory.
  • the number of actual records read into available main memory at one time will depend on the size of each record and the size of available main memory and the data distribution in the join index. Data will be read in disk blocks which typically include many records.
  • Step 103 then appends the associated memory addresses of the R, records stored in available main memory to the corresponding R 2 RIDs in the join index entries also stored in available main memory.
  • the use of memory addresses will increase processing speed when the R, records are sorted by their associated R 2 join index RIDs in a future step.
  • the addresses themselves will be sorted with the R 2 RIDs instead of the actual records. It is faster to sort the addresses rather than the records because the records are much larger in size than the addresses.
  • Step 105 sorts the R 2 join index RIDs in ascending order and correspondingly places the appended R, record memory addresses in the R 2 RID order.
  • the R, records themselves could be physically sorted and moved instead of only their addresses although that approach would require more processing time and would therefore be less desirable most of the time.
  • Step 107 then writes the R, records in the order of the sorted memory addresses to a separate R, output file designated for each pass.
  • Step 107 is accomplished by writing the contents of each sorted memory address location which are the corresponding R, records.
  • the sorted R 2 identifiers are also written to a separate temporary file associated with each R, output file which will be used later in the technique.
  • the temporary file containing the R 2 identifiers may be stored in part of main memory or may be stored in a file located outside of main memory.
  • Step 109 checks if there is any portion of the join index (and thus R, records) which has not yet been read. If there is a remaining portion, the technique jumps back to step 101. Steps 101-107 will be then repeated as necessary to read in the entire join index and the identified R, records.
  • Step 111 clears available main memory and allocates a number of buffers in the available main memory equal to the number of passes required for reading in the full join index and corresponding R j records.
  • Each buffer will eventually contain the R 2 records which correspond to the R 2 RIDs index stored in the temporary files for each pass performed.
  • Each buffer will contain a separate merge file which consists of a portion of the R 2 RIDs of an associated temporary file.
  • These merge files will be illustrated in Figure 5A.
  • R 2 RIDs indicate the location of the R 2 records which correspond to the previously stored R, records.
  • a graphical example shown in Fig. 5A shows the buffer structure and explain in detail its operation.
  • Step 113 performs a parallel-merge operation on the R 2 RIDs stored in the merge files.
  • the merge files are filled with as much of the R 2 RIDs from the corresponding temporary files as will fit into available main memory when the identified records in the merge files are read.
  • the size of the merge files is an easy calculation once the size of the records, number of passes and size of available main memory is known.
  • the R 2 RIDs are then merged such that the lowest value RID is identified and the corresponding R 2 record is read from the second input table and placed in the corresponding buffer associated with the R 2 RID just selected.
  • the identifier in the merge file is removed and the next identifier from the merge file becomes the lowest in the individual merge file (the R 2 RIDs have been previously sorted) .
  • the technique again identifies the lowest identifier between the current merge files so that the records from R can be read in a sequential order. When empty, the merge files will be filled again from the temporary files of any RID which has not yet been processed.
  • the parallel-merge technique in step 113 is explained in further detail in Figure 2.
  • the parallel-merge technique is also shown in an example in Figs. 5A-5E.
  • All the R 2 records are processed sequentially in this manner.
  • the R 2 records do not need to be read immediately after the identifier is selected, rather a secondary buffer or buffers can be used to store a portion of R 2 records, only some of which will be read.
  • the individual records can then be accessed from the secondary buffer.
  • the buffer can be located in the processing computer or can be an external buffer.
  • a secondary buffer storing locations of R 2 records to be read can be used instead of immediate disk access until a sufficient number are present to efficiently access the R 2 data storage medium. This allows the technique to save I/O operations by sequentially reading the R 2 records and ensuring that disk blocks do not need to be read twice and that the disk storing the input table is not randomly accessed.
  • FIG. 2 is a flow chart of the preferred technique for performing a parallel-merge operation on multiple temporary files formed during the slam-join technique shown in step 113 of Figure 1.
  • the parallel- merge technique described is also used in the other embodiments of the invention described herein.
  • the parallel-merge operation can be performed in other ways and the invention is not limited to the use of the technique described in Fig. 2.
  • the parallel-merge process starts with at least two merge files containing R 2 RIDs.
  • the merge files contain the RIDs from corresponding temporary files which are already in ascending sorted order from a prior step.
  • the merge files will be merged in parallel such that the lowest identifier from all the merge files in memory will be selected in order to read the first record located in the second input table being joined.
  • the merge files are preferably located in an in-memory buffer.
  • the merging operation ensures that the access of the second input table will be performed sequentially from the lowest R 2 identifier in all the merge files to the highest R identifie
  • the parallel-merge operation in the present invention uses a conventional min-heap technique for determining the lowest R 2 identifier.
  • Each of the merge files reflect the prior sorting in the temporary files, so that the RIDs will be in ascending order from the lowest value to the highest value.
  • the lowest R 2 identifier for each merge file is placed in a min-heap data structure which organizes the identifiers in a logical tree type fashion with the lowest identifier becoming the root (an example of the min-heap tree is shown in Figs. 5B) .
  • the lowest identifier is used to read a R 2 record into available main memory, it is removed from the merge file.
  • Step 201 in the parallel-merge technique fills each merge file with the R 2 RIDs located in the associated temporary files and corresponding to a particular allocated buffer.
  • the R, records read and processed in the first pass will have an assigned buffer, assigned merge file, and assigned temporary files which contains the sorted R 2 identifiers from the join index corresponding to the read R, records.
  • the merge file can hold more RIDs than allocated R 2 records because of their smaller memory requirements.
  • each buffer is able to contain three records at one time, the corresponding merge file will hold at least three RIDs.
  • the capacity of each buffer is dependent on the size of available main memory and the size of the records.
  • each merge file is filled to its capacity with the contents of corresponding temporary files at the beginning of the parallel-merge operation. Additional RIDs from a temporary file not yet processed will be added to the merge file after the initial RIDs in the merge file have been processed.
  • Step 202 places the lowest RID in each merge file into a min-heap organization.
  • the identifiers do not need to be physically moved in memory but a pointer could be created to point to each of the lowest RID values.
  • the min-heap organization is akin to a logic tree structure. Examples of these structures can be seen in Figs. 5B-5E which describe an example of a join operation of two input tables.
  • Step 203 reads the lowest R 2 record identified by the lowest R 2 RID in the min-heap structure (the base of the tree) and stores it in the corresponding predefined buffer.
  • the buffer which receives the R 2 record will be the buffer corresponding to the merge file which was selected as containing the lowest RID.
  • Step 205 flushes the buffer to which a record has just been written to outside files if necessary.
  • the "flush" process will occur if the buffer is full and cannot hold any more R 2 records. If the buffer is not full or the flush is complete, then the process continues.
  • Step 207 then transfers the next lowest RID from the merge file from which the R 2 record was just read to the node in the min-heap organization which was just identified as the lowest RID processed. If the merge file is then empty it is filled again with the RIDs from its associated temporary file. The temporary file is already in sorted order from previous steps in Fig. l so the lowest RID can be readily accessed. If there are no more RIDs which have not yet been processed in the temporary file of the record just read, then no new RIDs will be placed in the merge file. If there are no RIDs remaining in a merge file, then the buffer and corresponding merge file can be unallocated in available main memory and that memory space reallocated to expand the remaining buffers. This will allow more RIDs to be processed at one time for a given buffer and merge file because of the additional memory space.
  • Step 209 checks if at least one RID remains in any of the identifier merge files. If at least one RID remains, the process continues with step 211. If no RIDs remain in any merge file, the parallel-merge process is near completion and continues with step 213. Step 211 reorganizes the min-heap structure which contains the lowest RID from each of the remaining merge files. The lowest RID in the heap now becomes the base of the tree which identifies it as the low RID. The actual technique of reorganization of a min- heap is well known in the art. The process then jumps to step 203.
  • Step 213 flushes all the buffers if any records remain in any of the buffers and writes the records to an output file which corresponds to the buffer.
  • Some of the buffers may not be completely full at the end of the parallel-merge operation, so they will not be automatically flushed in step 205, and require flushing at this point in the technique.
  • the output files which all correspond to a particular buffer are then logically joined by pointers from the end of the one output file to the beginning of another. Alternatively, the memory address of the beginning of each output files could be stored in a location table.
  • Figures 4A-4D are graphic depictions of an example where the slam-join technique is applied to perform a join operation on two tables in a database system.
  • the data is organized into tables of related data.
  • the example is only illustrative and is not meant to limit the scope of the invention.
  • a join operation using three tables will be described in a later example.
  • Figure 3 shows a first input data table 301 which includes columns of data entitled Student name 301A and their Course number 301B. Also included with table 301 is a column of sequential record identifiers 3OIC which shows the order of the records in this example.
  • a set of numbers 3OIC does not in practice appear with the table 301, but are present only for this explanation.
  • Table 305 is a table to be joined with table 301 and includes columns of data entitled Course number 305A and its Instructor 305B. Also included for this explanation is a column of sequential record identifiers 305C which shows the order of the records in this example.
  • the data contained in the two tables are to be joined together producing a join result only of records of each table which fulfill a desired join condition. For this example, only records from first table 301 which have the same course number as records in second table 305 will be in the join result. Additionally, all columns in first table 301 and second table 305 will be present in the join result so that the join result will have a student column, a course column and an instructor column. It is also possible to select only some of the columns from the input tables to become part of the join result. While each input table contains only nine records, this number is for the ease of this example only. When this example is extrapolated out to actual size contemplated for use, it is expected that thousands, millions, or even billions of records would be processed.
  • the join index 303 shows pairs of related records which will be present in the join result.
  • the join index is considered to be pre-existing for use with the slam-join technique, although a join index can be created if needed and is discussed below (See discussion of Figures 12A-12C) .
  • the join index is ordered by the R, RID value.
  • the join index contains a first column 303A of the record identifiers of the first table 301 records to be joined, a second column 303B of the record identifiers of the second table 305 records to be joined, and a third column 303C included only for this explanation, which is a column of sequential record identifiers 305C - to show the order of the records.
  • Each entry in each record in the join index is called a record identifier.
  • Column 303B is thus made of second table 305's RIDs values.
  • the join index represents matches of the course number of the student from table
  • Figures 4A-4C show some of the intermediate and output results of the slam-join technique applied to the input tables in Fig. 3. This example is shown to help explain the operation of the slam-join technique.
  • the intermediate results are the product of applying the slam-join technique set out in Fig. 1 to the input tables and join index in Fig. 3.
  • Fig. 1 first a portion of the join index and identified R, records are read into the available main memory.
  • main memory As many records and record identifiers from the join index as will fit into available main memory will be read in one pass. Portions of the main memory may be reserved for sorting techniques, storing temporary files or file manipulation and are not considered part of available main memory. In this example of Fig. 4A-4C, available main memory is capable of holding three full records at one time. Additionally, all the records are the same size. The records could be varied in size in actual practice depending on the type of data and number of columns. The size of the record identifiers are not significant in this example because they are much smaller than the records themselves.
  • the first three records of the first input table R, identified in the join index are read from an outside storage disk and stored in column 401 of Fig. 4A in available main memory.
  • the corresponding R 2 identifiers to the R, records related in the join index are also stored in available main memory in column 405 (step 101 of Fig. 1).
  • the storage location of the two columns can be at different places in memory as long as a pointer or location table tracks the beginning of each column.
  • the physical memory addresses of the R] records are then appended to the corresponding R identifiers and are shown in column 403 (step 103 of Fig. 1) .
  • the memory addresses are the actual addresses where each R, record is stored.
  • Logical memory addresses could also be used depending on the system platform organization. Although shown in a table format in this example, the data can be stored in any conventional manner as long as the location of the beginning of each column is stored or has a pointer associated with it.
  • the memory addresses in column 403 are shown as "0x01", “0x02", and "0x03" and could be any addressable memory location in available main memory.
  • Step 105 in Figure 1 when applied to this example sorts the R 2 RIDs in ascending order while also reordering the memory Rj addresses in the sorted order of the R 2 RIDs.
  • Column 407 of Fig. 4A shows the R 2 RIDs in sorted order and column 409 shows the memory addresses in the sorted order of the R 2 RIDs.
  • the third RID in column 407 is now "9" and the third address in column 409 is "0x02".
  • Step 107 in Figure 1 when applied to this example writes the Ri records to an output file 411 of Fig. 4A based on the sorted R 2 RID order. This is most efficiently accomplished by simply writing the R, records to output file 411 sequentially in the order of the sorted memory addresses in column 407. When this is done, the output file 411 is created containing R, records in the order of the sorted R 2 RIDs. The R 2 RIDs are also written to a temporary file 413 for later processing.
  • Figure 4B shows more intermediate and output files which are generated during the operation of the slam- join technique in this example.
  • additional R records not yet processed are read into available main memory (after checking for additional records in step 119 of Fig. 1) . Again, as many records and record identifiers as will fit into available main memory will be read in each pass.
  • three more R records and their corresponding record identifiers are read and stored in available main memory.
  • the next three records identified in the join index are read from R, and stored in column 421 in available main memory.
  • the corresponding R 2 identifiers from the join index are also stored in available main memory in column 425.
  • R, record #6 (“Kirby") has two R 2 RID join index entries and the record will be joined in the final join results with two records from R 2 . All of the R 2 RIDs associated with each R, record are included in the R 2 identifier column 425. The R, record with multiple R 2 RIDs need only be read once from its outside storage location. The memory addresses of the R, records are then appended to the corresponding R 2 RIDs and are shown in column 429. The addresses can be different addresses for those in column 403 of Fig. 4A or can be the same addresses used again because the initial processing of memory addresses for the first pass in Fig. 4A is complete.
  • Step 105 in Figure 1 when applied to this example sorts the R 2 RIDs in ascending order while also reordering the memory R, addresses in the sorted order of the R 2 RIDs.
  • Column 427 shows the R 2 RIDs in sorted order and column 429 shows the memory addresses in the sorted order of the R 2 RIDs.
  • the third RID in column 427 is now "6" and the third address in column 429 is "0x03".
  • Step 107 in Figure 1 when applied to this example writes the R ] records to an output file 431 based on the sorted R 2 RID order. This is efficiently accomplished by simply writing the R, records in the order of the sorted memory addresses in column 427. When this is done, the output file 431 is created containing R, records in the order of the sorted R 2 RIDs. The R 2 RIDs are also written to a temporary file 433 for later processing.
  • Figure 4C shows the intermediate and output files for the final pass of the first input table 301.
  • the final two R, records identified in the join index and their associated R 2 RIDs are read and stored in available main memory.
  • the last two R, records identified in the join index are read from the first input table and placed in column 441 stored in available main memory.
  • the corresponding R 2 identifiers from the join index for each R, record are also stored in available main memory in column 445.
  • the memory addresses of the R, records are then appended to the corresponding R 2 identifier and are shown in column 445 of Fig. 4C.
  • Step 105 in Figure 1 at this point in the example sorts the R 2 RIDs associated with the last two records in ascending order while also reordering the memory R !
  • Step 107 in Figure 1 at the point in the example writes the R, records to an output file 451 based on the sorted R 2 RID order. This is most efficiently accomplished by simply writing the R, records to output file 451 sequentially in the order of the sorted memory addresses in column 447. When this is done, the output file 451 contains R] records in the order of the sorted R 2 RIDs. The R 2 RIDs are also written to a temporary file 453 for later processing.
  • the first phase of processing in the slam-join technique corresponding to steps 101-109 in Figure l is now complete because all RIDs in the join index have been processed.
  • the R] records have been written to output files 411, 431 and 451 which can be logically connected by pointers or a location table to form the equivalent of one output file.
  • the output file is vertically partitioned because the R] records are processed and stored separately from the R 2 records.
  • Figure 5A-5D show the intermediate results of the second phase of the slam-join technique. After all the records from R, are read, processed, and stored in output files, the second phase of reading and processing the R 2 records is performed.
  • a buffer is created for each pass performed in the first stage along with an associated merge file for each buffer (step 111 of Fig. 1) .
  • three passes were required to read in all of the R ] records, therefore three buffers will be created.
  • buffer 507 corresponds to the R
  • records and R 2 RIDs read in the first pass first pass shown in Fig. 4A
  • buffer 509 corresponds to the second pass (second pass shown in Fig. 4B)
  • buffer 511 corresponds to the third and last pass (third pass shown in Fig. 4C) .
  • Merge file 501 (designated by “(a)") is associated with buffer 507 (and thus the first pass) ;
  • Merge file 503 (designated by “(b)") is associated with buffer 509 (and thus the second pass) ;
  • merge file 505 (designated by "(c)") is associated with buffer 511 (and thus the last pass) .
  • each buffer can hold only one record at a time so that when all three buffers are full, three records will be stored in available main memory at one time which is the memory designated maximum capacity in this example.
  • the presorted temporary files containing the R 2 RIDs are located in an outside memory (not in available main memory) are then loaded into the merge files corresponding to each temporary file.
  • each temporary merge file will hold two RIDs at one time.
  • the number of RIDs in a merge file depends on the amount of available main memory. For this explanation, each merge file will hold two RIDs.
  • Each merge file is loaded with the lowest RID numbers from the corresponding temporary file (already in sorted order) so that merge file 501 will contain RID values "1" and "4"; merge file 503 will contain RID values "2" and "5"; and merge file 505 contains RID values "2" and "3" after the initial loading.
  • This loading step corresponds to step 201 in Fig. 2.
  • the remaining RIDs for each merge file are shown outlined in a broken line in Fig. 5A.
  • the empty buffers and filled merge files at this point are shown in Fig. 5A.
  • Figure 5B shows the initial min-heap structure used for identifying the lowest RID value from each of the merge files during the parallel-merge operation (step 113 of Fig. 1) .
  • the min-heap structure is in the form of a tree 520 which has as its root the lowest value RID from the merge files. If any node containing a value in the tree is replaced, the tree will be reorganized and the lowest value will become the root of the tree by a technique well known in the art.
  • node 521 corresponds to the lowest RID value for merge file 501
  • node 523 corresponds to the lowest RID value for merge file 503
  • node 525 corresponds to the lowest RID value for merge file 505.
  • Each node has a pointer or stored value which identifies which merge file the RID value came from. This allows the min-heap technique to identify the buffer in which to place the R 2 record corresponding to the lowest RID value in the min-heap tree when the record is read into available main memory. Placing the RID values in the min-heap structure corresponds to step 202 in Fig. 2.
  • Figure 5C shows the min-heap structure transition when one R 2 record is read which has the lowest RID value remaining.
  • Tree 530 shows the initial tree structure generated at the beginning of the phase (same as Fig. 5B) . Tree 532 then replaces the minimum RID value whose identified R 2 record was just read with the next RID from the corresponding merge file.
  • the value of "1" in node 535 was replaced with the RID value of "4" from merge file 501.
  • the RID value of "4" was the next minimum value in merge file 501.
  • the min-heap tree must be reconfigured to identify the minimum value of all the current nodes.
  • the tree is reconfigured to look like tree 534 using a technique well known in the art.
  • the min-heap tree 534 now contains the lowest remaining RID values for each of the merge files 501, 503 and 505.
  • the root of the min-heap tree, node 536 is the lowest RID value.
  • the process of reading the R 2 record corresponding to the identified lowest remaining RID value is repeated until all the R 2 records identified in the join index (and thus in the temporary files and the merge files) are read from the outside storage medium and written to a proper output file.
  • the corresponding R 2 data can be read immediately from its outside storage medium. However, it is more efficient to store the location of the read requests until a sufficient number of requests are stored and blocks of data can be read together in sequence which saves I/O operations and improves efficiency.
  • Figure 5D shows a representation of the min-heap tree structure when one of the merge files becomes empty because all the RIDs for that merge file have been processed.
  • Tree structure 540 represents an iteration in this example when the minimum RID values for each of the three merge files are "3", "4" and "5" as shown in the Fig. 5D.
  • One of the R 2 RID values is the root in the tree structure indicating that it is the minimum R 2 RID value.
  • the RID value "3" from merge file 505 is the minimum value.
  • FIG. 5E shows a representation of the min-heap tree structure when a second merge file and associated temporary file is now empty. Tree structure 550 represents an iteration in this example when the minimum R 2 RID value for the remaining two merge files are "7" and "9".
  • File 5F shows a graphic representation of the connected output files for both R, and R 2 .
  • File 560 is made up of output files 411, 431 and 451 of the identified R ! records.
  • File group 562 is the connected output files for the identified R 2 records. The contents of each file correspond to the relationship expressed in the join index. While the exact order in the output files is not the same as in the join index, the join relationship is preserved. The output files could be sorted together if a different order is desired.
  • the slam-join technique can be extended to joining three or more tables simultaneously by using either one of two further embodiments, referred to as the parallel-join technique and the multi-slam-join technique. A comparison of the techniques is made in later sections.
  • Figure 6 is a flow chart showing the steps performed in the parallel-join technique for joining three or more tables in one join operation. While the parallel-join technique can be used for joining only two tables, the slam-join technique is the preferred technique for that situation.
  • Step 601 reads into available main memory all the RIDs for the first input table R, from the join index.
  • Step 603 then partitions the Rj RIDs on the RID values into groups based on the number of independent processors available.
  • the file in available main memory associated with each group is called a buffer.
  • An example of a partitioning value is "RID ⁇ 3".
  • the R, records read in sequentially from the outside storage medium will be assigned to the appropriate buffer based on the value of its R, RID.
  • the R, RIDs in the join index are assumed to be in numerical order at the start of the join operation. If they are not, a sort can be performed before the parallel-join operation is implemented to place them in that order.
  • Step 605 reads in the R 1 records for each buffer.
  • the data is actually read in disk blocks which can contain multiple records. If one processor is used, the RID values will be read in ascending order starting with the lowest RID value.
  • the buffers will be processed one at a time. If multiple processors are being used, each of the processors can access the R, input table simultaneously because each RID in each buffer start at a different memory location in the input table.
  • step 607 the results of each filled buffer are written to an output file.
  • the buffers can be processed sequentially or in parallel depending upon the processor capability.
  • Each output file can be connected by either a pointer or location table in order to create the equivalent of one output file for the Rj records selected by the join index. At this point in the technique, the processing for the R, records is complete.
  • Step 609 starts processing the remaining input tables.
  • the tables can be processed either in succession or at the same time, depending upon the processor availability.
  • the flow chart shows each input table being processed on the same processor, but the flow chart could be easily adapted to accommodate multiple processors.
  • Steps 609 to 621 are performed for each additional input table not in sorted order.
  • the first table operated on is R 2 . Similar operations are performed on all other input tables (including the R, table if not in sequential order) .
  • Step 609 partitions the RIDs from the join index for the next input table to be processed into buffers.
  • the partitioning of available main memory will create buffers whose records will all fit into main memory at one time.
  • the partitioning conditions will give an approximately equal number of records to be allocated to each buffer if all records are the same size.
  • the corresponding join index offset is also saved in the buffer for later processing.
  • Step 611 sorts the contents of each buffer by the R 2 RID values in ascending order so that each input table can be read sequentially from the outside storage medium.
  • the original order of the R 2 RID value is preserved in a temporary file also containing the join index offset preferably located in a file outside of main memory. Alternatively, the temporary file could be part of main memory.
  • Step 613 then sequentially reads in each of the selected R 2 records for each partition in turn and places them in the memory buffer.
  • the buffer is written to a temporary output file in the original order of the R 2 RIDs for that buffer (the order is kept in the temporary files) .
  • all the input table's record have been read and stored in temporary output files but are not in the proper order of the relationship established in the join index which matches the desired R 2 records to the saved R, records.
  • Each of the temporary output files must be merged together based on the original order of the join index offset so that appropriate records will be matched. This will allow the R 2 records values to match up with the stored Rj records via the join index.
  • the temporary output files are then parallel- merged based on the R, RID values in a manner described in Figure 2.
  • the parallel-merge operation uses a min- heap tree operation to determine the minimum join index offset value, reads the corresponding R 2 records, replaces that value with the next RID from the identified buffer, and finds the new minimum join index offset.
  • the min-heap operation described in Fig. 2 is the preferred way to perform the minimum identification function, although any other minimum identification technique known in the art can be used.
  • the results of the parallel-merge operation will be stored in an output file or group of connected files in step 619.
  • the resulting output file will correspond to the R, records in the order indicated in the join index.
  • Step 621 checks if there are any more tables to be processed. If there are, then the technique jumps to step 609 and processes the next table in the same manner as for R 2 . If there are no more tables, then the parallel-join technique is completed.
  • the parallel-join technique is particularly well suited to be adapted for the use of parallel processing because each of the input tables are processed completely independently. Furthermore, some of the operations on a single table could be parallel processed on multiple processors especially where the input tables are very large. This technique is an advantage over prior techniques which would simply join one table to the next, and then join the resultant intermediate table to the next table. In the current parallel-join technique, all the input tables can be processed at the same time.
  • Figure 7 shows three input tables to be joined and a join index which shows the relationship of the records for the join result. The tables in Figure 7 will be used as a common example of using the parallel- join and multi-slam-join techniques for three or more tables.
  • Table 701 includes records of student names and their course numbers.
  • Table 703 includes records of course numbers and the associated instructor(s) for each course and the course title.
  • Table 705 includes records of instructors and their office numbers.
  • the join operation to be performed in this example will create a result table with all the instructors from table 703 who have common course numbers with a student in table 701, and who have an office or offices according to table 705.
  • the join result will contain columns for Student, Course, Instructor, Title and
  • join index 707 showing the relationship between the tables for the join result is either pre ⁇ existing before a parallel-merge based technique is performed or created with one of several possible techniques.
  • One such technique for creating a join index can be easily extrapolated from the description of Figure 12.
  • a nested loop for each input table would be used in the technique.
  • Figure 7 shows first table 701 (R,) including a column of data entitled Student names 701A and a column entitled Course number 701B. Also included with table 701 is a column of sequential record identifiers 701C which shows the order of the records in the example. Column 701C does not in practice appear with table 701, but is solely present for this explanation.
  • Second table 703 (R 2 ) is a table to be joined with table 701 and table 705, and includes a column of data entitled Course number 703A, a column entitled Instructor 703B and a column entitled Title 703C. Also included only for this explanation is a column of sequential record identifiers 703D which shows the order of the records in this example.
  • Third table 705 (R 3 ) is to be joined with table 701 and table 703 and includes columns of data entitled Instructor 705A and Office number 705B. Also included for this explanation is a column of sequential record identifiers 705C which show the order of the records in this example. For this example, all columns except duplicates in the three input tables will be present in the join result.
  • the join result will then comprise an output table having a Student name column, a Course column, an
  • Instructor column Only a portion of the columns present in each input table which are not represented in the join index may be selected to be in the join result.
  • first table 701 may have included a column entitled "Grade" which would not be part of the join result.
  • the join index 707 shows the relationships of the corresponding records for each input table which will be present in the join result.
  • the join index 707 contains a first column 707A of RIDs for first table
  • join index equates the course number of the student in first table 701 with the course number of the instructor in second table 703, and the instructor in second table 703 with the instructor in third table 705. While this example shows the RIDs to correspond to positions in the table, in practice RIDs contain a disk block pointer and an offset value of the record within the block.
  • the join index will be used to read in the input tables sequentially and preserve the required relationships for the join result between the input tables.
  • the columns in the join index do not need to be in any sequential order initially to perform the join operation.
  • the input tables and join index are stored in a mass storage medium.
  • a mass storage medium is defined as any storage medium besides the designated main memory and could be located separate from the computer housing or in the same physical structure which contains main memory.
  • the mass storage medium may be a disk.
  • Figures 8A-8D show a graphical example of joining three tables using the parallel-join technique.
  • the example shown in Figs. 8A-8D is not meant to limit the scope of the invention in any way, but is merely provided to help describe the operation of the parallel-join technique when three input tables are joined.
  • the input tables shown in Figure 7 will be joined using the join index in Figure 7.
  • the intermediate results and output files will be shown and explained.
  • FIG 8A shows the intermediate results after processing the first input table 701.
  • the join index 707 contains the RID values for table 701 in numeric order. While a join index will normally be stored in R ] RID ascending order, the join index can be sorted prior to the join operation if required.
  • Step 601 reads in the RID values from the join index for each of the tables to be joined.
  • Step 603 then partitions the RID values into buffers so that all the records in one buffer can be read into available main memory for the input table at one time.
  • Condition 801C includes all RID values which are less than four.
  • Condition 803C includes all RID values which are greater than or equal to four and less than seven.
  • Condition 805C includes all RID values greater than or equal to seven.
  • the conditions distribute the records from the first input table with their RID values into each buffer. The distribution is not exactly equal because some RIDs are repeated in the join index for the first input table
  • Buffer 801 contains a record portion 801B and a
  • Buffer 803 contains a record portion 803B and a RID portion 803A.
  • Buffer 805 contains a record portion 805B and a RID portion 805A.
  • the technique then reads in the records from the first input table 701 and stores the records temporarily in the appropriate buffer based on the fulfillment of the buffer conditions. Since the join index was originally in numerical order for the first input table 701, the R, RIDs for each partition will also be in numerical order. Therefore, the records can be read in sequential order without any further sorting. If a RID is repeated in a particular buffer, the record will only be read once from the first input table 701 and a duplicate record will be stored in the buffer corresponding to any other RIDs. Step 605 of the technique described in Fig.
  • Buffer 801 will contain three RIDs ' and three records ((1,Smith) , (3,Jones) and (3,Jones)) , where the Jones records was only read once and copied to the third record in buffer 801. After a buffer is filled with all the records which fulfill its condition, it is written to an outside output file (corresponding to step 607 in Fig. 6) .
  • the final output file 807 will be one continuous file containing all the selected records identified in the join index after the parallel-join operation is complete.
  • the positional values 808 in the output file 817 are for this explanation only and are not included in the output file itself. Alternatively, a location table can be formed for the same purpose.
  • FIG. 6 shows the intermediate results of processing the second input table 703.
  • the steps in Fig. 6 corresponding to the processing of second input table 703 are steps 609 through 621.
  • the processing for the second input table 703 can be performed completely independently of any other input table. If the parallel-join operation is implemented on independent processors, the join index may have to be read for each processor.
  • the separate processing allows the parallel-merge technique to be well suited for parallel processing applications. However, it is not a requirement to parallel process in order to achieve efficient results.
  • Step 609 partitions the records to be included in the join index from the second input table 703 into three separate buffers, each of which will fit into available main memory alone when the records corresponding to the RIDs from the join index are read into available main memory.
  • Buffer 811 contains records column 811C, second table 703 RID values column 811B, and a join index offset corresponding to first table 701 RID values column 811A.
  • the R j RID values are stored in order to preserve the relationship in the join index between each of the input tables.
  • the buffer condition 811D includes all RIDs with a value less than three.
  • Buffer 813 contains record column 813C, second table RID column 813B, a join index offset corresponding to first table RID column 813A, and buffer condition 813D which includes all RID values greater than or equal to three and less than five.
  • Buffer 815 contains record column 815C, second table RID column 815B, a join index offset corresponding to first table RID column 815A, and buffer condition 815D which includes all RID values greater than five.
  • the three buffer conditions distribute approximately equally the records among the buffers. Buffer 811 is allocated three records, buffer 813 is allocated three records and buffer 815 is allocated the remaining two records.
  • Each of the second table RID columns are stored in an outside file to retain the original order
  • the second table RID values can be stored in temporary files in main memory. Also during step 611, the second table RID values are sorted in order to allow the R 2 records to be read sequentially from an outside storage medium into available main memory.
  • the technique has read all the records identified in a buffer's second table RID column, the contents of the buffer are stored in an output temporary file. The temporary file is then reordered by sorting according to the according to the R, RID values.
  • the R, RID columns are 811A, 813A and 815A. Buffers 811, 813 and 815 in Fig. 8B are shown at this stage in the technique, after all the records are read and the records and RID values are placed in their original R, RID order.
  • the three buffers are then parallel-merged using the min-heap technique explained with Figure 5 based on the values of the join index offset columns 811A, 813A and 815A (using the positioned columns as the merge files in Fig. 5) .
  • the join index offset values are placed in a min-heap tree form to identify the lowest value.
  • the corresponding record is then written from its buffer to its final output file 817.
  • the final output file will be one continuous file or group of files containing the records associated with the positional values in each buffer for the second input table.
  • the positional values 818 in the output file 817 are for explanation purposes only and are not included in the output file itself.
  • Figure 8C shows the intermediate results for the processing of third table 705. Steps 609 through 621 of Fig. 6 are repeated for this third input table.
  • Three buffers are formed and are shown as buffers 821, 823 and 825.
  • Buffer 821 contains record column 821C, third table RID column 821B, join index offset column
  • Buffer 821 contains record column 823C, third table RID column 823B, join index offset column 823A, and buffer condition 823D which includes all RID values greater than or equal to three and less than five.
  • Buffer 825 contains record column 825C, third table RID column 825B, join index offset column 825A, and buffer condition 825D which includes all RID values greater than or equal to five. The three buffer conditions distribute approximately equally the records among the buffers. In this example, buffer 821 is allocated three records, buffer 823 is allocated two records and buffer 825 is allocated three records.
  • the join index offsets and R 3 RIDs are then saved in their original order for later use (corresponds to step 611) .
  • Each buffer is then sorted according to the value of the third table RID values.
  • the sorted RID values allow for sequential reading of the third input table records.
  • the technique then reads in the third input table 707 records for each RID value in each buffer. In this example, each of the buffers will be read in turn starting with the lowest RID values. However, multiple processors could be used to process each partition simultaneously.
  • the records are stored in their original R 3 RID positional number order based on the sorted R, RID order.
  • the partitions in Fig. 8C are shown in their original positional number order. This ensures that the third input table 707 records will correspond to the other records indicated in the join index.
  • the three partition temporary output files are then parallel-merged according to the join index offset preferably using the min-heap technique described with Figure 2.
  • Other techniques well known in the art for identifying the lowest value of multiple values could now be used.
  • the corresponding record is written to the final output file 830 (corresponding to step 619 of Fig. 6) .
  • Output file 830 will contain all the third input table 707 records identified in the join index situated in the order of the join index.
  • the positional numbers 831 are present in file 830 only for explanation purposes and do not appear in the actual file.
  • the output files 807, 817 and 830 are vertically fragmented in that each file is stored separately from the others. They are shown together in Fig. 8D. This allows each input table to be processed separately which reduces the number of times the input records are read and written to output files and allows for better use of parallel processing.
  • the output files are in the exact order of the join index for the parallel-join technique.
  • the second technique for processing three or more input tables will now be described and is called the multi-slam-join technique.
  • This technique uses elements of the slam-join technique described in Fig. 1 with elements of a second technique for joining tables using a join index. That second technique, called jive-join, is described fully in the parent U.S. application No. 08/531,789.
  • FIG 9 is a flow chart showing the steps performed in the multi-slam-join technique for joining multiple tables.
  • the tables will be designated Ri where i is the table number and there are N tables to be joined.
  • Step 901 allocates the available main memory used by the processor into partitions called buffers and assigns conditions for each buffer for tables R 2 to R N _ ! (designated the "intermediate tables"), such that all the unique records fulfilling each buffer condition for each intermediate input table will fit into available main memory at one time later in the technique.
  • the R records will be allocated according to which pass they were read into memory.
  • the last table to be joined will not participate in the initial buffer allocation.
  • the buffers themselves will only hold a portion of the selected records as explained below and all the allocated buffers will exist simultaneously in available main memory in the initial steps of the technique.
  • Main memory is a relatively fast access memory associated with a central processor and is preferably RAM. Available main memory is that portion which is not allocated to another function or to another user in a multi-user computer system.
  • the number of buffers is determined by the total number and size of unique selected records to be read from each input table except the last input table. That number can be easily calculated from the join index.
  • the number of buffers is not related to the records of the last input table.
  • the number of buffers added by processing R will be determined during the join operation. Because the preferred number of input tables is three or more for the multi-slam-join technique, the buffers will be cross-allocated with respect to each input table's records.
  • the number of buffers created will then be the product of the relative size of the records to be read from each input table (except for the last table) to the size of main memory.
  • the records for the first table are not initially part of the buffer allocation, although buffers will be created using the multi-slam- join technique by reading in all the records and RIDs that will fit into available main memory.
  • a group of buffers will be formed called segments (further shown in Fig. 10).
  • a segment consists of all the records which will fulfill one condition for an input table, in this case all the records read in one pass.
  • Each buffer is then assigned conditions for each input table based on the value of the record- identifiers (RIDs) in the join index for tables other than the first and last table.
  • the conditions for the first table are established when the first table records are read into available main memory. There are no conditions for the last input table.
  • the conditions ensure that each group of buffers for a given condition (also called a "segment") in the multi-join (three or more table being joined) case will receive a substantially equal number of participating records from each of the input tables in relation to the other conditions for that particular input table.
  • the conditions are chosen so that the total size of the unique records for a segment fit in main memory.
  • each segment corresponding to that condition will be allocated half the second input table records.
  • the allocation process and the multidimensional buffering will be explained in depth with respect to Figure 10. Examples of performing these allocations are shown in the discussion of Figures 11A-D.
  • Step 903 reads a portion of the RIDs in the join index sequentially into available main memory.
  • the join index identifies related records between all the input tables which will be included in the join result.
  • the join index indicates the record's RID (the RID can be a record's physical or virtual location) of the related records in their tables.
  • the RID combination of (1,3,5) indicates that the first record of the first table is related in the join result to the third record of the second table and to the fifth record of the third table.
  • Step 905 reads disk blocks containing the records from the first table R, selected by the memory-resident portion of the join index into main memory sequentially, and only reads disk blocks containing records present in the join index and therefore needed for the join result. If a record in R, is not in the join index, that record should not be present in the join result, and a disk block is therefore not read into main memory if the records contained therein are not selected records.
  • the records from R are always in sequential order relative to the join index because the join index was ordered based on R,. Sequential and discriminate access of disk blocks helps reduce I/O operations and time costs due to seek/rotational latencies.
  • the R, records and identifiers are read into available main memory until the memory is full.
  • Step 906 sorts the R, partition and join index partition in R N RID order.
  • Step 907 stores the selected records for R, corresponding to the join index in one of the buffers based on the previously calculated conditions for each intermediate input tables R 2 to R N .
  • the join index entries for R 2 ..R N are written to temporary files associated with each buffer.
  • the R : records and R 2 ... R N RIDs are written in R N RID order. For example, if there are four input tables, the records of R, would be categorized based upon their value of the R 2 and R 3 RIDs in the join index. The R 2 , R 3 and R 4 RIDs would be written to temporary files associated with the designated buffer.
  • the data is "flushed" or written to a mass storage disk at a predefined location.
  • the data is organized by individual buffer. The data will be written to contiguous regions of related data in the mass storage disk.
  • mass storage disk again refers to any storage medium which is not RAM.
  • step 903 the technique jumps to step 903 and starts another pass with a new segment for R,. If the join index has been completely read and processed, then the technique goes to step 911.
  • the processing of the R ] records of the join result will be complete and stored in a number of output files equal to the number of buffers.
  • the buffers themselves will now be empty.
  • the buffers can then be unallocated from available main memory and are not required in the remaining steps of the algorithm.
  • the R, output files are then logically linked together by pointers from the end of one file to the beginning of another. Alternate ways of linking files can also be used such as a conventional location table identifying the start of each output file.
  • the order of processing the intermediate tables does not have an effect on the efficiency of the multi-slam-join technique.
  • Steps 911 through 918 will be used to process intermediate tables R 2 through R N .
  • the last table R N will be processed using elements of the slam-join technique.
  • Step 911 reads in turn the temporary files corresponding to a segment, which is a group of temporary files associated with one condition for an input table.
  • the temporary files contain record identifiers from the join index for a given input table.
  • the temporary files are read into memory and each segment (associated group of temporary files) is sorted by the appropriate RID for the given input table.
  • the original temporary files in the segment are preserved for later use.
  • the sort routine places the appropriate RIDs from the temporary file in numerical order.
  • the RIDs can be sorted with any known technique, such as a quick sort routine which is well known in the art.
  • Multiple temporary files associated with one condition for one input table are read together and a single combined sorted version is created (see Fig. 10 and corresponding explanation) .
  • Output files corresponding to each buffer are created on a mass storage medium. For each input table number i, where i is 2 to N-l, R, is then read sequentially into main memory reading only RIDs in the sorted temporary file in step 913. The records of R, are then written to the corresponding output file in the order of the original temporary file(s) in step 915.
  • the original temporary file(s) preserve the order of each R, record RID in the join to produce a consistent ordering of the desired join result.
  • the output files for R are then joined together by pointers connecting the end of each file with the beginning of the next. Thus the join result is vertically fragmented.
  • Steps 916 checks if there are remaining segments (corresponding to a group buffers for R,) which have not been processed. If there are more segments for R lf then the technique jumps to step 911. If all the segments (and thus corresponding buffers) have been processed for R, then the technique continues with step 917. Step 917 checks if there are any more tables to be joined which have not yet been read except for the last input table. If there are no more tables, then the technique jumps to step 920 to process the last remaining input table. If other tables remain to be joined, then i is incremented in step 918 and the temporary files for the next intermediate table are sorted based on the RIDs for that table in the join index and steps 911 through 916 are repeated.
  • Step 922 then performs a parallel-merge operation on all of the sorted temporary files in order to identify the lowest R N RID value to be read from the last input table.
  • the parallel-merge technique used is the min-heap technique described with Fig. 2 or can be another minimum identification technique well known in the art.
  • Figure ll shows an example of the operation extended to multi-slam-join.
  • the parallel-merge operation allows the R N records to be read in sequentially from the lowest value to highest instead of randomly thereby saving extra I/O operations.
  • the R N records which were read are then written to the appropriate output file.
  • the output file for each of the buffers for the R N records are then connected together by either pointers or a location table identifying the memory location of the output files.
  • the order of connection of the output files is the same as the connection order for the output files associated with the other input tables.
  • the final join result will be vertically fragmented for each joined table in that each input table will have a separate output group of files for the join results for that table. Vertical fragmentation allows each record of the input tables to be written out from main memory only once which minimizes the I/O operations.
  • Figure 10 shows a buffering scheme for joining three input tables (R,, R 2 and R 3 ) using the multi-slam- join technique and is displayed as a grid 1000 contain ⁇ ing 20 buffers designated Bl through B20.
  • the records of the first input table and their corresponding R 2 and R 3 RIDs from the join index will be partitioned by the R 2 RIDs.
  • the R, records are grouped according to segments which contain records which are read into available main memory in a single pass.
  • the R 3 records are not conditioned as part of the allocation scheme and are only processed during the final parallel-merge operation in this technique.
  • Along the top of grid 1000 are the determined conditions for sorting the record RIDs of R 2 .
  • the selected conditions in this example are used to help explain the buffering scheme.
  • Condition 1001 is met when the R 2 RID is less than 20.
  • Condition 1003 is met when the R 2 RID is greater than or equal to 20 and less than 45.
  • Condition 1005 is met when the R 2 RID is greater than or equal to 45 and less than 70.
  • Condition 1007 is met when the R 2 RID is greater than or equal to 70.
  • the segments 1010, 1012, 1014, 1016 and 1018 listed on the vertical side of grid 1000 correspond to each of the passes that were required in order to read in all of the Rj records. In this example, five R, segments were required because the selected records from the first input table were approximately five times the size of available main memory. A segment is not allocated by RID condition for the vertical side of grid 100. This is in contrast to the jive-join technique applied to three or more input tables described in the parent application which required allocating conditions for N-l tables when joining N tables in one operation.
  • a join index entry with a value of (1,25,83) being processed in the multi-slam-join technique in this example would read the first record of Rj in the first pass and write it to buffer B2 based upon the values of the R 2 RID in which pass the R, record was read.
  • the first R, record (with an RID of 1) is read in the first pass for this example so will be part of the first segment 1010.
  • the R 2 RID is 25 which fulfills condition 1003 for R 2 by being between 20 and 45.
  • the R 3 RID has no effect on which buffer the R, record and associated temporary files are placed.
  • Each buffer B1-B20 will also have two associated temporary files in which the join index entries for R 2 and R 3 for the selected R, record will be separately stored.
  • a second example is an index value of (50,4,119) which will store the second record of R, in buffer B9 based upon satisfying condition 1001 for R 2 and being in the third pass of R, records being read in this example.
  • Exactly which segment a R, record will be placed is based on the number of records being read and the size of available main memory.
  • the R, records are not placed in a new segment until a previous segment is filled and no available main memory is free to read in more records.
  • a number of temporary files will by read together as a segment, where a segment is defined as all the temporary files related to one condition for a single input table.
  • the temporary files for R 2 associated with Bl, B5, B9, B13, and B17 will be processed together for condition 1001 as one segment which is read in, sorted, and used to read selected R 2 records.
  • segments for R are the R 2 temporary files corresponding to (B5, B6, B7, B8) and (B17, B18, B19, B20) .
  • the R 2 temporary files for each column of buffers on grid 1000 will be read together as one segment to be sorted and used to sequentially read in selected records of R 2 .
  • the R 2 records will then be written to output file(s) based upon the original order of each temporary file associated with R 2 .
  • the R 3 RIDs stored in their temporary files associated with each buffer will be processed using the parallel-merge operation. All of the temporary files which are ordered by the R N RID will be parallel-merged such that the lowest remaining RID will be identified and the corresponding R 3 record will be read and placed in an output file associated with the temporary file (and original buffer) . The order is preserved so that the R 3 records will correctly correspond to the records from the other tables in the manner set out in the join index.
  • the parallel-merging step allows the reduction of calculating allocations and conditions for one less input table from the jive-join technique described in the parent application. Because the applications of these techniques are on very large tables, the reduction in the allocation process is potentially a large savings of processing time. Another benefit is the avoidance of skew which saves processing time.
  • Each of the three input tables are all read sequentially form their output files reducing I/O operations and allowing the input records to be read only once and written only once to the output files.
  • the buffering scheme shown in Fig. 10 can be extended to the case of N tables to be joined by allocating buffers with conditions for each of the input tables except the first and last tables. Thus, when five relations are to be joined, a four dimensional buffer will be allocated.
  • the conditions on RIDs for the three intermediate input tables will be initially calculated for the RIDs of R 2 , R 3 , and R 4 .
  • the remaining dimension of the grid will derive its proper segmentation from the process of reading in the R, records in a number of passes.
  • FIG. 11A-11D show a graphical example of joining three tables using the multi-slam-join technique. The example shown in Figs. 11A-11D is not meant to limit the scope of the invention in any way, but is merely provided to help describe the operation of the multi-slam-join technique when three input tables are joined.
  • Figure 7 shows three input tables to be joined and a join index which shows the relationship of the records for the join result.
  • Table 701 includes records of student names and their course numbers.
  • Table 703 includes records of course numbers and the associated instructor(s) for each course and the course title.
  • Table 705 includes records of instructors and their office numbers.
  • the join operation to be performed in this example will create a result table with all the instructors from table 703 who have common course numbers with a student in table 701, and who have an office or offices according to table 705.
  • the join result will contain columns for Student, Course, Instructor, Title and Office.
  • the join index 707 showing the relationship between the tables for the join result is either pre-existing before the multi- slam-join technique is performed or created with one of several possible techniques.
  • One such technique for creating a join index can be easily extrapolated from the description of Figure 12.
  • a nested loop for each input table would be used in the technique.
  • the input tables are described in detail in connection with Fig. 7.
  • Figure IIA shows a portion of the intermediate results and output files of the multi-slam-join technique applied to the input tables in Fig. 7 to help describe the operation of the technique.
  • the example shown in Figs. IIA - 11D are of the R, records ordered by the R N RIDs as sorted in step 906 of Fig. 9.
  • First, buffers, or memory partitions, are created from the available main memory. The number of buffers is based upon the size of all the unique records indicated in the join index to be joined except for the last table compared to the size of available main memory. The exact calculation for the number of buffers is described in more detail below.
  • the buffer partitions will be allocated according to conditions on second table 1103 RIDs (not the first and last input tables) and the conditions are determined to equally distribute the selected records from the second input table between the segments (groups of related buffers) .
  • each selected record will be the same size.
  • the first table 701 RIDs, the second table 703 RIDs and third table 705 RIDs all appear in the join index 707 which indicates the matching records between the input tables to be included in the join result.
  • the number of partitions is determined to be two for second table 703.
  • the defined condition will equally distribute the R 2 records among its two partitions so that each segment will fit in main memory.
  • the number of segments to be created for the Rj records is based on the number of R, records read and the size of available main memory. Once available main memory is filled to capacity with R, records which matches up a segment, the R, records and corresponding join index entries are sorted by the R 3 RID order and then written to the various buffers. Then another R, segment is created to be filled.
  • the buffers are designated as BUI 1109, BU2 llll, BU3 1113 and BU4 1115.
  • the condition for the first buffer BUI 1109 is defined as the second table 1103 RID is less than or equal to three (shown as condition 1117A) and the R, record is read in the first R, segment (i.e., the first pass) .
  • the condition for the second buffer BU2 llll is defined as the second table 1103 RID is greater than three (shown as condition 1117B) and the R, record is read in the first R, segment.
  • the condition for the third buffer BU3 1113 is defined as the second table 1103 RID is less than or equal to three (shown as condition 1117A) and the R record is read in the second R, segment (i.e., in the second pass) .
  • the buffers are allocated so that the aggregate size of the participating R, and R 2 segments -will be close to but not greater than the size of available main memory, and a segment is defined as all the buffers related to one R 2 condition or one memory- sized partition of R,. In practice, the allocation of the Rj segments does not occur until the R] records are being read.
  • the buffer grid in Fig. IIA is used for a conceptual explanation of the multi-slam-join technique.
  • the next step in the multi-slam-join technique is to read a portion of the join index 707 into the available main memory and to read data blocks containing selected records from the first table 701 only if a record has a corresponding entry in the join index.
  • the join index and the selected records are read concurrently by reading in data blocks containing selected table 701 records which correspond to the join index portion read. During this process of reading in the R, records, the R, segments will be created. After the portion of selected R, records are processed, the next portion of the join index and selected records are read in until the entire join index and selected R, records are processed.
  • the total number of passes (or chunks of Rj records and identifiers which can be read into available main memory at one time) required in this example is two.
  • each selected record is stored in the appropriate buffer based upon the corresponding second table 703 RID and current R[ segment and the conditions associated with each buffer.
  • first table 701 Smith - 101
  • second table 703 RID is one which fulfills the condition 1117A
  • the R] record is being read in the first pass since the record read is the first R, record from the first input table 701.
  • the intersection of the fulfilled condition and current R, segment is the first buffer BUI 1109.
  • Another example is the third record in first table 1101 (Jones - 104) and its corresponding join index entry (3,4,3) which is read into memory and stored in the second buffer BU2 llll because the corresponding second table 1103 RID record is four, fulfilling the condition 1117B, and the correct R, segment is the first segment 1119B.
  • the same R 2 record (Jones - 104) with the join index entry (3,4,4) will also be placed in BU2 llll with a different associated temporary files for the R 3 RIDs.
  • First table 701 is read sequentially from a mass storage disk but a particular data block will only be read into main memory if an entry for a record on that block is included in the join index.
  • the ninth record of first table 701 (Frick - 107) will not be read into main memory because there is no corresponding RID entry in the join index 707 and therefore an I/O operation can be saved.
  • the R, records and join index partition are sorted in R N RID order in main memory. In this example, the two records in first buffer 1109 are reordered and appear in the sorted order.
  • each selected record of first table 701 is written to one of the buffers
  • the corresponding second table 703 RID and third table 705 RID are written to independent temporary files for each input table which are associated with each buffer.
  • the RIDs in each temporary file are stored in the order of the selected last table 705 RIDs.
  • the temporary files 1121 for second table 703 RIDs and temporary file 1122 for third table 705 RIDs are associated with the first buffer 1109; the temporary files 1123 for second table 703
  • RIDs and temporary file 1124 for third table 705 RIDs are associated with the second buffer llll; the temporary files 1125 for second table 703 RIDs and temporary file 1126 for third table 705 RIDs are associated with the third buffer 1113; and the temporary files 1127 for second table 703 RIDs and temporary file 1128 for third table 705 RIDs are associated with the fourth buffer 1115.
  • Temporary files 1121, 1123, 1125 and 1127 as a group contain all the second table 703 RIDs which correspond to the buffers.
  • Temporary files 1122, 1124, 1126 and 1128 as a group contain all the third table 705 RIDs which correspond to the buffers.
  • the selected records from first table 701 After the selected records from first table 701 are read for every Rj RID in the join index, the selected records of first table 701 which are to be included in the join result will have been stored in output files 1130, 1132, 1134 and 1136 after being flushed from available main memory.
  • the output files will then be connected in a particular order. The order is arbitrary as long as all other output files for the other input tables are connected in the same sequence.
  • a pointer at the end of output file 1130 is directed to the start of output file 1132.
  • a pointer from the end of output file 1132 is directed to the beginning of output file 1134.
  • a pointer at the end of output file 1134 is directed to the start of output file 1136.
  • the pointers connect the four output files to be one complete output file for the selected first table records. Alternately, other techniques could be used to link the files.
  • the processing on the first table 701 is complete and the results for the first table 701 are stored in its output files 1130, 1132, 1134 and 1136 which are separate from the data related to other input tables in the join result.
  • Figure IIB shows a portion of the intermediate results of the next steps in the multi-slam-join technique.
  • the multi-slam-join technique continues by reading the temporary files 1121, 1123, 1125 and 1127 corresponding to second table 703 shown in Fig. 7.
  • the temporary files for each input table except the last are read as segments which include all temporary files that satisfy a given condition for that table. For example, there will be two segments for second table 1103.
  • the first includes all records with a second table 1103 RID less than or equal to three (condition 1117A) , which includes temporary files 1121 and 1125.
  • the second segment for second table 703 will include all records with a second table 703 RID greater than three, which includes temporary files 1123 and 1127.
  • Each segment for the second table will therefore include one entire column of temporary files of R 2 RIDs in the representation of the buffers for second table 703.
  • a segment including temporary files 1121 and 1125 is read and sorted in ascending order of the second table 703 RIDs.
  • the duplicate RID values are discarded and the results of the sort are stored in RAM (see file 1138) .
  • the original temporary files 1121 and 1125 are retained.
  • the data blocks containing selected records from second table 703 are then read sequentially into main memory based upon the sorted second table 703 RIDs stored in file 1138.
  • the selected records which are read in this example appear in box 1140. Because the second table 703 RID values have been previously partitioned based on defined conditions 1117A and 1117B, all the RIDs relating to records located in the lowest half of the disk locations for second table 703 located in the mass storage medium are read when the first segment is processed.
  • the selected records are then written in the order of the associated unsorted temporary files to output files to ensure that the corresponding records of first table 701 and second table 703 will be stored in the correct order in their output files.
  • the records from second table 703 must be placed in the unsorted order to preserve their relationship with the first input table and other input tables. If the unsorted temporary file lists a second table 703 RID multiple times, such as where multiple records from the first table are associated with one record from the second table, the selected second table 703 record will be written to the output file in all the places identified by the second table 703 RID.
  • a second segment for the second table 703 made of temporary files 1123 and 1127 is read, sorted by the second table 703 RIDs and duplicate values eliminated; the results of which appear in file 1144 located in RAM (main memory) .
  • the data blocks containing records of second table 703 are now read sequentially starting at the lowest value of second table 703 RID in the segment (the segment includes buffers BU2 llll and BU4 1115) and stored in main memory.
  • the selected records which are read appear in box 1141.
  • the selected records are then written to the output file in the order of the unsorted temporary files 1123 and 1127 associated with the segment being processed.
  • the first table 701's join result files 1130, 1132, 1134 and 1136 sequentially correspond with the second table 703's join result output files 1146, 1148, 1150 and 1152, respectively, and would form a table if the output columns were concatenated.
  • the second record in output file 1130 corresponds to the second record in output file 1146 and the same is true throughout the output files.
  • Fig. IIC shows the intermediate results of processing third table 705, the last table to be processed.
  • the third table 705 RID values have not been partitioned and will be proceed using the parallel-merge technique described with Figure 2.
  • Each of the temporary files for the third table 705 RID's 1122, 1124, 1126 and 1128 are already sorted in numerical RID order for each temporary file.
  • the temporary files are shown in Fig. IIC collectively as file group 1101.
  • the individual file 1109 corresponds to temporary file 1122
  • file 1110 corresponds to temporary file 1124
  • file llll corresponds to temporary file 1126
  • file 1112 corresponds to temporary file 1128.
  • the RIDs from the temporary files would be placed in merge files in available main memory as described with Figure 2. In the example, only the temporary files will be referred to.
  • the lowest value RID from each individual file is then placed in a min- heap structure used to efficiently identify the lowest value RIDs in all temporary files.
  • the initial min- heap tree structure which was described in detail in
  • Figs. 2 and 5 is shown applied to this example as min- heap tree 1102.
  • the min-heap tree 1102 contains nodes corresponding to the lowest value in each of the sorted temporary files. If a temporary file contains no RID values, then there will be no corresponding node in tree 1102. In this example, node 1161 corresponds to temporary file 1128, node 1162 corresponds to temporary file 1109, node 1163 corresponds to temporary file llll and node 1164 corresponds to temporary file 1110.
  • the identified lowest RID value is then read from third table 705 from an outside disk into available main memory and placed in a buffer (allocated memory partition) associated with the temporary file whose RID value was identified as the smallest.
  • the identified minimum node is then removed from the min-heap tree structure and replaced with the next value in the sorted temporary file of the just identified minimum RIDs, if any.
  • the tree structure is then reorganized to find the next minimum value. The process continues in this manner until all the RID values which correspond to a third table 705 record have been processed and all the identified records have been read.
  • the buffer will be "flushed" to an outside disk. The order of the records is important because the third table 705 records must correspond to the other two input tables in accordance with the join index.
  • output files for the buffers corresponding to the third table 705 temporary files are shown in file group 1103. Specifically, output file 1169 corresponds to temporary file 1122, output file 1171 corresponds to temporary file 1124, output file 1173 corresponds to temporary file 1126 and output file 1175 corresponds to temporary file 1128.
  • the output files are connected in the same manner as the output files for the other input tables.
  • a pointer is located at the end of output file 1169 directed to the beginning of the output file 1171. Pointers connecting the other files create the equivalent of a single file for the join result of third table 705.
  • the first table 701 and second table 703 join result files sequentially correspond with the third table 705 join result files and would form a table consistent with the join index if the output columns were concatenated.
  • first table 701, second table 703 and third table 705 are stored separately to enable each input table to be processed separately. These separated files are vertical fragments of the join result. Vertical fragmentation allows a reduced number of I/O transfers by only once reading and writing first table 701, second table 703 and third table 705 and not requiring other I/O accesses to construct a join result.
  • Join result tables which are vertically fragmented by columns can be easily combined for random access when generating an index for the join result. For sequential processing of the join result, there is only an insignificant overhead of having several vertical partitions compared with a non-partitioned representation.
  • Fig. 11D The join result of applying the multi-slam-join technique to the three input tables and join index in Fig. 7 is shown in Fig. 11D which consists of the vertically fragmented output files 1180, 1181 and 1182 where file 1180 contains the join results for the first table, file 1181 contains the join result for the second table and file 1182 contains the join result for the third table.
  • Each of these output files are made up of their four individual output files logically connected by pointers or connected using an alternate technique.
  • the first table to be joined in the multi-slam- join technique should preferably be the smallest table because the number of segments for the first table is defined by how many passes are necessary to read in all of the first table records identified in the join index.
  • Minimizing the number of first table segments is a efficient way to minimize buffer allocations and temporary file manipulations.
  • the last table should preferably be the largest input table because the buffering scheme does not rely on the size of the records for the last table.
  • the last table to be joined avoids the need for an allocation step and condition generation for that last table.
  • the buffer allocation allows the multi-slam- join technique to be more efficient in processing input tables because of reduced I/O operations than if input tables were simply joined together one at a time.
  • the present technique allows for fewer I/O operations for each of the input tables because the buffering structure allows each input table to be read sequentially and the output is vertically fragmented.
  • Figs. 12A-12C graphically illustrate the steps for generating a join index for two input tables if a join index is not present.
  • a join index of some type is required to implement the slam-join, multi-slam-join, or parallel-join techniques.
  • FIG. 12A shows two tables from which a join index is to be formed by matching the course numbers.
  • Table 1201 contains four records of course numbers and student names.
  • Table 1203 contains six records of course numbers and instructors. The sequential numbers next to the tables indicate the record identifiers.
  • each record of table 1201 in the Course number column will be compared against each record of table 1203 in its Course number column, and if a match occurs, the relationship will be saved in the join index.
  • Figure 12B shows a flow chart of the method for generating the join index using the nested-loop technique.
  • Step 1205 sets a record indicator for table 1201 and a record indicator for table 1203 both to predefined location of the first record in their respective tables.
  • Step 1207 reads in a record of table 1201 which is the first record of the first table 1201 in the beginning of this process.
  • Step 1209 reads in a record of table 1203, which is the first record of table 1203 in the beginning of this process.
  • step 1211 the records read in steps 1207 and 1209 are compared. If the records are equivalent and match, the record identifiers of the two records are written to the join index.
  • Step 1213 checks if there are any more records from table 1203 which have not been read for the current table 1201 record. If there are no more records, the process goes to step 1215. If there are more records in table 1203, then the process jumps to step 1209 to read the next record from table 1203. Step 1215 checks if there are any more records from table 1201 which have not yet been read. If there are more records, table 1203 is reset to the beginning in step 1219. The reset operation can be performed by moving a pointer back to the first record in table 1203. The process then jumps to step 1207 and reads the next record from table 1201. If in step 1215, there are no more records to be read in table 1201, the process is finished. Thus, each record of table 1201 is compared to each record of table 1203.
  • join index a table of record identifiers showing the pairs that would be joined if a join operation was performed using the chosen join column.
  • join index 1230 The join index produced for the input tables in Fig. 12A for matching course numbers is shown in Fig. 12C as join index 1230.
  • Figure 13 is a graphical representation of tables stored in an object-oriented database.
  • the slam-join, multi-slam-join and parallel-join techniques are applicable to object-oriented databases.
  • the previous examples described with Figures 4, 5, 8 and 11 correspond to relational databases where data is stored and indexed relative to other related data.
  • record identifiers are kept within other records to identify the relationships of data.
  • Data related in subject matter are kept as groups of data, which is an equivalent to a table of data in a relational database. If a separate join index relating one group of data to another is present in the database, then the slam-join, multi-slam-join and parallel-join techniques work the same way as described previously.
  • Figure 13 shows a variation of a join index that may occur in object-oriented databases.
  • the join index becomes part of the first group of data as an additional column in the group and contains pointers to a second group of data from which a join result will be formed.
  • first table 1301 contains a first data column 1303 of Student names and a second data column 1307 of Index entries pointing to second table 1309 which are associated with the records of table 1301.
  • the second table 1309 contains a first data column 1311 of Course numbers and a second data column 1313 of Instructors.
  • the desired join result is a table including the student name, course number, and instructor.
  • the index numbers in index 1307 point to the related entry in table 1309 which has a matching course number and one or more instructors.
  • the slam-join, multi-slam-join and parallel-join techniques are well suited to object-oriented databases because the join index does not need to be read in first in order to allocate the available main memory as in the jive-join technique described in the parent application.
  • the parallel-merge type techniques as many records from table 1301 with their pointers as will fit in available main memory will be read at one time and then sorted by the embedded RIDs for the second table. The RIDs for second table 1309 will then be parallel-merged in the manner described in Fig. 2.
  • the partitions for the first input table are created by the number of passes required to read in the first input table and therefore the operation of performing the slam-join operations in an object-oriented system should be the same as in a relational database system.
  • the multi-slam-join technique will require slight modification in the object-oriented system because the middle input tables (not the first and last input tables) will require partitioning prior to the join operation.
  • the parallel-join technique requires the same modification.
  • a join index could be constructed for the middle input tables. However, in the object- oriented world, it may not be desirable to read in all of the middle input tables in order to obtain the join index relationship. Instead the buffers can be divided evenly to approximate the output distribution.
  • Figure 14 is a graphical representation of a local selection which can be incorporated into the slam-join, multi-slam-join or parallel-join techniques.
  • a local selection chooses a subset of a table to be in a join result.
  • Table 1401 is an example of a table upon which a local selection can be performed.
  • Local index 1403 is a table with the RIDs of the records which satisfy the local selection condition.
  • Symbol 1402 indicates a local selection will be performed on table 1401 using local index 1403.
  • Table 1405 is the selection result of table 1401 with index table 1403. Only the first, third and fifth record are present in the result.
  • a local selection can be performed before the operation of the slam-join, multi-slam-join or parallel join techniques to create a smaller input table to be processed.
  • the join index would thus be built on the smaller input table.
  • a local selection can be performed after the join result in order to locally select the desired records in the entire join result. If the join index already exists, the local selection could be performed on the join index itself, instead of on the input or output tables.
  • the join index is often smaller than the input table and it would take less processing time to locally select the join index.
  • the local selection of the join index would select particular RIDs to be part of the join result. If the join index had not yet been formed, the local selection could also be a part of the formation of the join index.
  • An additional way to perform an efficient local selection on the first table with the slam-join and multi-slam-join techniques is to check the first-table record against a local condition when the technique reads in records from the first table records. If the local condition is not satisfied, the record will be discarded.
  • a local condition is all student names starting with the letter "K" in a student data group.
  • Another example of a local condition is all offices on the fifth floor or above in a instructor's office data group.
  • Figure 15 is a graphical representation of a hybrid join index which can be used with the slam-join, multi-slam join and parallel-join techniques.
  • table 1501 represents a large input table with many columns (data A contains multiple columns) which is to be joined with tables 1503 and 1505.
  • Table 1505 is also a very large table with many columns (data B contains multiple columns) .
  • Table 1503 is a smaller table with a second column whose size (in bytes) is relatively small. In this example, table 1503 has one column of data that is not present in the other tables.
  • the hybrid join index 1507 When the hybrid join index 1507 is created, the actual data stored in table 1503 will be stored in the second column of join index, rather than its RID value. To create a hybrid join index if one does not already exist, a technique similar to that described with Figure 12 could be used. Because the data is small in table 1503, the join index will not become too large in size to create memory management problems. When a large table 1501 or 1505 is now processed, the join index column for table 1503 will simply be appended to the records of one of those tables. Thus, if the data was appended to table 1501, the output file for table 1501 would appear as output file 1509 which includes the data from table 1503. Output file 1511 would be the vertically fragmented output for table 1505.
  • join index By using a hybrid join index, I/O operations are saved for processing the data from the input table stored in the join index.
  • the slam-join, multi-slam- join and parallel-join techniques could test the first entries in the join index for their data type or read parameter data stored in a predetermined location to determine if a hybrid join index exists.
  • Figure 16 shows a hardware configuration upon which the slam-join, multi-slam-join and parallel-join techniques can be implemented.
  • Computer 1601 includes a central processing unit (CPU) 1603, a main memory 1605, a mass storage memory system 1620, other components 1607 necessary to operate computer 1601, and a databus 1609 which is coupled to each component in the computer to allow data transfer.
  • the databus 1609 may be any physical conduit of data and may actually comprise multiple databuses connected together.
  • Main memory 1605 could include a RAM or banks of RAM.
  • the CPU 1603 executes a stored program routine containing computer instructions to perform the slam-join, multi- slam-join or parallel-join techniques as described in conjunction with Figures 1 through 15.
  • Mass storage memory system 1620 includes an input memory portion 1611 for storing input data and an output memory portion 1615 for storing output data.
  • the two portions 1611 and 1615 can be part of the same physical memory. Alternatively, the two portions could be physically separated and each could be located outside the computer 1601 as external storage mediums.
  • Mass storage memory system 1620 can comprise multiple storage mediums connected together, such as multiple disks or magnetic tapes, when the amount of input data requires vast storage. Each input table could be stored on separate storage mediums.
  • Mass storage memory system 1620 could include any type of storage medium which is capable of storing a large amount of data. Some examples of storage devices which may be used in storage system 1620 include magnetic tapes, magnetic disks, optical disks, compact disks, or any other storage medium for large amounts of information.
  • Input memory portion 1611 contains at least two input tables to be joined.
  • Connector 1613 connects the computer 1601 to input memory portion 1611.
  • Processor 1603 may contain multiple CPUs to further enhance the efficiency of the operation by distributing the processing operation for each of the input tables between the processors. The distributed processing is particularly efficient in the parallel-join technique for multiple input tables because each table can be processed completely independently.
  • Output memory portion 1615 contains the results of the join operation.
  • Connector 1614 couples disk 1615 to computer 1601 and allows data to be transferred to and from computer 1601 to disk 1615.
  • Connector 1614 may be the same connector as connector 1613 which transfers data to and from the input and output memory portions if those portions are inter-connected.
  • the output files containing the join results for each individual input tables are stored separately in output memory portion 1615.
  • the separate storage locations are due to the vertical fragmentation used by the slam- join, multi-slam-join and parallel-join techniques.
  • Also located in output memory portion 1615 are the temporary files which are used in the slam-join technique and contain the join index RIDs for each input table except for the first table being processed. Alternatively, the temporary files could be stored in main memory 1605.
  • multiple computers could be used to process discrete portions of the data such as the records for individual input tables.
  • the slam-join, multi-slam-join and parallel-join techniques are easily adaptable to a distributed system.
  • the parallel-join technique is especially easy to implement on a distributed system because each input table can be processed individually. If a computer 601 had multiple users, the available main memory would include only that portion of main memory which was allocated to the particular user performing a parallel-merge type technique.
  • main memory 1615 is preferably smaller than at least one of the input tables stored in input memory portion 1611. This configuration will allow the slam- join, multi-slam-join and parallel-join techniques to save many I/O operations over other techniques.
  • Each input table is read in blocks from the input memory portion 1611, processed by the CPU 1603, and written to the output memory portion 1615.
  • Each input table is processed in this way and temporary files are created for each buffer in accordance with the parallel-merge type techniques.
  • Temporary files, if stored in the output memory portion 1615, are read back from the output memory portion 1615 to the CPU 1603 in order to sort them and sequentially read the remaining input tables.
  • the slam-join and multi-slam-join techniques allow each input table to be read only once from the input memory portion 911 for the entire join operation.
  • the selected records from each input table are written only once to the output memory portion 915.
  • This key feature of saving I/O cycles is accomplished by the merging of the RIDs to enable sequential reading of the input tables and the vertical fragmentation which allows each input table to be processed individually.
  • Other techniques discussed in the background section could not reduce the I/O cycles to this low level.
  • the parallel-join technique requires reading and writing each record in all but the first table to an output file one additional time. However, extra processing time can be saved if a distributed processor system is used.
  • the total size of the unique records from each input table (other than the first and last) that are mentioned in the join index should be compared to the size of the available main memory.
  • the size of the join index indicates the exact size of the join result, one record for each join index entry.
  • the same input table record may be indicated multiple times in a join index but only required to be read once. Therefore, the number of buffers is related to the number of unique input table records indicated in the join index.
  • the number of intermediate input table unique records is five times the size of the available main memory for a given input table, then five segments will be created and an allocation condition will be derived to ensure that each segment will have an associated output file of substantially equal size after the partitioning operation is done.
  • the number of buffers for the entire join operation will depend on the number of unique join index entries for each table other than the last in the join index.
  • the segmentation for the first input table will depend on the number of passes that are required to read in all of the selected first table records. The size of the last table to be joined is not a factor in allocating the buffers.
  • a segment as described above in accordance with Fig. 10, must fit into the main memory rather than a single buffer.
  • a segment for the first table would contain three buffers while a segment for the second table would contain four buffers.
  • No new buffers are created for the third input table. The buffers are created based on the size of available main memory, as some of the memory may be reserved for efficient sorting techniques for representing a portion of the join index or other purposes.
  • the partitioning values, or conditions as defined above, for each buffer associated with intermediate input tables can be defined exactly in order to substantially equally distribute the final join result records between the buffers by reading the join index and determining the conditions based on the RID values in the join index.
  • the relative size of the records can be taken into account. Since the join index lists all the RIDs of the records to be present in the join result for each input table, the list of unique RIDs can be apportioned almost exactly for all the buffers (or segments) .
  • a pre-processing step may be performed to read in the join index and calculate the partitioning conditions in this manner. Alternatively, if the join index must be computed, the partitioning values can be developed at the same time the join index is created.
  • any skew created by processing the tables can be prevented. If the buffers were simply partitioned on the tables entries evenly without consulting the join index RID values, skew would occur because not every record will be read, but only the selected records that appear in the join index. Some index RIDs may also appear numerous times on the join index when there are multiple matches.
  • An alternative way to determine the conditions for the intermediate input tables is to sample a portion of the join index. Given that the input tables are very large for optimal use, the sample would only need to be a small portion of the entire table to statistically calculate an allocation that is close to the optimal. Sampling the join index would save both I/O and processing operations.
  • the multi-slam-join technique does have some memory size requirements for processing its intermediate input tables in order to achieve optimal performance during operation.
  • Optimal operation of multi-slam-join does require a particular size of main memory for a given size of input tables to be processed. The requirement is due to the fact that a buffer or segment (made of multiple buffers) for the intermediate input tables (not the first and last) must fit into main memory all at once for processing.
  • the required main memory can be derived with the following equations:
  • the step of the technique of memory allocation and processing the middle input tables which must fit into main memory yields the equation: y * (x + ( (r-l) * v) ) ⁇ m
  • y the number of disk buffers
  • x length of an output-file sequence in a disk buffer
  • r number of tables to be joined
  • v length of temporary file sequence in a disk buffer
  • m size of main memory (in blocks) .
  • the table exhibits that the input tables can be very large compared to the size of the main memory. 128 MB is a typical size of RAM in a work-station. Larger input tables can be processed if the size of main memory increases. Smaller amounts of RAM can used to process smaller input tables. Since only selected records are read from each input table, the total size of the input table can be even greater since only the selected records will be read and processed. The size of the last input table is not constrained by memory because of the operation of the multi-slam-join technique.
  • the slam-join technique also has some memory size requirements in order to achieve optimal performance during operation. Optimal operation of the slam-join technique does require a particular size of main memory for a given size of input tables to be processed. The step of the technique of reading in as much as the first input table and join index as will fit into main memory yields the equation:
  • J the size of the join index
  • T the proportion of the records in table i that participate in the join (proportion of selected records)
  • R the number of blocks in the second table
  • y the number of buffers
  • m size of main memory.
  • slam-join has some advantages over its jive-join counterpart.
  • slam-join maximally utilizes memory in its initial phase independent of skew because it does not require an allocation pre-processing step. Instead it reads in as much of the input table and join index as will fit into memory. Additionally, slam-join is better able to adapt available main memory when a buffer partition has been completely processed. The allocated memory for the completed buffer is reallocated to the remaining buffers.
  • the slam-join technique is more efficient when the second input table is larger than the first due to the parallel-merge technique and the memory constraints.
  • the jive-join technique is more efficient when the first input table is larger than the second due to the allocation scheme which is independent of the size of the first table. If the two tables are the same size, slam-join is preferable for the reasons listed above of skew handling and memory adaptiveness.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Technique permettant de réaliser une association de plusieurs grands tableaux d'un système de base de données en utilisant un processeur mettant en ÷uvre une petite mémoire principale. Cette technique utilise un indice d'association (103 et 303) et minimise le nombre d'opérations d'entrées/sorties tout en maximisant l'utilisation d'une petite mémoire principale par un processus d'affectation de tampon (111 et 901). Les trois modes de réalisation utilisent tous l'opération fusion parallèle. La première technique d'association par juxtaposition associe deux tableaux sans affectation préalable de tampons tels que les tampons (111 et 112) qui sont utilisés ultérieurement dans le procédé. La deuxième technique d'association par juxtaposition multiples associe au moins trois tableaux, cette technique s'ajoutant à la technique de fusion parallèle qui subdivise la mémoire. La troisième technique d'association parallèle traite chaque tableau d'entrée de manière totalement indépendante. Cette technique de fusion parallèle identifie la valeur la plus basse parmi plusieurs fichiers et fait un classement par valeurs croissantes. Ceci permet de lire séquentiellement les fichiers d'entrée en économisant les opérations E/S.
PCT/US1996/015221 1995-09-21 1996-09-19 Realisation d'operations d'association de grands tableaux WO1997011433A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US08/531,789 US5666525A (en) 1995-09-21 1995-09-21 System and method for performing an efficient join operation on large tables with a small main memory
US08/531,789 1995-09-21
US08/632,958 1996-04-16
US08/632,958 US5802357A (en) 1995-09-21 1996-04-16 System and method for performing an efficient join operation

Publications (1)

Publication Number Publication Date
WO1997011433A1 true WO1997011433A1 (fr) 1997-03-27

Family

ID=27063639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/015221 WO1997011433A1 (fr) 1995-09-21 1996-09-19 Realisation d'operations d'association de grands tableaux

Country Status (1)

Country Link
WO (1) WO1997011433A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3660694A1 (fr) * 2018-11-29 2020-06-03 Zizo Software Limited Procédé de structuration et d'interrogation d'une base de données
US11016978B2 (en) 2019-09-18 2021-05-25 Bank Of America Corporation Joiner for distributed databases
US11126401B2 (en) 2019-09-18 2021-09-21 Bank Of America Corporation Pluggable sorting for distributed databases
CN114528127A (zh) * 2022-03-31 2022-05-24 Oppo广东移动通信有限公司 数据处理方法、装置、存储介质及电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5121494A (en) * 1989-10-05 1992-06-09 Ibm Corporation Joining two database relations on a common field in a parallel relational database field
US5241648A (en) * 1990-02-13 1993-08-31 International Business Machines Corporation Hybrid technique for joining tables
US5408654A (en) * 1992-05-27 1995-04-18 Cdb Software, Inc. Method to reorganize an index file without sorting by changing the physical order of pages to match the logical order determined from the index structure
US5557791A (en) * 1991-08-23 1996-09-17 International Business Machines Corporation Outer join operations using responsibility regions assigned to inner tables in a relational database

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5121494A (en) * 1989-10-05 1992-06-09 Ibm Corporation Joining two database relations on a common field in a parallel relational database field
US5241648A (en) * 1990-02-13 1993-08-31 International Business Machines Corporation Hybrid technique for joining tables
US5557791A (en) * 1991-08-23 1996-09-17 International Business Machines Corporation Outer join operations using responsibility regions assigned to inner tables in a relational database
US5408654A (en) * 1992-05-27 1995-04-18 Cdb Software, Inc. Method to reorganize an index file without sorting by changing the physical order of pages to match the logical order determined from the index structure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ACM COMPUTING SURVEYS, March 1992, Vol. 24, No. 1, MISHRA et al., "Join Processing in Relational Databases", pages 63-113. *
ACM TRANSACTIONS ON DATABASE SYSTEMS, June 1987, Vol. 12, No. 2, VALDURIEZ, P., "Join Indices", pages 218-246. *
PROCEEDINGS FIFTH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, February 1989, KITSUREGAWA et al., "Join Strategies on KD-Tree Indexed Relations", pages 85-93. *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3660694A1 (fr) * 2018-11-29 2020-06-03 Zizo Software Limited Procédé de structuration et d'interrogation d'une base de données
US11016978B2 (en) 2019-09-18 2021-05-25 Bank Of America Corporation Joiner for distributed databases
US11126401B2 (en) 2019-09-18 2021-09-21 Bank Of America Corporation Pluggable sorting for distributed databases
CN114528127A (zh) * 2022-03-31 2022-05-24 Oppo广东移动通信有限公司 数据处理方法、装置、存储介质及电子设备

Similar Documents

Publication Publication Date Title
US5802357A (en) System and method for performing an efficient join operation
US5983215A (en) System and method for performing joins and self-joins in a database system
US5649181A (en) Method and apparatus for indexing database columns with bit vectors
US5710915A (en) Method for accelerating access to a database clustered partitioning
US7213025B2 (en) Partitioned database system
US5544357A (en) Database accelerator
US5222235A (en) Databases system for permitting concurrent indexing and reloading of data by early simulating the reload process to determine final locations of the data
Battle et al. Dynamic reduction of query result sets for interactive visualizaton
US7454403B2 (en) Method and mechanism of improving performance of database query language statements using data duplication information
US8099440B2 (en) Method for laying out fields in a database in a hybrid of row-wise and column-wise ordering
US5870743A (en) Method and apparatus for parallelizing operations that create a table
US7765211B2 (en) System and method for space management of multidimensionally clustered tables
Chen et al. Efficient organization and access of multi-dimensional datasets on tertiary storage systems
WO2006046669A1 (fr) Dispositif méthode et programme de gestion de base de données
JPH09212528A (ja) データベースを記憶する方法、データベースからレコードを検索する方法、および、データベース記憶/検索システム
US5873091A (en) System for data structure loading with concurrent statistical analysis
CN102362273A (zh) 用于关系数据库系统中高效数据存取的动态散列表
Liou et al. Multi-dimensional clustering for data base organizations
US20230103328A1 (en) Data compression techniques
Nodine et al. Large-scale sorting in parallel memories
US7035851B1 (en) Reorganization of database records using partition merging
CN111459884B (zh) 一种数据的处理方法、装置、计算机设备和存储介质
US6895487B2 (en) Methods for intra-partition parallelism for inserts
US6622141B2 (en) Bulk loading method for a high-dimensional index structure
JP4673299B2 (ja) 情報処理方法及び情報処理システム

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA