CN102609490B

CN102609490B - Column-storage-oriented B+ tree index method for DWMS (data warehouse management system)

Info

Publication number: CN102609490B
Application number: CN201210019935.5A
Authority: CN
Inventors: 夏小玲; 乐嘉锦; 王梅; 李晔锋
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2012-01-20
Filing date: 2012-01-20
Publication date: 2014-07-02
Anticipated expiration: 2032-01-20
Also published as: CN102609490A

Abstract

The invention relates to a column-storage-oriented B+ tree index method for a DWMS (data warehouse management system). The column-storage-oriented B+ tree index method is characterized by comprising a first step, generating column data; a second step, turning to a fourth step for building if a B+ tree keyword is a row number, and turning to a third step for sorting if the B+ tree keyword is not the row number; the third step, sorting column value data by means of using multi-line merging with heapsort; the fourth step, initiating a B+ tree index; a fifth step, creating leaf nodes; and a sixth step, generating a middle nodes an a bottom-up manner. The column-storage-oriented B+ tree index method for the DWMS is used for column storage, and has the advantages that 1), the number of layers of a B+ tree is the smallest, and the number of searching is reduced; and 2), a traditional plug-in method for building the B+ tree is abandoned, and the bottom-up B+ tree creating method is utilized. When the method is used, division operation is omitted, and a lot of expense is reduced.

Description

A kind of B+ tree indexing means towards row storage DWMS

Technical field

The present invention relates to a kind of tree of the B+ towards row storage DWMS (Data Warehouse Management System) index technology.

Background technology

The high development of internet, applications, high-throughput and large buffer memory become the nowadays essential feature of database product, day by day urgent for issued transaction and the high performance requirement of query analysis.Traditional row stored data base can not be for business decision provides efficient query analysis as processing transactions application, the database schema of row storage is in recent years by re-examine, for the inquiry type work of reading to optimize in data warehouse or in analysis support application, row storage list reveals than row to be stored and has performance more significantly.Because relation table is in external performance, be still logical line, to be therefore connected with multilist be to be listed as the key factor that affects query performance in storing in tuple reconstruct.Index is one of important technology improving search efficiency.B+ tree index can keep data to store in order, and have advantages of allow to search, sequential access, insertion and deletion, make it in transaction environment, become the most popular index structure of Database Systems.

In traditional Database Systems, the variation of the B+ tree construction that data insertion and modification bring is frequently very large.In order to reduce the possibility that causes node split when data are inserted and revised, the node of B+ tree is not what fill up.But in data warehouse, almost do not have data to insert and retouching operation.Traditional B+tree is applied in the data warehouse of mass data storage and will causes the waste in space.Meanwhile, because node is not filled completely, data storage needs more node, and this will increase the height of B+ tree index, reduces the search efficiency of data.

Summary of the invention

The object of this invention is to provide a kind of B+ tree index that is applied to row storage DWMS, overcome the limitation of traditional B+tree index, improve the search efficiency of data.

In order to achieve the above object, technical scheme of the present invention has been to provide a kind of B+ tree indexing means towards row storage DWMS, it is characterized in that, step is:

Step 1, column data generate: import user data, the original data by row storage are divided vertically into single-row, for each item number of each row is according to according to the line number at its place, add the line number item of rebuilding for tuple, form two tuple (line numbers, train value), request for data section, is kept at the every train value data that newly produce in a data segment;

If step 2 B+ tree key word is line number data, turn to step 4 to create; If B+ tree key word is train value data, goes to step 3 and first train value is sorted;

Step 3, multiway merge and heapsort are used in combination train value data are sorted;

Step 4, the initialization of B+ tree index;

Step 5, establishment B+ leaf node: the leafy node of application B+ tree, data item is filled directly into leafy node and obtains data block, form the 0th layer, B+ tree;

The feature of step 5 is:

According to row storage characteristics, direct save data information in leaf piece, but not the pointer of key assignments and sensing data block.In search procedure, can directly obtain data by leafy node like this, reduce one time I/O; When node is filled, no longer consider sparse coefficient, the each node in tree fills up, and has adopted the principle of " key for searching number is consistent with pointer number ", room for promotion utilization factor.

Step 6, generation intermediate node: iteration is set up the middle layer node of B+ tree until whole B+ tree creates end from the bottom to top.

Preferably, described step 3 comprises:

Step 3.1, initialization: in internal memory, apply for a sequence district, its size is designated as K, and making K is the integral multiple of block size.If pending data is counted Blk_num > K according to piece, the method for multiway merge external sort will be adopted.The size of M while calculating multiway merge, sub-list number when M is merger, order

in each sub-list for the treatment of merger, piece is counted D=Blk_num/M.

Source data in step 3.2, the section of reading, the train value Data Division in data block two tuples out, packs dataitem array into, puts into sequence district.Use Heap algorithm using dataitem array as input parameter, dataitem array is sorted; If Blk_num is <=K, sequence finishes; Otherwise, sorted data item is reassembled into piece, write back in interim section.

Step 3.3, the interim section of M is carried out to merge sort.

Preferably, described step 6 comprises:

Step 6.1, generating indexes item, the tlv triple that first train value, the line number of correspondence and the piece of this piece number that index entry is served as reasons in lower one deck piece forms;

Step 6.2, judge that train value type is fixed-length data or elongated data, if fixed length train value, calculate the length of index entry, and then can obtain the index entry number=index block space size/index entry length in index block, the index entry number in the data block number/each node in index block number=0th in ground floor layer.Application index block space, is inserted into index entry in index block in batches; If elongated train value, index block of first to file, then inserts tlv triple successively, until just apply for new index block while failing to lay down;

After step 6.3, one deck have created, turn to step 5.1, create last layer intermediate node according to same process, finish until whole B+ tree creates.

The invention provides a kind of B+ tree index tool that is applied to row storage DWMS has the following advantages:

1) provide improved B+ tree index node, the direct memory row data of leafy node, intermediate node is no longer considered sparse coefficient, all fills up, and has increased space availability ratio.For the data of same number, this structure guarantees that B tree hierachy is the shortest, has reduced and has searched number of times;

2) traditional insertion has been abandoned in the foundation of B+ tree, uses the method for bottom-up structure B+ tree.Make not need to consider in this way splitting operation, reduced a large amount of expenses;

3) line number in row storage system and train value data are set up respectively to B+ tree index, keep data to store in order, be convenient to tuple reconstruct and be connected with multilist.

Accompanying drawing explanation

Fig. 1 is the structure that B+ sets inner node (index entry);

Fig. 2 is the structure of B+ leaf node (data page).

Embodiment

For the present invention is become apparent, be hereby described in detail as follows with a preferred embodiment.

The invention provides a kind of B+ tree indexing means towards row storage DWMS, the steps include:

Step 3, multiway merge and heapsort are used in combination train value data are carried out to key assignments sequence, this step comprises:

Step 3.1, initialization: in internal memory, apply for a sequence district, its size is designated as K, and making K is the integral multiple of block size.If pending data is counted Blk_num > K according to piece, the method for multiway merge external sort will be adopted.The size of M while calculating multiway merge, sub-list number when M is merger, order in each sub-list for the treatment of merger, piece is counted D=Blk_num/M.

Step 3.3, the interim section of M is carried out to merge sort.

Step 4, B+ tree initialization, specifically comprise the descriptor of B+ tree carried out to assignment, needs theing contents are as follows of assignment:

(1) type of B+ tree index key assignments, comprises fixed/elongated, value type whether.The key assignments of fixed length comprises: SMALLINT, INT, NUMBER, CHAR, DATE, TIME; Elongated key assignments has the VARCHAR of comprising.

(2) information of row, comprises row name, row type, row length, if row are elongated, specifies maximum length.

(3) be B+ tree root piece allocation space.

(4) the level value that B+ tree is set is 0.

Step 5, establishment leafy node: the leafy node of application B+ tree, data item is filled directly into leafy node and obtains data block, form the 0th layer, B+ tree; This step is according to row storage characteristics, direct save data information in leaf piece, but not the pointer of key assignments and sensing data block.In search procedure, can directly obtain data by leafy node like this, reduce one time I/O.In addition, when node is filled, no longer consider sparse coefficient, the each node in tree fills up, room for promotion utilization factor, as shown in Figures 1 and 2.KEY in figure _iby < line number, train value > two parts composition.

Step 6, generation intermediate node: iteration is set up the middle layer node of B+ tree until whole B+ tree creates end from the bottom to top, the steps include:

Step 6.1, generating indexes item, the tlv triple that index entry is made up of first train value in lower one deck piece, the line number of correspondence and the piece number of this piece, the length of computation index item, is designated as TL;

Step 6.2, application index block space: judge that according to initialization information in step 4 train value is fixed length or elongated, if fixed length train value calculates the length of index entry.Index entry number=each index block space size/index entry length T L in index block.Calculate the index entry number in the data block number/each index block in index block number=0th in ground floor layer.Apply for the index block space of this layer.Index entry is inserted in index block in batches; If elongated train value, index block of first to file, then inserts tlv triple successively, until just apply for new index block while failing to lay down.

After step 6.3, one deck have created, level=level+1; Turn to step 6.1, create last layer intermediate node according to same process, difference is the index entry number in index block number/each index block of the index block number=lower one deck that calculates current layer.In the time that the index block number in this layer is 1, iteration finishes, and this index block is the root node of B+ tree.

Claims

1. towards a B+ tree indexing means of row storage DWMS, it is characterized in that, step is:

Step 3, multiway merge and heapsort are used in combination train value data are sorted, described step 3 comprises:

Step 3.1, initialization: in internal memory, apply for a sequence district, its size is designated as K, making K is the integral multiple of data block size to be sorted, if pending data is counted Blk_num>K according to piece, to adopt the method for multiway merge external sort, sub-list number when merger is M, order

in each sub-list for the treatment of merger, piece is counted D=Blk_num/M, source data in reading out data section, the train value Data Division in data block two tuples out, pack array into, put into sequence district, use Heap algorithm using array as input parameter, array is sorted; If Blk_num<=K, sequence finishes; Otherwise, sorted data item is reassembled into data block, write back in interim section;

Step 3.2, the interim section of N is carried out to merge sort;

Step 4, the initialization of B+ tree index;

Step 6.1, generating indexes item, the tlv triple that first train value, the line number of correspondence and the piece of this index block number that index entry is served as reasons in next layer index piece forms;

Step 6.2, judge that train value type is fixed-length data or elongated data, if fixed length train value, calculate the length of index entry, and then can obtain the index entry number=index block space size/index entry length in index block, index entry number in data block number/each node in index block number=0th in ground floor layer, application index block space, is inserted into index entry in index block in batches; If elongated train value, index block of first to file, then inserts tlv triple successively, until just apply for new index block while failing to lay down;

After step 6.3, one deck have created, turn to step 6.1, create last layer intermediate node according to same process, finish until whole B+ tree creates.