CN103092885A

CN103092885A - Method and device for creating sparse indexes, sparse index and query method and device

Info

Publication number: CN103092885A
Application number: CN2011103476374A
Authority: CN
Inventors: 周大; 钱岭; 郭磊涛; 齐骥
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2011-11-07
Filing date: 2011-11-07
Publication date: 2013-05-08

Abstract

The invention discloses a method and a device for creating a sparse index. For each data record to be processed, the same hash function is used to calculate the hash value of its key value, and the record is saved into the partition corresponding to that hash value, so that all data records saved in the same partition have the same hash value. Each partition is initially empty. When the data records saved in a partition meet a preset requirement, they are used to form a file block; when the saved data records that do not yet belong to a file block meet the preset requirement again, they are used to form another file block, and so on. An index entry is created for every file block formed. The method and the device speed up the creation of a sparse index. The invention also discloses a sparse index, and a query method and device based on the sparse index.

Description

Method and device for creating a sparse index, sparse index, and query method and device

Technical field

The present invention relates to data processing technology, and in particular to a method and a device for creating a sparse index, a sparse index, and a query method and device based on the sparse index.

Background technology

When data is loaded, an index is usually created for the data records to facilitate subsequent queries. The index may be a dense index, a sparse index, or the like.

A dense index requires one index entry for each data record, whereas a sparse index requires only one index entry for each group, where each group contains several data records.

In the prior art, a sparse index is usually created as follows: the data records to be processed, that is, the data records to be loaded, are sorted according to some rule, for example in ascending order of key value; the sorted data records are split into several groups; and an index entry is created for each group. Each index entry contains a key value and a pointer: the key value is typically the key value of the first data record in the group, and the pointer points to the starting position of the first data record in the group.
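The prior-art procedure described above can be sketched as follows. This is only an illustration: the function name `build_sorted_sparse_index`, the tuple layout of a record, and the use of a list offset as the "pointer" are assumptions made for the example and do not come from the original text.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (key value, payload); layout assumed for illustration


def build_sorted_sparse_index(
    records: List[Record], group_size: int
) -> Tuple[List[Record], List[Tuple[str, int]]]:
    """Sort the records, cut them into groups, and index each group's first record."""
    # The sort is the step that the conventional approach cannot avoid.
    ordered = sorted(records, key=lambda r: r[0])
    index = []
    for start in range(0, len(ordered), group_size):
        first_key = ordered[start][0]
        # The "pointer" here is simply the offset of the group's first record.
        index.append((first_key, start))
    return ordered, index


records = [("020101", "b"), ("010101", "a"), ("010301", "c"), ("020205", "d")]
ordered, index = build_sorted_sparse_index(records, group_size=2)
```

Every record has to pass through the sort before the first index entry can be written, which is the cost the invention sets out to avoid.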

Fig. 1 is a schematic diagram of a sparse index created in the existing manner. As shown in Fig. 1, 010101, 020101, and so on are key values, and each row indicated by a thick arrow is a data record; the first 3 data records form one group, and the last 4 data records form another group.

However, this approach has a problem in practice: the data records must be sorted before any subsequent processing can be carried out, and sorting is complex to implement, so the sparse index is created very slowly.

Summary of the invention

In view of this, the present invention provides a method and a device for creating a sparse index, which can speed up the creation of the sparse index.

The present invention also provides a sparse index, and a query method and device based on the sparse index.

To achieve the above objects, the technical solution of the present invention is implemented as follows:

A method for creating a sparse index comprises:

for each data record to be processed, calculating the hash value of its key value using the same hash function, and saving the data record into the corresponding partition according to the calculated hash value, where the data records saved in the same partition have the same hash value;

for any partition, which is initially empty: when the saved data records meet a preset requirement, forming a file block from the saved data records; when the saved data records that do not yet belong to a file block meet the preset requirement again, forming another file block from those data records; and so on; and, each time a file block is formed, creating an index entry for that file block.

A device for creating a sparse index comprises:

a computing module, configured to calculate, for each data record to be processed, the hash value of its key value using the same hash function, and to send the data record and the calculated hash value to a creation module;

the creation module, configured to save each received data record into the corresponding partition according to the received hash value, where the data records saved in the same partition have the same hash value; and, for any partition, which is initially empty: to form a file block from the saved data records when they meet a preset requirement; to form another file block from the saved data records that do not yet belong to a file block when those records meet the preset requirement again; and so on; and to create an index entry for each file block as it is formed.

A sparse index comprises:

an index entry corresponding to each file block in each partition, where each partition has a number different from that of every other partition, and each file block has a number different from that of every other file block in the same partition.

Each index entry contains a maximum key value, a minimum key value, a partition number, a file block number, and a hash function name, where:

the maximum key value is the largest of the key values of the data records in the file block corresponding to the index entry;

the minimum key value is the smallest of the key values of the data records in the file block corresponding to the index entry;

the partition number is the number of the partition to which the file block corresponding to the index entry belongs;

the file block number is the number of the file block corresponding to the index entry.
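As a rough illustration of the five fields of an index entry, the sketch below models one entry as a Python dataclass. The class name `IndexEntry`, the field names, the helper `make_entry`, the hash function name `hash_by_hour`, and the sample keys are all hypothetical; the source only specifies which five values an entry holds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexEntry:
    max_key: str       # largest key value among the records of the file block
    min_key: str       # smallest key value among the records of the file block
    partition_no: int  # number of the partition that holds the file block
    block_no: int      # number of the file block within its partition
    hash_name: str     # name of the hash function used for partitioning


def make_entry(block_keys: List[str], partition_no: int, block_no: int,
               hash_name: str) -> IndexEntry:
    """Create the index entry for a file block from the keys it contains."""
    return IndexEntry(max(block_keys), min(block_keys),
                      partition_no, block_no, hash_name)


# Hypothetical keys; no sorting of the block is needed to find min and max.
entry = make_entry(["201111070015", "201111060002", "201111080059"],
                   partition_no=1, block_no=1, hash_name="hash_by_hour")
```

Note that the records inside a file block stay unsorted; only the two extreme key values are recorded.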

A query method based on the above sparse index comprises:

receiving a query key value, finding, among all index entries, those whose minimum key value is less than or equal to the query key value and whose maximum key value is greater than or equal to the query key value, and taking the index entries found as candidate index entries;

for each candidate index entry, using the hash function identified by the hash function name in the entry to calculate the hash value of the query key value and the hash value of the minimum or maximum key value in the entry, and, if the two hash values are equal, taking the candidate index entry as a result index entry;

traversing the data records in the file block corresponding to each result index entry to obtain the data records corresponding to the query key value.

A query device based on the above sparse index comprises:

a receiving module, configured to receive a query key value and send it to a processing module;

the processing module, configured to find, among all index entries, those whose minimum key value is less than or equal to the query key value and whose maximum key value is greater than or equal to the query key value, and to take the index entries found as candidate index entries; for each candidate index entry, to use the hash function identified by the hash function name in the entry to calculate the hash value of the query key value and the hash value of the minimum or maximum key value in the entry, and, if the two hash values are equal, to take the candidate index entry as a result index entry; and to traverse the data records in the file block corresponding to each result index entry to obtain the data records corresponding to the query key value.

It can be seen that, with the technical solution of the present invention, a sparse index can be created without sorting the data records to be processed, which speeds up the creation of the sparse index, and data queries can be performed based on this sparse index.

Description of drawings

Fig. 1 is a schematic diagram of a sparse index created in the existing manner.

Fig. 2 is a flowchart of an embodiment of the method for creating a sparse index according to the present invention.

Fig. 3 is a schematic diagram of the process of creating a sparse index according to the present invention.

Fig. 4 is a schematic diagram of a sparse index created in the manner of the present invention.

Fig. 5 is a schematic structural diagram of an embodiment of the device for creating a sparse index according to the present invention.

Fig. 6 is a schematic structural diagram of an embodiment of the query device based on the sparse index according to the present invention.

Detailed description of the embodiments

To address the problems in the prior art, the present invention proposes an improved scheme for creating a sparse index that does not require sorting the data records to be processed, and thereby speeds up the creation of the sparse index.

To make the technical solution of the present invention clearer, the solution is described in further detail below with reference to the accompanying drawings and embodiments.

Fig. 2 is a flowchart of an embodiment of the method for creating a sparse index according to the present invention. As shown in Fig. 2, the method comprises the following steps:

Step 21: For each data record to be processed, calculate the hash value of its key value using the same hash function, and save the data record into the corresponding partition according to the calculated hash value; the data records saved in the same partition have the same hash value.

An index usually needs to be created when data is loaded, so the data records to be processed are typically the data records to be loaded.

In this step, the data records to be processed are obtained one by one, and each obtained data record is processed as follows:

1) calculate the hash value of the key value of the data record using the hash function;

2) save the data record into the corresponding partition according to the calculated hash value; the data records saved in the same partition have the same hash value.

Which hash function is used to calculate the hash values can be decided according to actual requirements. There are as many partitions as there are distinct hash values that the hash function can produce.
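A minimal sketch of Step 21, assuming integer key values and a modulo hash chosen purely for illustration (the source leaves the choice of hash function open). The names `hash_mod` and `partition` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, str]  # (key value, payload); layout assumed for illustration


def hash_mod(key: int, buckets: int = 4) -> int:
    """Illustrative hash function: at most `buckets` distinct hash values."""
    return key % buckets


def partition(records: List[Record],
              hash_fn: Callable[[int], int]) -> Dict[int, List[Record]]:
    """Save each record into the partition identified by the hash of its key."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for key, payload in records:
        # Records are appended in arrival order; nothing is sorted.
        partitions[hash_fn(key)].append((key, payload))
    return dict(partitions)


parts = partition([(1, "a"), (5, "b"), (2, "c"), (8, "d")], hash_mod)
```

With this hash function there can be at most 4 partitions, one per distinct hash value; the sample input happens to populate 3 of them.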

Step 22: For any partition, which is initially empty: when the saved data records meet a preset requirement, form a file block from the saved data records; when the saved data records that do not yet belong to a file block meet the preset requirement again, form another file block from those data records; and so on. Each time a file block is formed, create an index entry for that file block.

Any partition is initially empty, and file blocks are produced one after another as the number of saved data records grows. That is, when the saved data records meet the preset requirement, a file block is formed from them; afterwards, when the saved data records that do not yet belong to a file block meet the preset requirement again, another file block is formed from those data records; and so on. In particular, when there are no more new data records to save, that is, all data records to be processed have been handled, but the saved data records that do not yet belong to a file block do not meet the preset requirement, a file block is formed from those remaining data records.

For example, suppose that meeting the preset requirement means reaching a preset number, and that the preset number is 100. Then, for any partition, when the number of saved data records reaches 100, these 100 data records form a file block; after that, once another 100 data records have been saved, these 100 newly saved data records form a new file block; and so on. When there are no more new data records to save but 50 of the saved data records do not yet belong to any file block, these 50 data records form a file block.
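The 100-record example above can be sketched for a single partition as follows. The class `Partition` and its method names are hypothetical, and the total of 250 input records is chosen only so that a remainder of 50 is left over, as in the example.

```python
from typing import Any, List


class Partition:
    """One partition: starts empty and seals a file block every `threshold` records."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.pending: List[Any] = []       # saved records not yet in any file block
        self.blocks: List[List[Any]] = []  # file blocks in order of formation

    def add(self, record: Any) -> None:
        self.pending.append(record)
        if len(self.pending) >= self.threshold:
            self._seal()

    def finish(self) -> None:
        """Called when no more records will arrive: flush the remainder, if any."""
        if self.pending:
            self._seal()

    def _seal(self) -> None:
        self.blocks.append(self.pending)
        self.pending = []


p = Partition(threshold=100)
for i in range(250):
    p.add(i)
p.finish()
block_sizes = [len(b) for b in p.blocks]  # two full blocks, then the remainder
```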

Note that meeting the preset requirement is not limited to reaching a preset number; it may also mean meeting another requirement, for example the total amount of data reaching a preset threshold.
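One way to keep the requirement interchangeable is to pass it in as a predicate over the records not yet in a file block. The sketch below is an assumption about how this could be done, not the patent's stated design; `split_by_requirement`, `by_count`, `by_bytes`, and the thresholds 3 and 10 are illustrative.

```python
from typing import Callable, List


def split_by_requirement(records: List[bytes],
                         reached: Callable[[List[bytes]], bool]) -> List[List[bytes]]:
    """Form a file block whenever `reached(pending)` holds; flush the rest at the end."""
    blocks: List[List[bytes]] = []
    pending: List[bytes] = []
    for record in records:
        pending.append(record)
        if reached(pending):
            blocks.append(pending)
            pending = []
    if pending:
        blocks.append(pending)
    return blocks


def by_count(pending: List[bytes]) -> bool:
    return len(pending) >= 3                  # requirement: record count


def by_bytes(pending: List[bytes]) -> bool:
    return sum(len(r) for r in pending) >= 10  # requirement: total data size


records = [b"aaaa", b"bbbb", b"cccc", b"dd", b"eeeeeeeeee", b"f"]
count_sizes = [len(b) for b in split_by_requirement(records, by_count)]
byte_sizes = [len(b) for b in split_by_requirement(records, by_bytes)]
```

The same input yields different file blocks depending on which requirement is plugged in.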

In practice, each partition may be given a number different from that of every other partition; for example, N partitions may be numbered partition 1 to partition N. Likewise, each file block may be given a number different from that of every other file block in the same partition; for example, M file blocks in one partition may be numbered file block 1 to file block M. M and N are positive integers greater than 1. Usually, the earlier a file block is formed, the smaller its number: the file block formed first is file block 1, the next is file block 2, the one after that is file block 3, and so on.

Each time a file block is formed, an index entry is created for it, containing 5 attribute values: maximum key value, minimum key value, partition number, file block number, and hash function name;

The file block number is the number of the file block corresponding to the index entry;

The hash function name is the name of the hash function used when calculating the hash values.

Based on the above description, the schematic diagram of the sparse index creation process shown in Fig. 3 is obtained. Each data record to be processed is handled in the manner shown in Fig. 3, finally yielding several file blocks, each of which corresponds to one index entry.
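Putting Steps 21 and 22 together, the whole creation process might look like the sketch below. It is an illustration under several assumptions that are not in the source: keys are strings laid out as YYYYMMDDHHMM, the hash function returns the hour, partition numbers are obtained as hash value plus 1, the preset requirement is a record count, and index entries are plain dictionaries.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (key value, payload); layout assumed for illustration


def hash_by_hour(key: str) -> int:
    return int(key[8:10])  # assumed key layout: YYYYMMDDHHMM


def build_sparse_index(records: List[Record], hash_fn: Callable[[str], int],
                       hash_name: str, threshold: int):
    pending: Dict[int, List[Record]] = defaultdict(list)  # records not yet in a block
    block_count: Dict[int, int] = defaultdict(int)        # blocks formed per partition
    blocks: Dict[Tuple[int, int], List[Record]] = {}
    index: List[dict] = []

    def seal(part_no: int) -> None:
        """Turn the pending records of a partition into a file block and index it."""
        block_count[part_no] += 1
        block = pending.pop(part_no)
        blocks[(part_no, block_count[part_no])] = block
        keys = [k for k, _ in block]
        index.append({"max_key": max(keys), "min_key": min(keys),
                      "partition_no": part_no, "block_no": block_count[part_no],
                      "hash_name": hash_name})

    for key, payload in records:
        part_no = hash_fn(key) + 1          # assumed numbering: partitions start at 1
        pending[part_no].append((key, payload))
        if len(pending[part_no]) >= threshold:
            seal(part_no)
    for part_no in sorted(pending):         # no more input: flush the remainders
        seal(part_no)
    return blocks, index


records = [("201111070010", "a"), ("201111070120", "b"), ("201111070030", "c"),
           ("201111070005", "d"), ("201111070140", "e")]
blocks, index = build_sparse_index(records, hash_by_hour, "hash_by_hour", threshold=2)
```

The input is consumed in a single pass and never sorted; each record costs one hash computation and one append.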

Fig. 4 is a schematic diagram of a sparse index created in the manner of the present invention. As shown in Fig. 4, each row indicated by an arrow is an index entry.

The present invention also provides a query method based on the above sparse index, comprising:

1) Receive a query key value, find, among all index entries, those whose minimum key value is less than or equal to the query key value and whose maximum key value is greater than or equal to the query key value, and take the index entries found as candidate index entries.

That is, all index entries are searched, and those satisfying the condition "minimum key value ≤ query key value ≤ maximum key value" are taken as candidate index entries.

2) For each candidate index entry, use the hash function identified by the hash function name in the entry to calculate the hash value of the query key value and the hash value of the minimum or maximum key value in the entry; if the two hash values are equal, take the candidate index entry as a result index entry.

3) Traverse the data records in the file block corresponding to each result index entry to obtain the data records corresponding to the query key value.

Steps 1) and 2) yield only the result index entries. To obtain the actual data records, the file block corresponding to each result index entry must be located from the file block number and partition number in the entry, and these file blocks must then be traversed.
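The three query steps can be sketched as follows. The `IndexEntry` class, the `HASH_FUNCTIONS` registry that maps a hash function name to a callable, the YYYYMMDDHHMM key layout, and the in-memory `blocks` dictionary keyed by (partition number, file block number) are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (key value, payload); layout assumed for illustration


@dataclass(frozen=True)
class IndexEntry:
    max_key: str
    min_key: str
    partition_no: int
    block_no: int
    hash_name: str


def hash_by_hour(key: str) -> int:
    return int(key[8:10])  # assumed key layout: YYYYMMDDHHMM


# Hypothetical registry resolving the stored hash function name to a function.
HASH_FUNCTIONS: Dict[str, Callable[[str], int]] = {"hash_by_hour": hash_by_hour}


def query(index: List[IndexEntry], blocks: Dict[Tuple[int, int], List[Record]],
          query_key: str) -> List[Record]:
    # Step 1: range filter gives the candidate index entries.
    candidates = [e for e in index if e.min_key <= query_key <= e.max_key]
    # Step 2: keep candidates whose partition hash matches the query key's hash.
    results = []
    for e in candidates:
        h = HASH_FUNCTIONS[e.hash_name]
        if h(query_key) == h(e.min_key):
            results.append(e)
    # Step 3: scan the file blocks of the result entries for exact key matches.
    return [(k, v) for e in results
            for k, v in blocks[(e.partition_no, e.block_no)] if k == query_key]


blocks = {
    (1, 1): [("201111070010", "x"), ("201111080030", "y")],  # records from hour 00
    (2, 1): [("201111070110", "z"), ("201111080159", "w")],  # records from hour 01
}
index = [IndexEntry("201111080030", "201111070010", 1, 1, "hash_by_hour"),
         IndexEntry("201111080159", "201111070110", 2, 1, "hash_by_hour")]
found = query(index, blocks, "201111080030")
```

In this sample both entries pass the range filter, because key ranges of different partitions overlap; the hash comparison then discards the entry for partition 2, so only one file block is scanned.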

The above process of creating a sparse index and the corresponding query process can be illustrated as a whole by the following example:

In the telecommunications field, the key value of each data record usually carries the generation time of the data record, which may include the year, month, day, hour, minute, and so on.

In that case, a hash function that partitions by hour can be used to save the data records into different partitions. For example, for data record X, if its key value shows that it was generated at 0 o'clock (ignoring the year, month, day, and minute), it is saved in partition 1; for data record Y, if its key value shows that it was generated at 1 o'clock, it is saved in partition 2; for data record Z, if its key value shows that it was generated at 2 o'clock, it is saved in partition 3; and so on. This gives 24 partitions in total, which can be numbered partition 1 to partition 24.
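A possible form of the hour-based hash function is sketched below. The key layout YYYYMMDDHHMM and the names `hash_by_hour` and `partition_number` are assumptions for illustration; the source only states that the hour determines the partition and that hour 0 maps to partition 1.

```python
def hash_by_hour(key: str) -> int:
    """Return the hour (0-23) carried in a key laid out as YYYYMMDDHHMM (assumed)."""
    return int(key[8:10])


def partition_number(key: str) -> int:
    """Map hour 0 to partition 1, hour 1 to partition 2, ..., hour 23 to partition 24."""
    return hash_by_hour(key) + 1


record_x = partition_number("201111070015")  # generated at 0 o'clock
record_y = partition_number("201111070130")  # generated at 1 o'clock
record_z = partition_number("201111060245")  # generated at 2 o'clock, a different day
```

Records from the same hour of different days land in the same partition, which is why the key ranges of file blocks in different partitions can overlap.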

Each partition contains several file blocks. Suppose each partition contains 5 file blocks (in practice the number of file blocks usually differs from partition to partition); these 5 file blocks can then be numbered file block 1 to file block 5. An index entry is created for each file block, containing a maximum key value, a minimum key value, a partition number, a file block number, and a hash function name. The maximum and minimum key values are both complete time information, that is, they include the year, month, day, hour, minute, and so on. For any two key values A and B, if the time in key value A is closer to the current time than the time in key value B, key value A is considered greater than key value B.
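The ordering rule for time-based key values can be sketched as below, reading "closer to the current time" as "later", which holds when all generation times lie in the past. The key layout YYYYMMDDHHMM and the names `KEY_FORMAT`, `key_time`, and `key_greater` are assumptions for illustration.

```python
from datetime import datetime

KEY_FORMAT = "%Y%m%d%H%M"  # assumed key layout: YYYYMMDDHHMM


def key_time(key: str) -> datetime:
    """Parse the complete time information carried in a key value."""
    return datetime.strptime(key, KEY_FORMAT)


def key_greater(a: str, b: str) -> bool:
    """True if key value a carries a later generation time than key value b."""
    return key_time(a) > key_time(b)


keys = ["201111072359", "201111080030", "201012312359"]
max_key = max(keys, key=key_time)  # maximum key value of this file block
min_key = min(keys, key=key_time)  # minimum key value of this file block
```

For fixed-width, zero-padded keys like these, plain string comparison gives the same order as comparing the parsed times, so the maximum and minimum can also be found without parsing.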

Subsequently, when a query key value is received, the candidate index entries are determined from the complete time information in the query key value and the maximum and minimum key values in each index entry. The file blocks corresponding to these candidate index entries may come from different partitions. Index entries that do not qualify are filtered out by calculating hash values, and the remaining ones are the result index entries. Finally, the data records in the file block corresponding to each result index entry are traversed to obtain the data records corresponding to the query key value, that is, the data records whose key value equals the query key value.

Based on the above description, Fig. 5 is a schematic structural diagram of an embodiment of the device for creating a sparse index according to the present invention. As shown in Fig. 5, the device comprises:

a creation module, configured to save each received data record into the corresponding partition according to the received hash value, where the data records saved in the same partition have the same hash value; and, for any partition, which is initially empty: to form a file block from the saved data records when they meet a preset requirement; to form another file block from the saved data records that do not yet belong to a file block when those records meet the preset requirement again; and so on; and to create an index entry for each file block as it is formed.

The creation module may be further configured to, for any partition, form a file block from the saved data records that do not yet belong to a file block when there are no more new data records to save but those records do not meet the preset requirement.

Meeting the preset requirement typically means reaching a preset number.

Each partition has a number different from that of every other partition, and each file block has a number different from that of every other file block in the same partition.

The device shown in Fig. 5 may further comprise:

a query module, configured to receive a query key value, find, among all index entries, those whose minimum key value is less than or equal to the query key value and whose maximum key value is greater than or equal to the query key value, and take the index entries found as candidate index entries; for each candidate index entry, to use the hash function identified by the hash function name in the entry to calculate the hash value of the query key value and the hash value of the minimum or maximum key value in the entry, and, if the two hash values are equal, to take the candidate index entry as a result index entry; and to traverse the data records in the file block corresponding to each result index entry to obtain the data records corresponding to the query key value.

Fig. 6 is a schematic structural diagram of an embodiment of the query device based on the sparse index according to the present invention. As shown in Fig. 6, the device comprises:

a processing module, configured to find, among all index entries, those whose minimum key value is less than or equal to the query key value and whose maximum key value is greater than or equal to the query key value, and to take the index entries found as candidate index entries; for each candidate index entry, to use the hash function identified by the hash function name in the entry to calculate the hash value of the query key value and the hash value of the minimum or maximum key value in the entry, and, if the two hash values are equal, to take the candidate index entry as a result index entry; and to traverse the data records in the file block corresponding to each result index entry to obtain the data records corresponding to the query key value.

For the structure of the index entry, refer to the foregoing description; it is not repeated here.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for creating a sparse index, characterized by comprising:

2. The method according to claim 1, characterized in that the method further comprises: for any partition, when there are no more new data records to save but the saved data records that do not yet belong to a file block do not meet the preset requirement, forming a file block from those data records.

3. The method according to claim 1 or 2, characterized in that meeting the preset requirement comprises: reaching a preset number.

4. The method according to claim 1, characterized in that

5. The method according to claim 4, characterized in that, after the sparse index has been created, the method further comprises:

6. A device for creating a sparse index, characterized by comprising:

7. The device according to claim 6, characterized in that the creation module is further configured to, for any partition, when there are no more new data records to save but the saved data records that do not yet belong to a file block do not meet the preset requirement, form a file block from those data records.

8. The device according to claim 6 or 7, characterized in that meeting the preset requirement comprises: reaching a preset number.

9. The device according to claim 6, characterized in that

10. The device according to claim 9, characterized in that the device further comprises:

11. A sparse index, characterized by comprising:

12. A query method based on the sparse index according to claim 11, characterized by comprising:

13. A query device based on the sparse index according to claim 11, characterized by comprising: