CN101782922B

CN101782922B - Multi-level bucket hashing index method for searching mass data

Info

Publication number: CN101782922B
Application number: CN2009102561033A
Authority: CN
Inventors: 王希常; 马磊; 刘江
Original assignee: SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Current assignee: SHANDONG SHANDA OUMA SOFTWARE CO., LTD.
Priority date: 2009-12-29
Filing date: 2009-12-29
Publication date: 2012-01-18
Anticipated expiration: 2029-12-29
Also published as: CN101782922A

Abstract

The invention relates to a multi-level bucket hash indexing method for mass data retrieval, in the field of mass data storage, characterized in that: (1) a bucket mapping table is provided to reduce the disk space occupied by the hash index file; (2) the hash index uses multi-level buckets: the size of each disk block is an integer multiple of the sector size, and each disk block holds one or more basic buckets and may also be provided with an in-block overflow bucket and with global overflow buckets; (3) a data cache structure for the index file provides a data cache mapping table, and the data cache is managed through a doubly linked list on the cache mapping table. The mapping table reduces the disk space occupied by the index file; because the disk block size is an integer multiple of the disk sector size and the data cache structure is used, the number of disk reads and writes is reduced, and both memory utilization and data retrieval efficiency are improved.

Description

A multi-level bucket hash indexing method for mass data retrieval

Technical field

The present invention relates to a multi-level bucket hash indexing method for mass data retrieval, and belongs to the technical field of data storage and retrieval.

Background technology

Retrieval efficiency is an important indicator for mass data storage and service application systems, and indexing technology plays an important role in organizing and retrieving data storage space. Large databases and data storage application systems currently all support hash table indexing. As data sources grow rapidly, how to obtain information of interest quickly and accurately has become a major concern; characteristics such as massive scale therefore place higher demands on retrieval technology, and information retrieval, filtering, and extraction techniques have gradually become a research focus. A very important advantage of a hash index is that retrieval time does not increase as the data volume grows; the main factors affecting hash performance are the number of disk reads and writes and hash collisions. Hash indexes currently take two main forms: static hash indexes and dynamic hash indexes.

Summary of the invention

To address the deficiencies of the prior art, the present invention provides a multi-level bucket hash indexing method for mass data retrieval.

A multi-level bucket hash indexing method for mass data retrieval comprises a creation method and a search method for the hash index. The creation method of the hash index is as follows:

1) Determine a keyword for the information to be indexed;

2) In computer memory, set up the mapping table of index buckets, i.e. the mapping from the hash value h of the keyword to the storage location c of the index bucket on disk;

3) Determine whether the index bucket is on disk, i.e. whether the value of the storage location equals the maximum 8-byte value. If it equals the maximum 8-byte value, no index bucket has been stored on disk; continue with step 4). If it does not equal the maximum 8-byte value, an index bucket is already stored on disk; go to step 7);

4) When no index bucket is stored on disk, create a new disk block d on disk and store the information, set up a new index bucket, and determine the sequence number of the new index bucket within disk block d;

5) Update the mapping table so that c = d;

6) Update the disk and write the data to storage;

7) When an index bucket is already stored on disk, determine the sequence number of this index bucket within its disk block;

8) Determine whether this index bucket has enough space to store the new keyword. If there is enough space, go to step 6). If there is not enough space, the keyword overflows from this index bucket and is stored in the overflow bucket within the disk block. If the in-block overflow bucket does not have enough space either, the keyword overflows from the in-block overflow bucket and is stored in the global overflow bucket.
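The creation steps above can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy hash function `hash_key`, the table size `NUM_SLOTS`, the capacities `BUCKET_CAPACITY` and `BLOCK_OVERFLOW_CAPACITY`, and the class names are all hypothetical; a Python list stands in for the index file on disk; and each disk block holds a single basic bucket, so the bucket's sequence number within its block is always 0.

```python
from dataclasses import dataclass, field

EMPTY = 0xFFFFFFFFFFFFFFFF   # maximum 8-byte value: no bucket stored on disk yet
NUM_SLOTS = 8                # assumed mapping-table size
BUCKET_CAPACITY = 2          # assumed keys per basic bucket
BLOCK_OVERFLOW_CAPACITY = 2  # assumed keys per in-block overflow bucket


def hash_key(key: str) -> int:
    """Toy deterministic hash (an assumption, not the patent's function)."""
    return sum(key.encode("utf-8")) % NUM_SLOTS


@dataclass
class DiskBlock:
    """Stand-in for one disk block: one basic bucket plus its overflow bucket."""
    bucket: list = field(default_factory=list)
    overflow: list = field(default_factory=list)


class MultiLevelIndex:
    def __init__(self) -> None:
        self.mapping = [EMPTY] * NUM_SLOTS  # hash value h -> block location c
        self.blocks = []                    # stands in for the index file
        self.global_overflow = []           # global overflow bucket

    def insert(self, key: str) -> str:
        """Store a keyword and report which level received it."""
        h = hash_key(key)                           # steps 1-2
        if self.mapping[h] == EMPTY:                # step 3: sentinel check
            self.blocks.append(DiskBlock())         # step 4: new block d
            self.mapping[h] = len(self.blocks) - 1  # step 5: c = d
        block = self.blocks[self.mapping[h]]        # step 7: locate the bucket
        if len(block.bucket) < BUCKET_CAPACITY:     # step 8: room in the bucket?
            block.bucket.append(key)                # step 6: write
            return "bucket"
        if len(block.overflow) < BLOCK_OVERFLOW_CAPACITY:
            block.overflow.append(key)              # in-block overflow bucket
            return "block_overflow"
        self.global_overflow.append(key)            # global overflow bucket
        return "global_overflow"
```

Under this toy hash, the keys "a", "i", "q", "y", and "A" all map to slot 1, so inserting them in that order fills the basic bucket, then the in-block overflow bucket, and sends the last key to the global overflow bucket. Because blocks are allocated only when a slot is first used, no empty buckets are written to the index file.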

The search method of the hash index is as follows:

1) Determine a keyword for the information to be retrieved;

2) Read the mapping table;

3) Determine whether the index bucket to be retrieved is on disk, i.e. whether the value of the storage location equals the maximum 8-byte value. If it equals the maximum 8-byte value, the index bucket to be retrieved is not stored on disk, and the retrieval ends. If it does not equal the maximum 8-byte value, the index bucket to be retrieved is already stored on disk; go to step 4);

4) If the value does not equal the maximum 8-byte value, obtain from the mapping table the number of the index bucket to be retrieved and the number of the disk block that contains it;

5) Search within the bucket. If the keyword is found, the retrieval ends; if it is not found, search the overflow bucket within the disk block;

6) If the keyword is found in the in-block overflow bucket, the retrieval ends; if it is not found there, search the global overflow bucket, and the retrieval ends.
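The search steps above can be sketched as a single function. This is an illustrative sketch only: the toy hash `hash_key`, the table size `NUM_SLOTS`, and the dictionary layout of a block are assumptions made for the example, and plain Python lists stand in for the buckets on disk.

```python
EMPTY = 0xFFFFFFFFFFFFFFFF  # maximum 8-byte value: no bucket stored on disk
NUM_SLOTS = 8               # assumed mapping-table size


def hash_key(key):
    """Toy deterministic hash (an assumption, not the patent's function)."""
    return sum(key.encode("utf-8")) % NUM_SLOTS


def lookup(key, mapping, blocks, global_overflow):
    """Return the level where the key was found, or None if it is absent."""
    c = mapping[hash_key(key)]        # steps 1-2: keyword -> mapping entry
    if c == EMPTY:                    # step 3: no bucket on disk, stop early
        return None
    block = blocks[c]                 # step 4: block that holds the bucket
    if key in block["bucket"]:        # step 5: search the basic bucket
        return "bucket"
    if key in block["overflow"]:      # step 6: search the in-block overflow
        return "block_overflow"
    if key in global_overflow:        # step 6: fall back to global overflow
        return "global_overflow"
    return None


# Hand-built example state: slot 1 points at block 0, all other slots are empty.
mapping = [EMPTY] * NUM_SLOTS
mapping[1] = 0
blocks = [{"bucket": ["a", "i"], "overflow": ["q"]}]
global_overflow = ["y"]
```

In this sketch a key whose mapping entry still holds the sentinel is rejected without touching any block, which mirrors the early exit in step 3.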

When storing and retrieving mass data, the index file itself is large and the hash index occupies considerable space. To keep the hash index file as small as possible and to improve disk utilization and file read performance, the invention provides a bucket mapping table, which avoids empty buckets in the hash index file. To reduce the number of disk reads during data retrieval, the invention provides cache management, which improves memory utilization: when a bucket's data is already in memory, it is accessed directly from memory and the disk read is avoided. To limit the performance loss caused by hash collisions, the invention provides in-block overflow buckets and global overflow buckets based on the disk block structure, which reduces the disk reads and writes caused by collisions and improves efficiency. Experiments show that the invention has high practical value.

The present invention makes full use of disk and memory, reduces the number of disk reads and writes, and improves the efficiency of mass data storage and retrieval.
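The cache management described here can be illustrated with a small block cache built from a mapping table and a doubly linked list. The text states only that the cache is managed through a doubly linked list on the cache mapping table; the least-recently-used eviction policy, the class name `BlockCache`, the `read_block` callback, and the `disk_reads` counter are assumptions added for this sketch.

```python
class _Node:
    """One cached disk block, linked into a doubly linked list."""
    __slots__ = ("block_no", "data", "prev", "next")

    def __init__(self, block_no=None, data=None):
        self.block_no = block_no
        self.data = data
        self.prev = None
        self.next = None


class BlockCache:
    def __init__(self, capacity, read_block):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.read_block = read_block  # hypothetical callback doing the disk read
        self.table = {}               # cache mapping table: block number -> node
        self.head = _Node()           # sentinel on the most recently used side
        self.tail = _Node()           # sentinel on the least recently used side
        self.head.next = self.tail
        self.tail.prev = self.head
        self.disk_reads = 0           # counts simulated disk reads

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, block_no):
        node = self.table.get(block_no)
        if node is not None:          # hit: serve from memory, no disk read
            self._unlink(node)
            self._push_front(node)
            return node.data
        self.disk_reads += 1          # miss: read the block from disk
        data = self.read_block(block_no)
        if len(self.table) >= self.capacity:
            victim = self.tail.prev   # assumed policy: evict least recently used
            self._unlink(victim)
            del self.table[victim.block_no]
        node = _Node(block_no, data)
        self.table[block_no] = node
        self._push_front(node)
        return data
```

For example, with `cache = BlockCache(2, lambda n: f"block-{n}")`, requesting block 0 twice in a row triggers only one simulated disk read, since the second request is served from the mapping table.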

Description of drawings

Fig. 1 is a flowchart of index creation.

Fig. 2 is a flowchart of index retrieval.

Embodiment:

1) Determine a keyword for the information to be indexed;

5) Update the mapping table so that c = d;

6) Update the disk and write the data to storage;

The search method of the hash index is as follows:

1) Determine a keyword for the information to be retrieved;

2) Read the mapping table;

Claims

1. A multi-level bucket hash indexing method for mass data retrieval, characterized in that the method comprises a creation method and a search method for the hash index, wherein the creation method of the hash index is as follows:

1) Determine a keyword for the information to be indexed;

5) Update the mapping table so that c = d;

6) Update the disk and write the data to storage;

8) Determine whether this index bucket has enough space to store the new keyword. If there is enough space, go to step 6). If there is not enough space, the keyword overflows from this index bucket and is stored in the overflow bucket within the disk block. If the in-block overflow bucket does not have enough space either, the keyword overflows from the in-block overflow bucket and is stored in the global overflow bucket;

The search method of the hash index is as follows:

1) Determine a keyword for the information to be retrieved;

2) Read the mapping table;