CN111625534A

CN111625534A - Data structure for hash operation and hash table storage and query method based on structure

Info

Publication number: CN111625534A
Application number: CN202010274860.XA
Authority: CN
Inventors: 刘冬培; 刘勤让; 吕平; 沈剑良; 宋克; 陈艇; 李沛杰; 汤先拓; 张丽; 张文建
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-09-04

Abstract

The invention belongs to the field of computer hash data structures and relates in particular to a data structure for hash operation and to a hash table storage and query method based on that structure. To preserve query efficiency, each hash sub-table is implemented in hardware with a dual-port memory bank, which supports reading the contents of two addresses of the bank simultaneously, so that key value comparison completes within a deterministic time and queries can be pipelined efficiently. More hash operations provide more candidate positions, which reduces the probability of hash collision, improves the efficiency of inserting, storing and updating hash table entries, supports dynamic changes in the number of entries, and avoids the space waste or performance degradation caused by entry insertion; the method therefore suits applications in which the hash table entries are not known in advance and change continuously. A CRC algorithm is used as the hash function, so the hash results have good uniqueness, and in a concrete implementation they can be computed with XOR operations in a parallel pipelined structure, which facilitates hardware design.

Description

Data structure for hash operation and hash table storage and query method based on structure

Technical Field

The invention belongs to the field of computer hash data structures, and particularly relates to a data structure for hash operation and a hash table storage and query method based on the structure, which are suitable for the optimization design and hardware realization of a high-performance hash table.

Background

The hash table is a key data structure for managing network messages. Based on a message keyword, a hash function operation gives access to the index information stored for the corresponding key value, which greatly improves search efficiency. A hash function is essentially a transformation that maps elements from a larger input space to a smaller index space, so two or more keys may map to the same location; this condition is known as an address collision in the hash table. When a collision occurs, the insertion efficiency of the hash table is affected. An efficient hash table design therefore needs to reduce or resolve the impact of collisions as far as possible, support fast query and insert/update operations, and ensure high storage space utilization.

Cuckoo hashing is an efficient solution to the traditional key-value lookup problem. The basic idea of table construction is to build d (d ≥ 2) hash tables, each corresponding to one hash function. For each key value, the storage address in each hash table is computed with the corresponding one of the d hash functions, and the key value is guaranteed to be stored at one of the d candidate storage addresses. A lookup only needs to query the d candidate positions. When an address collision occurs during an insertion, however, a large number of existing entries may have to be moved to reorganize the table until the collision is eliminated. By probing several candidate positions, the cuckoo hash algorithm always completes a query within a deterministic time, but collision handling may degrade the performance of inserting or updating key values.

Disclosure of Invention

Therefore, the invention provides a data structure for hash operation and a hash table storage and query method based on the structure. They reduce or resolve the impact of collisions, support fast query and insert/update operations, ensure high storage space utilization, and can be used for fixed-length exact matching of messages or key value lookup in network communication.

According to the design scheme provided by the invention, the data structure for hash operation comprises d hash sub-tables, where d ≥ 2. Each hash sub-table corresponds to two hash functions, and each key value obtains its candidate addresses through the resulting 2d hash functions.

As a data structure for hash operation in the present invention, further, each hash sub-table uses a dual-port storage structure to store key values.

As a data structure for hash operation of the present invention, further, a CRC algorithm is adopted as the hash function, and each hash function corresponds to one CRC generator polynomial.

As a data structure for hash operation of the present invention, further, a single hash function is implemented with a parallel CRC algorithm; the CRC computation is divided into multiple segments according to the length of the hash key value and the timing constraints of the hardware circuit, and is computed with a multi-stage pipeline.

Further, the present invention also provides a hash table storage method implemented on the above data structure for hash operation. The storage process is as follows: for the element to be stored, the hash value of each of the 2d hash functions is computed, a storage position in a hash sub-table is obtained from each hash value as a candidate address, and the element is placed in an unoccupied storage position among the 2d candidate addresses.

In the hash table storage method of the present invention, further, if none of the candidate addresses is free, then, provided the preset iteration condition is satisfied, a storage address is selected at random from the 2d candidate addresses, the original element at that address is kicked out, and the element to be stored is written there; the kicked-out element becomes the new element to be stored, and the storage process is executed again for it.

In the hash table storage method of the present invention, further, the preset iteration condition is that the number of iterations is less than the set maximum number of iterations.

Further, the present invention also provides a hash table query method implemented on the above data structure for hash operation. The query process is as follows: for the query element, the contents of the storage positions obtained from the two hash functions of each hash sub-table are read, and the contents read are matched against the query element to obtain the query result.

In the hash table query method of the present invention, further, each hash sub-table adopts a dual-port storage structure. If the two storage addresses computed from the hash key value by the two hash functions differ, the contents to be matched against the query element are read from the two addresses; otherwise, the content to be matched is read from the single address.

Further, the present invention also provides a computer device comprising a memory and a processor, on which a computer program capable of running on the processor is stored, which when executed by the processor implements the above method.

The invention has the beneficial effects that:

On the basis of the original cuckoo hash algorithm, each hash table corresponds to two hash functions instead of one, so more candidate positions must be read in each query. To preserve query efficiency, the hash tables are implemented in hardware with dual-port memory banks, which read the contents of two addresses of a bank simultaneously; key value comparison therefore completes within a deterministic time, and queries can be pipelined efficiently. More hash operations provide more candidate positions, which reduces the probability of hash collision, improves the efficiency of inserting, storing and updating hash table entries, supports dynamic changes in the number of entries, and avoids the space waste or performance degradation caused by entry insertion; the method therefore suits applications in which the hash table entries are not known in advance and change continuously. A CRC algorithm is used as the hash function, with a different CRC generator polynomial for each hash function, so the hash results have good uniqueness. In a concrete implementation the results can be computed with XOR operations in a parallel pipelined structure, which facilitates hardware design and improves storage space utilization.

Description of the drawings:

FIG. 1 is a diagram illustrating a data structure for hash operations according to an embodiment;

FIG. 2 is a schematic diagram of an insertion operation of the cuckoo hash algorithm in the embodiment;

FIG. 3 is a schematic diagram of an element insertion updating process of the hash table method in the embodiment;

FIG. 4 is a parallel pipeline illustration of multiple hash functions in an embodiment.

The specific implementation mode is as follows:

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and technical solutions.

To address collisions, efficiency and related problems in hash operation, and based on the practical application requirement of the electroencephalogram identity authentication technology, an embodiment of the present invention provides a data structure for hash operation, shown in Fig. 1. The data structure comprises d hash sub-tables, where d ≥ 2. Each hash sub-table corresponds to two hash functions, and each key value obtains its candidate addresses through the resulting 2d hash functions.

The table construction principle of cuckoo hashing is improved as follows: d (d ≥ 2) hash tables are built, each corresponding to two hash functions, and each key value computes 2d candidate addresses with the 2d hash functions. Because more hash functions are used to compute candidate addresses, the probability of an address collision during insertion into the hash table is greatly reduced.
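
The address computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`crc16`, `candidate_addresses`), the four generator polynomials, the 16-bit CRC width and the sub-table depth of 512 are not taken from the patent, which does not fix any of these values.

```python
# Illustrative sketch only: polynomials, CRC width and table depth are assumptions.

def crc16(data, poly):
    """Bitwise MSB-first CRC-16 with zero initial value and no final XOR."""
    reg = 0
    for byte in data:
        reg ^= byte << 8
        for _ in range(8):
            if reg & 0x8000:
                reg = ((reg << 1) ^ poly) & 0xFFFF
            else:
                reg = (reg << 1) & 0xFFFF
    return reg

D = 2        # number of hash sub-tables, d >= 2
DEPTH = 512  # entries per sub-table (assumed)
# One polynomial pair (for h_i1 and h_i2) per sub-table; values picked for illustration.
POLY_PAIRS = [(0x1021, 0x8005), (0x3D65, 0x8BB7)]

def candidate_addresses(key):
    """Return the 2d candidates of `key` as (sub_table_index, address) pairs."""
    return [(i, crc16(key, poly) % DEPTH)
            for i, pair in enumerate(POLY_PAIRS[:D])
            for poly in pair]

print(candidate_addresses(b"192.168.0.1:80"))
```

With d = 2 the function returns four candidates, two per sub-table, which is the set of positions an insertion or query would consider.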

As a data structure for hash operation in the embodiment of the present invention, further, each hash sub-table stores key values in a dual-port storage structure. The sub-tables are implemented in hardware with dual-port memory banks, which support reading the contents of two different addresses of a bank simultaneously, so that a query completes within a deterministic time. For the same storage space, the additional hash operations and the dual-port memory banks allow queries to complete within a deterministic time, greatly reduce the probability of hash collision during insertion, and, when a collision does occur, greatly shorten the time needed to resolve it.

As a data structure for hash operation in the embodiment of the present invention, further, a CRC algorithm is used as the hash function, and each hash function corresponds to one CRC generator polynomial. The CRC algorithm is easy to implement in hardware and gives results with good uniqueness, which makes it a convenient choice of hash function.

As a data structure for hash operation in the embodiment of the present invention, further, a single hash function is implemented with a parallel CRC algorithm; the CRC computation is divided into multiple segments according to the length of the hash key value and the timing constraints of the hardware circuit, and is computed with a multi-stage pipeline. The 2d hash functions are evaluated in a parallel pipeline, and the access address of each hash function can likewise be obtained by parallel XOR computation, which improves execution efficiency.

Further, an embodiment of the present invention provides a hash table storage method implemented on the data structure for hash operation. The storage process is as follows: for the element to be stored, the hash value of each of the 2d hash functions is computed, a storage position in a hash sub-table is obtained from each hash value as a candidate address, and the element is placed in an unoccupied storage position among the 2d candidate addresses. Further, if none of the candidate addresses is free, then, provided the preset iteration condition is satisfied, a storage address is selected at random from the 2d candidate addresses, the original element at that address is kicked out, and the element to be stored is written there; the kicked-out element becomes the new element to be stored, and the storage process is executed again for it. Further, the preset iteration condition is that the number of iterations is less than the set maximum number of iterations.

Further, an embodiment of the present invention provides a hash table query method implemented on the data structure for hash operation. The query process is as follows: for the query element, the contents of the storage positions obtained from the two hash functions of each hash sub-table are read, and the contents read are matched against the query element to obtain the query result. Further, each hash sub-table adopts a dual-port storage structure. If the two storage addresses computed from the hash key value by the two hash functions differ, the contents to be matched against the query element are read from the two addresses; otherwise, the content to be matched is read from the single address.

To retain the query efficiency of the traditional cuckoo hash algorithm, the invention implements each hash sub-table with a dual-port storage structure. In a key value query, under the new table construction rule, the key value may correspond to two storage positions in each hash sub-table. The dual-port storage structure can, within a single clock cycle, initiate read operations on one storage address or on two different storage addresses and read their contents. With a dual-port RAM implementation, if the two storage addresses computed from the hash key value by the two hash functions differ, the contents of both addresses are read for comparison; if the two addresses are identical, the content needs to be read from only one address for comparison. The dual-port RAM structure therefore ensures that a hash table lookup completes key value search and result matching within a deterministic time, preserving the advantage of the original algorithm.
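
The query path can be modelled in software roughly as below. The sketch is hypothetical: `DualPortRAM`, `make_hash` and `lookup` are illustrative names, the cycle counter merely mimics the single-cycle dual read, and `zlib.crc32` with distinct start values stands in for the per-function CRC polynomials. Hardware would read all d sub-tables at the same time, whereas this model visits them one after another.

```python
import zlib

DEPTH = 512  # assumed sub-table depth

def make_hash(seed):
    # Stand-in for one CRC-based hash function (assumption):
    # CRC-32 started from a distinct value per function.
    return lambda key: zlib.crc32(key, seed) % DEPTH

class DualPortRAM:
    """Toy model of one sub-table: two addresses can be read in the same cycle."""

    def __init__(self, depth):
        self.cells = [None] * depth
        self.cycles = 0

    def read(self, addr_a, addr_b):
        self.cycles += 1                 # one clock cycle in either case
        if addr_a == addr_b:             # identical addresses: one read suffices
            return [self.cells[addr_a]]
        return [self.cells[addr_a], self.cells[addr_b]]

def lookup(tables, hash_pairs, key):
    """Return the value stored with `key`, or None if no candidate slot matches."""
    for ram, (h1, h2) in zip(tables, hash_pairs):
        for entry in ram.read(h1(key), h2(key)):
            if entry is not None and entry[0] == key:
                return entry[1]
    return None

tables = [DualPortRAM(DEPTH) for _ in range(2)]
hash_pairs = [(make_hash(1), make_hash(2)), (make_hash(3), make_hash(4))]
key = b"flow-42"
tables[1].cells[hash_pairs[1][1](key)] = (key, "port 7")  # place the entry by hand
print(lookup(tables, hash_pairs, key))
```

Each sub-table is read exactly once per query whether the two addresses coincide or not, which is the property that keeps the lookup time fixed.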

The data structure of the cuckoo hash algorithm consists of d (d ≥ 2) hash sub-tables T_1, T_2, ..., T_d and d hash functions h_1, h_2, ..., h_d, where each hash sub-table T_i corresponds to one hash function h_i (1 ≤ i ≤ d). For an element x to be looked up or inserted, the hash value of each hash function is computed, and the storage positions p_1(x), p_2(x), ..., p_d(x) in the d hash sub-tables are obtained from the hash values. For a lookup, the d storage positions p_i(x) (1 ≤ i ≤ d) are read and their contents are compared with the query element to obtain the final search result. For an insertion, if at least one of the d candidate positions is empty, the element is inserted directly. If all d candidate positions are occupied by other elements, that is, a hash collision occurs, one of the d candidate positions is selected at random, the element y stored there is kicked out, and x is inserted in its place. Another candidate position must then be computed for the kicked-out element y: if that position is empty, y is inserted directly; otherwise the element occupying it is kicked out in turn and y is inserted. Kicking out and inserting are repeated until all elements have been placed in the table or the algorithm reaches the preset maximum number of iterations. If the maximum number of iterations is reached and some element still has no proper position, the insert/update is declared to have failed. The common remedy for an update failure is to select new hash functions and rebuild the hash table for all elements.
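
A minimal sketch of this traditional scheme, with one hash function per sub-table, might look as follows. The class name `CuckooTable`, the table sizes and the use of `zlib.crc32` with a per-table start value as h_i are assumptions made for illustration; rebuilding with new hash functions after a failure is left to the caller.

```python
import random
import zlib

class CuckooTable:
    """Traditional cuckoo hashing: d sub-tables, one hash function each."""

    def __init__(self, d=2, depth=8, max_iters=32, seed=0):
        self.d, self.depth, self.max_iters = d, depth, max_iters
        self.tables = [[None] * depth for _ in range(d)]
        self.rng = random.Random(seed)

    def _pos(self, i, key):
        # Stand-in for h_i (assumption): CRC-32 with a per-table start value.
        return zlib.crc32(key, i + 1) % self.depth

    def lookup(self, key):
        # Exactly d positions are probed, so the cost is fixed.
        return any(self.tables[i][self._pos(i, key)] == key
                   for i in range(self.d))

    def insert(self, key):
        """Return None on success, or the element left without a slot."""
        if self.lookup(key):
            return None
        for _ in range(self.max_iters):
            for i in range(self.d):
                p = self._pos(i, key)
                if self.tables[i][p] is None:      # free candidate: done
                    self.tables[i][p] = key
                    return None
            # All d candidates occupied: kick out a random occupant and
            # continue with the displaced element.
            i = self.rng.randrange(self.d)
            p = self._pos(i, key)
            key, self.tables[i][p] = self.tables[i][p], key
        return key  # insertion failed; a rebuild would normally follow

table = CuckooTable()
homeless = [r for r in (table.insert(b"k%d" % n) for n in range(10))
            if r is not None]
print(len(homeless), "element(s) could not be placed")
```

Returning the displaced element on failure keeps it from being silently lost, so the caller can decide whether to rebuild the table.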

Fig. 2 depicts an example of an insertion operation with two hash sub-tables. It shows the storage locations of the elements in the two hash tables; the arrows and their directions indicate the storage locations obtained by applying the hash functions to the current element. In the initial state, elements A, B, C, D, E and F are stored in the two hash tables as shown in Fig. 2(a). If a new element G is now to be inserted, its storage positions are computed with the two hash functions and are found to be occupied by elements A and D, so a hash collision occurs. Following the collision handling flow, element A is kicked out and the newly inserted element G is stored in A's original position. A new storage position must then be found for A, and so on: A kicks out F, F kicks out E, and E kicks out C, until C is moved to a suitable vacant position and the insertion completes. The hash table state after the successful insertion is shown in Fig. 2(b). Queries in the cuckoo hash algorithm are very simple, and a lookup is guaranteed to complete in a fixed time even in the worst case, which is the advantage of the algorithm. For insertions, however, once a collision occurs, elements in the hash table must be moved to suitable empty positions, and excessive relocation may make insertion inefficient or cause the insert/update to fail.
Compared with the implementation structure of the traditional cuckoo hash algorithm, each hash sub-table T_i (1 ≤ i ≤ d) in the invention corresponds not to a single hash function h_i but to two hash functions h_i1 and h_i2. The invention provides more candidate positions through more hash operations and, while preserving query efficiency, greatly reduces the probability of hash collision, which improves the efficiency of inserting hash table entries. The implementation structure is shown schematically in Fig. 1. For a query, the contents of the storage positions obtained from the two hash functions of each sub-table are read, and the results are compared and matched. For an insertion, the 2d hash functions are computed simultaneously for the newly inserted element or the kicked-out element each time; as long as the storage position given by any one hash function is not occupied by another element, the element can be moved to that free position. Referring to Fig. 3, more hash function operations provide more candidate locations for each insert/update operation, and each insertion may select a free storage address from the 2d candidate locations. If none of the 2d candidate locations is free, the element is inserted at an arbitrary one of them, the element originally stored there is kicked out, and the hash operations are performed again for the kicked-out element. Because more hash operations are likewise available to the kicked-out element, it too has a higher probability of finding a free storage position among its candidates, so the insert/update completes quickly during collision handling. Fig. 3 depicts the flow of the insert/update operation for a hash element.
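
The insert/update flow with 2d candidates can be sketched as below. This is a software illustration under assumptions, not the patented hardware: `DualHashCuckoo` is a hypothetical name, the hash functions h_i1 and h_i2 are replaced by `zlib.crc32` with distinct start values, and the sizes are arbitrary.

```python
import random
import zlib

class DualHashCuckoo:
    """d sub-tables with two hash functions each, giving 2d candidate slots."""

    def __init__(self, d=2, depth=8, max_iters=32, seed=0):
        self.d, self.depth, self.max_iters = d, depth, max_iters
        self.tables = [[None] * depth for _ in range(d)]
        self.rng = random.Random(seed)

    def candidates(self, key):
        # Stand-ins for h_i1 and h_i2 (assumption): CRC-32 with distinct start values.
        return [(i, zlib.crc32(key, 2 * i + j + 1) % self.depth)
                for i in range(self.d) for j in range(2)]

    def lookup(self, key):
        return any(self.tables[i][p] == key for i, p in self.candidates(key))

    def insert(self, key):
        """Return None on success, or the element left without a slot."""
        if self.lookup(key):
            return None
        for _ in range(self.max_iters):       # preset iteration condition
            cands = self.candidates(key)
            for i, p in cands:
                if self.tables[i][p] is None:  # any free candidate ends the insert
                    self.tables[i][p] = key
                    return None
            # No free slot among the 2d candidates: evict a random occupant
            # and repeat the storage process for the displaced element.
            i, p = self.rng.choice(cands)
            key, self.tables[i][p] = self.tables[i][p], key
        return key

table = DualHashCuckoo()
homeless = [r for r in (table.insert(b"k%d" % n) for n in range(12))
            if r is not None]
print(len(homeless), "element(s) could not be placed")
```

The only structural difference from traditional cuckoo hashing is the candidate list, which holds two addresses per sub-table instead of one.
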
More candidate locations reduce the probability of hash collisions, and even when a collision occurs, relatively few elements have to be moved to resolve it. To provide a hash function that is efficient to compute and convenient to implement, the embodiment of the invention adopts a CRC algorithm as the hash function, and the results of several hash functions can be computed in a parallel pipeline. The CRC (Cyclic Redundancy Check) algorithm is a well-known error detection algorithm in data storage and data communication. For the same hash element or key value, different CRC generator polynomials, such as those of CRC32 and CRC16, typically yield different CRC results, so computing the hash function with a CRC gives results with good uniqueness. The low-order bits of the CRC result are generally taken as the hash function result, according to the storage space of the hash sub-table. For example, if the storage depth of the hash sub-table is 512, the lower 9 bits of the CRC result can be selected as the hash function result. In the present invention, each hash function corresponds to one CRC generator polynomial. Parallel CRC implementations derived from the generator polynomial are a mature technique; the core idea is to obtain every bit of the CRC result in parallel through bitwise exclusive-OR (XOR) operations, which is simple and convenient.
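
The truncation step can be illustrated with a small bitwise CRC routine. The function name `crc`, the choice of a zero initial value with no final XOR, and the two polynomials (0x1021 for a 16-bit CRC, 0x04C11DB7 for a 32-bit CRC) are assumptions for the example; the patent does not specify CRC parameters.

```python
def crc(data, poly, width):
    """Bitwise MSB-first CRC, zero initial value, no final XOR (assumed parameters)."""
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = 0
    for byte in data:
        reg ^= byte << (width - 8)
        for _ in range(8):
            if reg & top:
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    return reg

DEPTH = 512                                 # sub-table depth from the example above
ADDR_MASK = DEPTH - 1                       # 0x1FF keeps the lower 9 bits

key = b"123456789"
crc_a = crc(key, 0x1021, 16)                # 16-bit generator polynomial
crc_b = crc(key, 0x04C11DB7, 32)            # 32-bit generator polynomial
addr_a = crc_a & ADDR_MASK
addr_b = crc_b & ADDR_MASK
print(hex(crc_a), addr_a, hex(crc_b), addr_b)
```

With these parameters the 16-bit result for the test string is 0x31C3, whose lower 9 bits give address 451.
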
Therefore, in the invention, a single hash function operation can be implemented with a parallel CRC algorithm. Taking into account the length of the hash key value and the timing requirements of the hardware circuit, the CRC computation can be divided into several segments and computed in a pipeline; furthermore, several hash functions may be computed in parallel. Fig. 4 shows the parallel pipelined implementation of multiple hash operations based on the CRC algorithm, illustrated with a 5-stage pipeline: the access address of each hash function is obtained by parallel XOR computation, and the hash functions are evaluated in a parallel pipeline. The scheme is easy to implement in hardware and gives good uniqueness.
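
Both ideas, the XOR-only parallel form and the segmented pipeline, can be checked against a serial CRC in a few lines. The sketch assumes a 16-bit CRC with zero initial value and no final XOR, a fixed 10-byte key and 5 stages; the function names are illustrative. Under those assumptions the CRC is linear over GF(2), so the result is the XOR of precomputed per-bit contributions, which hardware would evaluate as XOR trees rather than in a loop.

```python
def crc16(data, poly=0x1021, reg=0):
    """Serial reference CRC-16; `reg` carries the state in from a previous segment."""
    for byte in data:
        reg ^= byte << 8
        for _ in range(8):
            if reg & 0x8000:
                reg = ((reg << 1) ^ poly) & 0xFFFF
            else:
                reg = (reg << 1) & 0xFFFF
    return reg

def xor_columns(n_bytes, poly=0x1021):
    """Column k is the CRC of the key whose only set bit is bit k."""
    return [crc16((1 << k).to_bytes(n_bytes, "big"), poly)
            for k in range(8 * n_bytes)]

def crc16_parallel(key, cols):
    """XOR together the columns of all set key bits (loop emulates XOR trees)."""
    value = int.from_bytes(key, "big")
    out = 0
    for k, col in enumerate(cols):
        if (value >> k) & 1:
            out ^= col
    return out

def crc16_pipelined(key, stages=5, poly=0x1021):
    """Process the key in `stages` segments, passing the register between stages."""
    seg = -(-len(key) // stages)  # ceiling division
    reg = 0
    for s in range(stages):
        reg = crc16(key[s * seg:(s + 1) * seg], poly, reg)
    return reg

key = b"0123456789"  # fixed 10-byte key (assumed length)
cols = xor_columns(len(key))
print(crc16(key) == crc16_parallel(key, cols) == crc16_pipelined(key))
```

The three routines agree for any key of the assumed length, which is why a hardware design is free to choose the segment count to meet its timing budget.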

Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.

Based on the foregoing method, an embodiment of the present invention further provides a server, including: one or more processors; and a storage device storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method described above.

Based on the above method, an embodiment of the present invention further provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the above method.

The device provided by the embodiment of the present invention has the same implementation principle and technical effect as the method embodiment; for brevity, where the device embodiment does not mention a point, reference may be made to the corresponding content in the method embodiment.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative. For example, the division of the units is only one logical division, and other divisions are possible in an actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data structure for hash operations, comprising d hash sub-tables, where d ≥ 2, wherein each hash sub-table corresponds to two hash functions, and each key value obtains its candidate addresses through the resulting 2d hash functions.

2. The data structure for hash operations of claim 1, wherein each hash sub-table employs a dual port storage structure for key value storage.

3. The data structure for hash operations of claim 1, wherein a CRC algorithm is employed as the hash function, each hash function corresponding to one CRC generator polynomial.

4. The data structure for hash operations of claim 3, wherein a single hash function is implemented using a parallel CRC algorithm, and the CRC computation is divided into multiple segments according to the length of the hash key value and the timing constraints of the hardware circuit and is computed with a multi-stage pipeline.

5. A hash table storage method, implemented based on the data structure for hash operations of claim 1, wherein the storage process comprises: for the element to be stored, computing the hash value of each of the 2d hash functions, obtaining from each hash value a hash sub-table storage position as a candidate address, and placing the element to be stored in an unoccupied storage position among the 2d candidate addresses.

6. The hash table storage method according to claim 5, wherein if none of the candidate addresses is free, then, provided a preset iteration condition is satisfied, a storage address is selected at random from the 2d candidate addresses, the original element at that address is kicked out, the element to be stored is written there, and the kicked-out element becomes the new element to be stored, for which the storage process is executed again.

7. The hash table storage method according to claim 6, wherein the preset iteration condition is that the number of iterations is less than a set maximum number of iterations.

8. A hash table query method, implemented based on the data structure for hash operations of claim 1, wherein the query process comprises: for the query element, reading the contents of the storage positions obtained from the two hash functions of each hash sub-table, and matching the contents read against the query element to obtain a query result.

9. The hash table query method according to claim 8, wherein each hash sub-table adopts a dual-port storage structure; if the two storage addresses computed from the hash key value by the two hash functions differ, the contents to be matched against the query element are read from the two addresses; otherwise, the content to be matched is read from the single address.

10. A computer device comprising a memory and a processor, a computer program being stored on the memory and being executable on the processor, wherein the processor implements the method of any one of claims 5 to 9 when executing the program.