CN108196662B - Hash partitioning accelerator - Google Patents

Hash partitioning accelerator

Info

Publication number: CN108196662B
Authority: CN (China)
Prior art keywords: hash, accelerator, unit, histogram, partitioning
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN201711469302.3A
Other languages: Chinese (zh)
Other versions: CN108196662A (en)
Inventors: 吴林阳 (Wu Linyang), 郭雪婷 (Guo Xueting), 陈云霁 (Chen Yunji)
Current Assignee: Institute of Computing Technology of CAS (the listed assignees may be inaccurate)
Original Assignee: Institute of Computing Technology of CAS
Priority date: 2017-12-28 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2017-12-28
Application filed by Institute of Computing Technology of CAS
Priority to CN201711469302.3A
Publication of CN108196662A: 2018-06-22
Application granted; publication of CN108196662B: 2021-03-30
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26: Power supply means, e.g. regulation thereof
    • G06F1/32: Means for saving power
    • G06F1/3203: Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234: Power saving characterised by the action undertaken
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5077: Logical partitioning of resources; Management or configuration of virtualized resources
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A hash partitioning accelerator configured to be integrated on a memory for accelerating the partitioning stage of hash joins, the hash partitioning accelerator comprising: a hash unit for reading a plurality of tuples of the partitioned relation table from the memory and processing the tuples' keys in parallel to generate a plurality of hash indexes; a histogram unit for updating in parallel, according to the plurality of hash indexes, a plurality of copies of the histogram data stored in the histogram unit, and integrating the updated copies into a data-consistent histogram table; and a shuffle unit for determining, according to the plurality of hash indexes, the position at which each tuple is stored in the target array, and copying the tuples of the relation table to the target array, thereby partitioning the relation table. Through memory-side acceleration, the accelerator of the present disclosure significantly reduces the total energy consumption of the partitioning stage.

Description

Hash partitioning accelerator
Technical Field
The present disclosure relates to the field of computer systems, and in particular to a hash partitioning accelerator.
Background
Power consumption is a first-order consideration in the design of modern computer systems. To improve energy efficiency, hardware accelerators such as field-programmable gate arrays (FPGAs), graphics processing units (GPUs), and custom accelerators have been widely adopted in industry. With the advent of near-data processing, integrating hardware accelerators into dynamic random access memory (DRAM) stacks to reduce the cost of data movement has become a new system design option. The basic idea is to vertically integrate a logic die containing accelerators and multiple DRAM dies into one chip using 3D stacking technology. However, due to the area, power, heat dissipation, and manufacturing constraints of 3D stacked DRAM, the number and types of accelerators that can be integrated into DRAM are limited. Thus, given a target application to accelerate, it is critical to determine which portions of it are best suited for acceleration in DRAM.
Disclosure of Invention
In view of the above, it is an object of the present disclosure to provide a hash partitioning accelerator to solve at least some of the above technical problems.
The present disclosure provides a hash partitioning accelerator configured to be integrated on a memory for accelerating the partitioning phase of hash joins, the hash partitioning accelerator comprising:
a hash unit for reading a plurality of tuples of the partitioned relation table from the memory and processing the tuples' keys in parallel to generate a plurality of hash indexes;
a histogram unit for updating in parallel, according to the plurality of hash indexes, a plurality of copies of the histogram data stored in the histogram unit, and integrating the updated copies into a data-consistent histogram table;
and a shuffle unit (Shuffle unit) for determining, according to the plurality of hash indexes, the position at which each tuple is stored in the target array, and copying the tuples of the relation table to the target array, thereby partitioning the relation table.
In a further embodiment, the memory is a 3D stacked DRAM, and the hash partitioning accelerator is configured to be integrated onto the logic layer of the 3D stacked DRAM.
In a further embodiment, the hash unit includes a plurality of parallel processing units, each for processing the key of one tuple to generate the hash index corresponding to that tuple.
In a further embodiment, the hash unit further comprises a plurality of multiplexers, equal in number to the parallel processing units, each connected at the output of a parallel processing unit and used to route the hash index to either the histogram unit or the shuffle unit.
In a further embodiment, the histogram unit includes a parallel increment unit for updating the respective copies of the histogram data in parallel according to the plurality of hash indexes.
In a further embodiment, the histogram unit includes a plurality of first local memories for storing the respective copies of the histogram data before and after updating; preferably, there are 16 copies and, correspondingly, 16 first local memories.
In a further embodiment, the histogram unit further comprises a reduction unit for integrating the updated tables from the respective first local memories into a single data-consistent table.
In a further embodiment, the shuffle unit comprises: a plurality of parallel address-reading subunits, each reading a target address from the target address array according to a hash index; a conflict-handling (DECONF) subunit which, given the plurality of target addresses, generates for each conflicting target address an offset relative to the original target address and also generates a count of occurrences of that target address; a SCATTER subunit for moving each tuple to its correct location according to the offset and the original target address, the tuple being moved to the target address without offset if there is no conflict; and an UPDATE subunit which updates the target addresses according to the count values.
In a further embodiment, the conflict-handling subunit comprises a multiplexed XNOR network whose inputs are the possibly conflicting target addresses and whose outputs are the offset and count value of each target address.
In a further embodiment, each of the address reading subunits includes a second local memory for storing a target address.
In a further embodiment, a programming interface is included, through which the histogram unit and the shuffle unit can be operated externally.
By integrating the hash partitioning accelerator into memory, the power consumption of the partitioning stage is reduced in several complementary ways, lowering overall power consumption.
First, integrating the accelerator into the DRAM avoids costly round-trip data transfers between the CPU and the DRAM, directly reducing the energy spent on data movement.
Second, integrating the accelerator into the DRAM reduces the many pipeline stalls caused by slow memory accesses; moving computation closer to the data thus also cuts the power consumed by stalled pipelines.
Third, the custom hash partitioning logic itself significantly reduces computational power consumption.
Drawings
Fig. 1 is a schematic diagram of a location of a hash partitioning accelerator on a 3D stacked DRAM according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a hash partition accelerator architecture according to an embodiment of the disclosure.
Fig. 3 is a schematic diagram of an XNOR network in a DECONF subunit according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a 3D stacked DRAM configuration according to an embodiment of the present disclosure.
FIG. 5 is a schematic diagram of the energy-delay product of the shuffle operation under different designs according to embodiments of the present disclosure.
FIG. 6 is a schematic diagram of a hybrid acceleration system of an embodiment of the present disclosure.
FIG. 7 is a schematic diagram comparing a hash partitioning accelerator according to an embodiment of the present disclosure against Intel Haswell and Xeon Phi processors.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings. The attached drawings are simplified and serve as illustrations; the number, shape and size of the components shown may differ in an actual implementation, and the arrangement of components may be more complex. The present disclosure may be practiced or applied in other ways, and various changes and modifications may be made without departing from its spirit and scope.
According to the basic concept of the present disclosure, a hash partitioning accelerator and a hybrid acceleration system for hash joins are proposed, dividing the acceleration tasks appropriately between the CPU and the DRAM. The system comprises a custom memory-side accelerator (the hash partitioning accelerator) and a host processor (for example, one with a SIMD accelerator), and it improves the overall energy efficiency of hash joins.
The hash partitioning accelerator and hybrid acceleration system described above are based on the following studies by the inventors. A detailed performance and energy analysis was performed on a hash join algorithm (an optimized version of the parallel radix join algorithm, PRO) that is specifically optimized for modern multi-core systems. The algorithm comprises three main execution stages: partition, build and probe, and the partition stage can be further divided into four sub-phases: local histogram, prefix sum, output addressing and data shuffling. The analysis shows that hash join is inherently memory-bound: more than 50% of the total energy goes to data movement and pipeline stalls in the partitioning stage, which memory-side acceleration can significantly reduce. Only about 15% of the energy goes to data movement and pipeline stalls in the build and probe stages, which existing CPU-side accelerators (e.g., SIMD units) can still accelerate. Within the partition stage, the histogram and data-shuffling sub-phases account for more than 99% of the total execution time, because both involve costly irregular memory accesses. The present disclosure therefore applies near-data processing primarily to these two sub-phases. Fig. 1 is a schematic diagram of the location of a hash partitioning accelerator in a memory (e.g., 3D stacked DRAM) according to an embodiment of the present disclosure. Referring to fig. 2, according to an aspect of the embodiments of the present disclosure, there is provided a hash partitioning accelerator configured to be integrated on a memory and used for accelerating the partitioning phase of hash joins (a software sketch of the partitioning flow is given after the component list below), the hash partitioning accelerator including:
the hash unit (a), for reading a plurality of tuples of the partitioned relation table from the memory and processing the tuples' keys in parallel to generate a plurality of hash indexes;
the histogram unit (b), for updating in parallel, according to the plurality of hash indexes, a plurality of copies of the histogram data stored in the histogram unit, and integrating the updated copies into a data-consistent table;
and the shuffle unit (c), for determining, according to the plurality of hash indexes, the position at which each tuple is stored in the target address array, and copying the tuples of the relation table to the target array, thereby partitioning the relation table.
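For orientation, the following is a minimal, single-threaded software sketch of the radix-partitioning flow these units accelerate (histogram, prefix sum, output addressing, data shuffling). The function names, tuple layout and C++ realization are illustrative assumptions, not the accelerator's interface:

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Radix hash: select `mask` bits of the key after shifting, as in the
// shift/mask scheme described for the hash unit.
static inline uint64_t hash_index(uint64_t key, int shift, uint64_t mask) {
    return (key >> shift) & mask;
}

// Partition `rel` into (mask + 1) partitions inside `out`.
void radix_partition(const std::vector<Tuple>& rel, std::vector<Tuple>& out,
                     int bits) {
    const uint64_t fanout = uint64_t{1} << bits;
    const uint64_t mask = fanout - 1;

    // 1) Histogram: count tuples per partition.
    std::vector<uint64_t> hist(fanout, 0);
    for (const Tuple& t : rel) hist[hash_index(t.key, 0, mask)]++;

    // 2) Prefix sum: turn counts into starting output addresses.
    std::vector<uint64_t> dst(fanout, 0);
    for (uint64_t i = 1; i < fanout; ++i) dst[i] = dst[i - 1] + hist[i - 1];

    // 3) Data shuffling: copy each tuple to its slot, advancing the
    //    partition's write cursor (the "target address") as we go.
    out.resize(rel.size());
    for (const Tuple& t : rel) out[dst[hash_index(t.key, 0, mask)]++] = t;
}
```

The accelerator parallelizes exactly these loops: the hash unit computes hash_index for many tuples at once, the histogram unit parallelizes step 1, and the shuffle unit parallelizes step 3.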
In some embodiments, the memory is a 3D stacked DRAM, and the hash partitioning accelerator is configured to be integrated onto the logic layer of the 3D stacked DRAM. Vertical 3D die stacking allows multiple memory dies to be stacked directly on a processor die, increasing memory bandwidth; the dies are connected by short, fast, dense through-silicon vias (TSVs), providing very high internal bandwidth.
The hash unit (a) reads a plurality of tuples of the partitioned relation table from the memory and processes their keys in parallel to generate a plurality of hash indexes. In this stage, the hash unit (a) reads the tuples from memory in a streaming-access manner and then derives each hash index from the key with shift and mask operations.
In some embodiments, the hash unit (a) includes a plurality of parallel processing units, each processing the key of one tuple to generate that tuple's hash index. Such parallel processing, realized by memory-side acceleration, improves the efficiency of the hash join.
In some embodiments, the hash unit further includes a plurality of multiplexers, equal in number to the parallel processing units, connected at the output of each parallel processing unit and used to route each hash index to either the histogram unit or the shuffle unit. Since the hash unit (a) is shared by the histogram unit (b) and the shuffle unit (c), a multiplexer 11 (MUX) is added to decide the output target of each hash index.
The histogram unit (b) scans the relation table in blocks and builds a histogram array. It updates the histogram data in parallel according to the hash index values and operates in two stages: a parallel increment (INC) stage and a final reduction (RED) stage. In the parallel stage, each hash index generated by the hash unit (a) is used to update the current histogram value in the corresponding copy. After all keys read from memory have been processed, the final reduction stage integrates all the local memories (LMs) to obtain a complete, data-consistent histogram table.
In some embodiments, the histogram unit includes a parallel increment unit 21 for updating the copies of the histogram data in parallel according to the plurality of hash indexes. Further, the histogram unit (b) may include a plurality of first local memories 22 for storing the copies of the histogram data before and after updating; preferably, there are 16 copies and, correspondingly, 16 first local memories 22.
In some embodiments, the histogram unit further comprises a reduction unit 23 for integrating the updated tables from the respective first local memories 22 into a single data-consistent table. The reduction unit 23 sits behind the parallel increment unit 21 and the first local memories 22, and integrates the data of all first local memories 22 once all keys have been processed, yielding a complete, data-consistent histogram table.
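A minimal software sketch of this two-stage scheme, assuming 16 lanes with one private histogram copy each (all names are illustrative):

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kLanes = 16;  // parallel increment lanes / local copies

// INC stage: each lane increments only its own copy, so no two lanes
// ever write the same counter and no locking is needed.
// RED stage: sum the 16 copies into one data-consistent histogram.
std::vector<uint64_t> histogram(const std::vector<uint64_t>& hash_idx,
                                uint64_t fanout) {
    std::array<std::vector<uint64_t>, kLanes> copies;
    for (auto& c : copies) c.assign(fanout, 0);

    // INC: indexes are dealt round-robin to lanes, mimicking the
    // parallel processing paths.
    for (size_t i = 0; i < hash_idx.size(); ++i)
        copies[i % kLanes][hash_idx[i]]++;

    // RED: reduce all local copies into the final histogram.
    std::vector<uint64_t> hist(fanout, 0);
    for (const auto& c : copies)
        for (uint64_t b = 0; b < fanout; ++b) hist[b] += c[b];
    return hist;
}
```

Because each lane owns its copy, the INC stage needs no synchronization; consistency is restored once, in the RED stage.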
The shuffle unit (c) writes multiple tuples of the partitioned relation table to their target addresses in the target array simultaneously and then updates the target addresses in the target array. If tuples in multiple processing paths share the same target address, the resulting target-address conflicts must be handled.
In some implementations, the shuffle unit (c) may include a plurality of parallel address-reading subunits 31, each reading a target address from the target address array according to a hash index; a conflict-handling (DECONF) subunit 32 which, given the plurality of target addresses, generates for each conflicting target address an offset relative to the original target address and also generates a count of occurrences of that target address; a SCATTER subunit 33 for moving each tuple to its correct location according to the offset and the original target address, the tuple being moved to the target address without offset if there is no conflict; and an UPDATE subunit 34 which updates the target addresses according to the count values.
In some embodiments, the conflict-handling subunit 32 comprises a multiplexed XNOR network whose inputs are the possibly conflicting target addresses and whose outputs are the offset and count value of each target address. Fig. 3 shows an example of such a network: the values d0, d1, d2 and d3 of four target addresses are read from the dst array in parallel. To compute the number of matches for d0, count(d0), d0 is first XNORed with d1, d2 and d3, and the XNOR results are summed. Similarly, count(d1) is computed by adding xnor(d1, d0), xnor(d1, d2) and xnor(d1, d3). The target-address offsets are obtained by reusing the same XNOR network; for example, offset(d1) is the sum of xnor(d1, d2) and xnor(d1, d3).
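The following behavioral sketch mirrors the four-lane d0..d3 example, with pairwise equality playing the role of XNOR-plus-summation. The convention that offset(di) counts matches in later lanes follows the offset(d1) example above; the exact hardware convention is otherwise an assumption:

```cpp
#include <array>
#include <cstdint>

constexpr int kWays = 4;  // four parallel lanes, as in the d0..d3 example

struct Deconf { int count; int offset; };

// Pairwise-equality ("XNOR") network over the lane target addresses.
// count(di):  number of OTHER lanes whose address equals di.
// offset(di): number of LATER lanes (j > i) whose address equals di,
//             matching offset(d1) = xnor(d1, d2) + xnor(d1, d3) above.
std::array<Deconf, kWays> deconf(const std::array<uint64_t, kWays>& d) {
    std::array<Deconf, kWays> out{};
    for (int i = 0; i < kWays; ++i)
        for (int j = 0; j < kWays; ++j) {
            if (j == i) continue;
            if (d[j] == d[i]) {            // xnor(di, dj) == 1
                out[i].count++;
                if (j > i) out[i].offset++;
            }
        }
    return out;
}
```

Within a group of lanes sharing one address this yields distinct offsets (k-1, ..., 1, 0), so the group's tuples land in consecutive, non-overlapping slots.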
The four subunits of the shuffle unit (c) correspond to four pipeline stages, which may proceed as follows (a behavioral sketch of the whole pipeline is given after this list):
in the first stage, the address-reading subunits 31 read a plurality of target addresses from the target array in parallel according to the hash indexes; each address-reading subunit 31 includes a second local memory for storing a target address.
In the second stage, the conflict processing subunit 32 detects a conflicting target address among the paths, generates an offset based on the original target address, and also generates a count value of the same target address to update the target array. In the third stage, the scatter subunit 33 moves the tuple to the correct position, which is calculated according to the offset of the second stage and the original target address stored in the target array in the first stage;
in the fourth stage, the update subunit 34 updates the target addresses according to the count values generated in the DECONF stage.
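Putting the four stages together, a minimal behavioral sketch of one shuffle iteration over four lanes, reusing Tuple, hash_index and deconf from the sketches above (the cursor-update convention is an assumption, as noted there):

```cpp
#include <array>
#include <cstdint>
#include <vector>
// Assumes Tuple, hash_index, kWays, Deconf and deconf as sketched above.

// One shuffle step: process kWays tuples against the shared target
// address array `dst` (one write cursor per partition). `out` must be
// pre-sized to hold the whole relation.
void shuffle_step(const std::array<Tuple, kWays>& tuples,
                  std::vector<uint64_t>& dst, std::vector<Tuple>& out,
                  uint64_t mask) {
    std::array<uint64_t, kWays> part, addr;
    // Stage 1: read the target addresses in parallel.
    for (int i = 0; i < kWays; ++i) {
        part[i] = hash_index(tuples[i].key, 0, mask);
        addr[i] = dst[part[i]];
    }
    // Stage 2: DECONF resolves lanes sharing a target address. Since
    // partitions occupy disjoint output regions, equal addresses imply
    // the same partition.
    std::array<Deconf, kWays> dc = deconf(addr);
    // Stage 3: SCATTER each tuple to addr + offset (offset is 0 when
    // the lane has no conflict).
    for (int i = 0; i < kWays; ++i)
        out[addr[i] + dc[i].offset] = tuples[i];
    // Stage 4: UPDATE. The first lane of each conflicting group bumps
    // the cursor by the group size (count excludes the lane itself).
    for (int i = 0; i < kWays; ++i) {
        bool first = true;
        for (int j = 0; j < i; ++j)
            if (addr[j] == addr[i]) { first = false; break; }
        if (first) dst[part[i]] = addr[i] + dc[i].count + 1;
    }
}
```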
In some implementations, the hash partitioning accelerator also includes a programming interface through which the histogram unit and the shuffle unit can be operated externally. The programming interface may be built on programming libraries known in the art; the library functions involved include a memory-management library and an accelerator-control library, the latter extended to control the operation of the histogram unit and the shuffle unit. One skilled in the art can thus readily use the hash partitioning accelerator in a program. The programming interface of this embodiment is essentially a sequential programming model, so it reduces the programming burden of a heterogeneous system.
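Purely as an illustration of that sequential model, a host-side usage sketch follows. Every identifier here (hpa_alloc, hpa_free, hpa_histogram, hpa_shuffle) is a hypothetical stand-in, since the disclosure does not name its library calls:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical entry points supplied by the memory-management and
// accelerator-control libraries (declarations only; not a real API).
void* hpa_alloc(size_t bytes);                  // allocate in 3D DRAM
void  hpa_free(void* p);
void  hpa_histogram(const void* rel, size_t n,  // drive the histogram unit
                    uint64_t* hist, int radix_bits);
void  hpa_shuffle(const void* rel, size_t n,    // drive the shuffle unit
                  void* out, uint64_t* dst, int radix_bits);

void partition_on_hpa(const void* rel, size_t n, int radix_bits) {
    const size_t fanout = size_t{1} << radix_bits;
    auto* hist = static_cast<uint64_t*>(hpa_alloc(fanout * sizeof(uint64_t)));
    auto* dst  = static_cast<uint64_t*>(hpa_alloc(fanout * sizeof(uint64_t)));
    void* out  = hpa_alloc(n * 16);             // assumes 16-byte tuples

    hpa_histogram(rel, n, hist, radix_bits);    // calls issue sequentially,
    dst[0] = 0;                                 // like ordinary library calls
    for (size_t i = 1; i < fanout; ++i)         // prefix sum on the host
        dst[i] = dst[i - 1] + hist[i - 1];
    hpa_shuffle(rel, n, out, dst, radix_bits);

    hpa_free(hist); hpa_free(dst); hpa_free(out);
}
```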
With continued reference to fig. 1 and fig. 2, an embodiment of the present invention further provides a combined device comprising a hash partitioning accelerator and a memory, including:
the memory, which comprises a data storage area and a logic area;
and the hash partitioning accelerator, which is integrated on the logic area of the memory and accelerates the processing of the hash join partitioning stage.
The hash partitioning accelerator itself may be designed as in the above embodiments, and the details are not repeated here. This section mainly introduces the configuration relationship between the accelerator and the memory, which is designed to improve overall processing efficiency and reduce overall power consumption.
In some embodiments, the memory may be a 3D stacked DRAM, and the hash partitioning accelerator is integrated on the logic layer of the 3D stacked DRAM. Multiple such accelerators may be integrated per 3D DRAM to accelerate the partition phase. Each 3D stacked DRAM contains one or more vaults, each accessed through a vault controller in the logic layer; a hash partitioning accelerator can be attached to the vault controller using standard semiconductor processes. Integrating the accelerator in the DRAM reduces the power consumption of the partition phase on all fronts (data movement, pipeline stalls, and computation), and thus reduces overall power consumption.
As shown in fig. 2, each hash partitioning accelerator mainly comprises a hash unit, a histogram unit and a shuffle unit, which access the DRAM layers through the vault controller (the vault controller reaches the upper DRAM layers through TSVs) and are connected to the switching circuitry of the logic layer. The logic layer contains the vault control circuitry; the hash unit, histogram unit and shuffle unit are all electrically connected to it and access the DRAM layers through it.
In some embodiments, the degree of parallelism of the histogram unit may be 16. The analysis is as follows:
in the design process of the hash partitioning accelerator, there are various design choices such as parallelism and frequency, and 3D stacked DRAMs have different configurations. Fig. 4 lists three possible 3D stacked DRAM configurations, namely high configuration (HI), medium configuration (MD) and low configuration (LO). The internal bandwidth ranges between 860GB/s to 360 GB/s. The design space consisting of parallelism (1 to 512), operating frequency (0.4GHz to 2.0GHz) and DRAM configuration for each vault was investigated to find a balance between performance and power consumption.
For the histogram unit, using the energy-delay product (EDP) as the criterion, the EDP reaches its minimum at a parallelism of around 32 in all DRAM configurations; for example, for the HI configuration at 1.2 GHz, the optimal EDP requires a parallelism of 64. In general, the MD and LO DRAM configurations are more energy-efficient (per vault) than the HI configuration, especially at a larger parallelism such as 64. Fig. 5 plots the energy-delay product (EDP) and area of the shuffle operation per vault under different designs. The parallelism achieving the best EDP lies between 32 and 128; for example, at 2 GHz the EDP at parallelism 32 is 1.77 times the EDP at parallelism 512. Across all these configurations, the best EDP is achieved at a parallelism of 64, a frequency of 1.2 GHz and the LO DRAM configuration; under this optimal configuration, the area per vault is 0.18 mm².
Combining the histogram and shuffle operations, for an input of size 128M the optimal design decision for the HPA is: parallelism = 16, frequency = 2.0 GHz, DRAM configuration = HI. Moreover, the corresponding area is only 1.78 mm², and the power consumption of the accelerator (excluding DRAM) is only 7.52 W.
As shown in fig. 6, according to another aspect of the embodiments of the present invention, there is further provided a hybrid hash acceleration system, comprising:
a hash partitioning accelerator configured to be integrated on a memory for accelerating the partitioning phase of hash joins, the hash partitioning accelerator comprising:
a hash unit for reading a plurality of tuples of the partitioned relation table from the memory and processing the tuples' keys in parallel to generate a plurality of hash indexes;
a histogram unit for updating in parallel, according to the plurality of hash indexes, a plurality of copies of the histogram data stored in the histogram unit, and integrating the updated copies into a data-consistent histogram table;
a shuffle unit for determining, according to the hash indexes, the position at which each tuple is stored in the target address array, and copying the tuples of the relation table to the target array, thereby partitioning the relation table;
and a host processor for processing the build and probe stages of the hash join.
The host processor may communicate with the 3D DRAM through a high-speed link (as used in conventional HMC systems) or an interposer. The host processor may itself be enhanced with an accelerator, improving the performance of most applications; in the hybrid acceleration system, the host processor focuses primarily on accelerating the build and probe phases.
In a further embodiment, the hybrid hash acceleration system further comprises a bus or interposer through which the hash partitioning accelerator and the host processor communicate with the memory.
In a further embodiment, the memory is a 3D stacked DRAM, and the hash partitioning accelerator is configured to be integrated onto the logic layer of the 3D stacked DRAM.
The arrangement of the hash partitioning accelerator on the 3D stacked DRAM and the internal design of the accelerator may follow the above embodiments and are not repeated here. By pairing the host processor with the hash partitioning accelerator, the system as a whole improves the overall energy efficiency of hash joins.
In some embodiments, the host processor includes a SIMD unit, a custom accelerator, a GPU or the like for processing the build and probe phases of the hash join; preferably, the host processor comprises a SIMD unit.
An embodiment of the invention also provides a method of performing a hash join with the above system, comprising:
performing a partitioning operation, comprising: reading a plurality of tuples of the partitioned relation table from memory with the hash unit and processing the tuples' keys in parallel to generate a plurality of hash indexes; updating in parallel, with the histogram unit, a plurality of copies of the histogram data stored therein according to the plurality of hash indexes, and integrating the updated copies into a data-consistent table; and determining, with the shuffle unit, the position at which each tuple is stored in the target address array according to the plurality of hash indexes, and copying the tuples of the relation table to the target array, thereby partitioning the relation table;
performing a build operation, in which the host processor builds a hash table in memory from the smaller relation table;
and performing a probe operation, in which the host processor probes the hash table with the larger relation table to complete the join (the build and probe phases are sketched below).
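A minimal host-side sketch of the build and probe phases, using an illustrative in-memory hash table (the disclosure does not prescribe this data structure):

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Build: hash the smaller relation R into a table keyed by join key.
// Probe: stream the larger relation S and look up each key; every
// (r, s) pair with matching keys joins.
std::vector<std::pair<Tuple, Tuple>>
hash_join(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    std::unordered_multimap<uint64_t, Tuple> table;  // build phase
    table.reserve(R.size());
    for (const Tuple& r : R) table.emplace(r.key, r);

    std::vector<std::pair<Tuple, Tuple>> result;     // probe phase
    for (const Tuple& s : S) {
        auto [lo, hi] = table.equal_range(s.key);
        for (auto it = lo; it != hi; ++it)
            result.push_back({it->second, s});
    }
    return result;
}
```

In the hybrid system, both relations would first be co-partitioned by the memory-side accelerator so that each partition pair fits in cache before build and probe run on the host.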
As shown in fig. 7, the performance, energy efficiency and EDP of the HPA are compared with those of reference platforms. On average, the hash partitioning accelerator improves performance, energy efficiency and EDP by 30, 90 and 2725 times, respectively, compared to an Intel Haswell processor; compared to a Xeon Phi processor, the EDP improvement even exceeds 6000 times. These significant gains in performance and energy efficiency come mainly from the custom accelerators and the high bandwidth provided by the 3D stacked DRAM.
The hybrid acceleration system of embodiments of the present disclosure requires no additional chip area, because the hash partitioning accelerator fits readily into the logic layer of existing 3D stacked DRAMs. In terms of power, the proposed system adds only 7.52 W while achieving 6.70 times the performance and 47.52 times the energy efficiency of Haswell.
The above embodiments further explain the objects, aspects and advantages of the present disclosure in detail. It should be understood that the above are only specific embodiments of the present disclosure and do not limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (9)

1. A hash partitioning accelerator configured to be integrated on a memory, the memory being a 3D stacked DRAM and the hash partitioning accelerator being configured to be integrated on a logic layer of the 3D stacked DRAM,
the accelerator being for accelerating the processing of a hash join partitioning phase, the hash partitioning accelerator comprising:
the hash unit is used for reading a plurality of tuples of the partitioned relation table from the memory and then processing the tuples' keys in parallel to generate a plurality of hash indexes;
the histogram unit is used for updating in parallel, according to the plurality of hash indexes, a plurality of copies of the histogram data stored in the histogram unit, and integrating the updated copies into a data-consistent histogram table;
the shuffle unit is used for determining, according to the plurality of hash indexes, the position at which each tuple is stored in the target array, and copying the tuples of the relation table to the target array, thereby partitioning the relation table;
the hash unit comprises a plurality of parallel processing units and an equal number of selectors; the parallel processing units are respectively used for processing the keys of the tuples to generate the hash index corresponding to each tuple, and the selectors are connected at the outputs of the parallel processing units and are used to route each hash index to either the histogram unit or the shuffle unit.
2. The hash partitioning accelerator of claim 1, wherein the histogram unit comprises a parallel increment unit configured to update the respective copies of the histogram data in parallel according to the plurality of hash indices.
3. The hash partitioning accelerator of claim 1, wherein the histogram unit comprises a plurality of first local memories for storing respective copies of the histogram data before and after the update.
4. The hash partitioning accelerator of claim 3, wherein the number of copies is 16 and, correspondingly, the number of first local memories is 16.
5. The hash partitioning accelerator of claim 2, wherein the histogram unit further comprises a reduction unit to integrate the updated tables from the respective first local memories into a single data-consistent table.
6. The hash partitioning accelerator of claim 1, wherein the shuffle unit comprises:
a plurality of parallel address-reading subunits, each reading a target address from the target address array according to a hash index;
a conflict-handling subunit which, given the plurality of target addresses, generates for each conflicting target address an offset based on the original target address and also generates a count of the identical target addresses;
a scatter subunit for moving each tuple to its correct position according to the offset and the original target address and, if there is no conflict, moving the tuple to the target address without offset;
and an update subunit which updates the target address according to the count value.
7. The hash partitioning accelerator of claim 6, wherein the conflict-handling subunit comprises a multiplexed XNOR network whose inputs are the conflicting target addresses and whose outputs are the offset and count value of each target address.
8. The hash partitioning accelerator of claim 6, wherein each of the address-reading subunits comprises a second local memory for storing a target address.
9. The hash partitioning accelerator of claim 1, further comprising a programming interface through which the histogram unit and the shuffle unit are operated externally.
CN201711469302.3A 2017-12-28 2017-12-28 Hash partitioning accelerator Active CN108196662B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201711469302.3A | 2017-12-28 | 2017-12-28 | Hash partitioning accelerator

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201711469302.3A | 2017-12-28 | 2017-12-28 | Hash partitioning accelerator

Publications (2)

Publication Number | Publication Date
CN108196662A | 2018-06-22
CN108196662B | 2021-03-30

Family

ID=62585660

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN201711469302.3A | Hash partitioning accelerator | 2017-12-28 | 2017-12-28 | Active

Country Status (1)

Country Link
CN (1) CN108196662B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11061609B2 (en) * 2018-08-02 2021-07-13 MemVerge, Inc Distributed memory object method and system enabling memory-speed data access in a distributed environment
CN109189808B (en) * 2018-09-18 2021-08-31 腾讯科技(深圳)有限公司 Data query method and related equipment
CN112306693B (en) * 2020-11-18 2024-04-16 支付宝(杭州)信息技术有限公司 Data packet processing method and device

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101692651B (en) * 2009-09-27 2014-12-31 中兴通讯股份有限公司 Method and device for Hash lookup table
CN106416151A (en) * 2014-05-30 2017-02-15 高通股份有限公司 Multi-table hash-based lookups for packet processing
CN106446140A (en) * 2016-09-20 2017-02-22 北京百度网讯科技有限公司 Method and device for data persistence

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8681813B2 (en) * 2011-11-29 2014-03-25 Wyse Technology L.L.C. Bandwidth optimization for remote desktop protocol


Non-Patent Citations (2)

Title
Sebastian Haas, Gerhard Fettweis. Energy-Efficient Hash Join Implementations in Hardware-Accelerated MPSoCs. The 43rd International Conference on Very Large Data Bases, 2017-09-01, Section 3. *
Changkyu Kim et al. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. Proceedings of the VLDB Endowment, 2009-08-31, Section 4.1.1. *

Also Published As

Publication Number | Publication Date
CN108196662A | 2018-06-22

Similar Documents

Publication Publication Date Title
TWI789547B (en) General matrix-matrix multiplication dataflow accelerator semiconductor circuit
CN108196662B (en) Hash partitioning accelerator
Brown et al. Implementing molecular dynamics on hybrid high performance computers–short range forces
CN108182084B (en) Hash mixing acceleration system and method for carrying out Hash connection by applying same
Chhugani et al. Fast and efficient graph traversal algorithm for cpus: Maximizing single-node efficiency
US10379766B2 (en) Access processor
Shahvarani et al. A hybrid B+-tree as solution for in-memory indexing on CPU-GPU heterogeneous computing platforms
TW201737092A (en) Deduplication memory module using dedupe DRAM system algorithm architecture and method thereof
JP7381429B2 (en) Storage system and method for accelerating hierarchical sorting around storage
Han et al. A novel ReRAM-based processing-in-memory architecture for graph traversal
CN108170253B (en) Combined device comprising Hash partitioning accelerator and memory
US20220114270A1 (en) Hardware offload circuitry
Olson et al. An FPGA Acceleration of Short Read Human Genome Mapping
CN111164580B (en) Reconfigurable cache architecture and method for cache coherency
US10872394B2 (en) Frequent pattern mining method and apparatus
Liu et al. 3D-stacked many-core architecture for biological sequence analysis problems
Finnerty et al. Dr. BFS: Data centric breadth-first search on FPGAs
Li et al. IMC-sort: In-memory parallel sorting architecture using hybrid memory cube
Cheng et al. Deploying hash tables on die-stacked high bandwidth memory
Jin et al. Optimizing an atomics-based reduction kernel on OpenCL FPGA platform
Morari et al. Efficient sorting on the tilera manycore architecture
Ajwani et al. I/O-optimal distribution sweeping on private-cache chip multiprocessors
Chung et al. A pattern recognition approach for speculative firing prediction in distributed saturation state-space generation
Sherer A Comparison of Two Architectures for Breadth-First Search on FPGA SoC Platforms
Yokota et al. Parameter tuning of a hybrid treecode-fmm on gpus

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant