WO2015176315A1 - Hash join method, device and database management system - Google Patents

Hash join method, device and database management system

Info

Publication number
WO2015176315A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
hash
data
preset
original data
Prior art date
Application number
PCT/CN2014/078304
Other languages
French (fr)
Chinese (zh)
Inventor
桑永嘉
李俊
施会华
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2014/078304 priority Critical patent/WO2015176315A1/en
Priority to CN201480037464.8A priority patent/CN105359142B/en
Publication of WO2015176315A1 publication Critical patent/WO2015176315A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • Hash join method, device, and database management system
  • The present invention relates to the field of database technologies, and more particularly to a hash join method, device, and database management system.
  • BACKGROUND With the development and application of database technology, the amount of data stored in a database has grown from megabytes (M) and gigabytes (G) to terabytes (T) and petabytes (P). Consequently, the amount of data a user faces when querying a database is on the order of G, T, or even P. Queries over such large volumes must still respond quickly, which poses a great challenge to database processing performance; database performance is therefore crucial in the query process.
  • The basic methods for implementing Join operations in a database are mainly Hash Join, Merge Join, and Radix Join, an improved algorithm based on Grace Join.
  • It mainly includes a grouping phase and a Join phase.
  • TLB: Translation Lookaside Buffer (page table buffer). A TLB entry refers to a page table entry cached in the TLB.
  • Severe TLB misses occur when the required page table entry is not in the TLB.
  • Existing queries use a multi-pass grouping method to reduce TLB misses in the grouping phase.
  • The most common query process is as follows: first, the data is grouped over multiple passes, with the hash value of the original data recalculated in each pass; then, once the multi-pass groups are obtained, the Join operation is performed.
  • An object of the embodiments of the present invention is to provide a hash join method, device, and database management system that overcome the waste of computing resources in the existing database query process.
  • the embodiment of the present invention provides the following technical solutions:
  • A first aspect of the embodiments of the present invention provides a hash join method, applied to a database, including: receiving a Structured Query Language (SQL) statement that includes a Join operation, and parsing it to obtain at least two target data groups to be joined;
  • performing N hash groupings in sequence on each data segment in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit representation of the hash value of the original data in the data segment calculated in the first hash grouping, the original data whose hash values have the same value at the bit positions designated for the current hash grouping is divided into the same group, and the original data divided into the same group is sorted by its position in the target data group and saved in that group, N being a positive integer greater than or equal to 1;
  • sorting the groups in ascending order of the hash values corresponding to the original data contained in each group;
  • performing the Join operation, in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined.
  • Performing the first of the N hash groupings on each data segment in each target data group based on the preset grouping rule includes: calculating the hash value of the original data contained in the data segment, and representing the calculated hash value in bits;
  • dividing the original data whose hash values have the same value at the designated bit positions into the same group, and sorting the original data in the same group by its position in the target data group and saving it in that group;
  • associating the undesignated bits of the hash value of each original data item with that original data and saving them;
  • Performing the second to n-th of the N hash groupings on the data segments in each target data group in sequence based on the preset grouping rule, that is, hash grouping the original data in any group obtained from the previous hash grouping, where n is a positive integer greater than 2 and is included in N, includes:
  • based on the bits left undesignated in the previous hash grouping and associated with the original data in the current group, dividing the original data whose hash values have the same value at the bit positions designated for the current hash grouping into the same group, and sorting the original data in the same group by its position in the target data group and saving it in that group;
  • The first type of preset grouping rule in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, or a preset total number of groups S, or both a preset number of hash groupings N and a preset total number of groups S;
  • when the preset grouping rule is the preset number of hash groupings N, the data segments in each target data group are hash grouped in sequence until the N-th hash grouping is completed;
  • when the preset grouping rule is the preset total number of groups S, the data segments in each target data group are hash grouped in sequence until the number of groups of each target data group equals the preset total number of groups S;
  • when the preset grouping rule is both N and S, and N has higher priority than S, the data segments in each target data group are hash grouped in sequence until N hash groupings are completed;
  • when the preset grouping rule is both N and S, and S has higher priority than N, the data segments in each target data group are hash grouped in sequence until the number of groups of each target data group equals the preset total number of groups S;
  • when the preset grouping rule is both N and S, and N and S have the same priority, the data segments in each target data group are hash grouped in sequence until N hash groupings are completed and the number of groups of each target data group equals the preset total number of groups S;
  • N is determined by the storage size of the page table buffer (TLB), is a positive integer greater than or equal to 1, and contains n;
  • S is determined by the size of the database cache and is a positive integer greater than or equal to 2;
  • the priority of the preset number of hash groupings N and the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.
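  • The stopping conditions of the first preset grouping rule can be sketched as a small predicate. This is an illustrative sketch only: the function name `should_stop`, its parameters, and the use of `>=` in place of strict equality are assumptions and are not taken from the disclosure.

```python
def should_stop(passes_done, group_count, n=None, s=None, priority="equal"):
    """Return True once the preset grouping rule is satisfied.

    n: preset number of hash groupings (None if not configured).
    s: preset total number of groups (None if not configured).
    priority: "n", "s", or "equal"; only consulted when both n and s are set.
    """
    n_met = n is not None and passes_done >= n
    s_met = s is not None and group_count >= s
    if n is None or s is None:
        return n_met or s_met      # only one limit is configured
    if priority == "n":
        return n_met               # N outranks S
    if priority == "s":
        return s_met               # S outranks N
    return n_met and s_met         # equal priority: both must hold
```

  • A grouping loop would call this predicate after every pass and continue while it returns False.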
  • The second type of preset grouping rule in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S; wherein the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1; m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2; when the data segments in each target data group are hash grouped, each hash grouping divides according to the preset number of groups m, so that the final number of hash groupings equals the preset number of hash groupings N and the total number of groups equals the preset total number of groups S.
  • Dividing each target data group into multiple data segments using a vector as the unit is specifically:
  • the vector is a unit of quantity, and one vector corresponds to one data segment; each target data group is divided in sequence into M data segments, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);
  • the first to (M-1)-th data segments contain the same number of original data items, and the M-th data segment contains no more original data items than each of the first to (M-1)-th data segments.
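  • As a minimal sketch of this division, the following assumes the vector size has already been chosen (the source ties it to the cache and TLB sizes but gives no formula); `split_into_segments` and `vector_size` are hypothetical names.

```python
import math

def split_into_segments(rows, vector_size):
    """Cut `rows` into consecutive data segments of at most `vector_size` items."""
    return [rows[i:i + vector_size] for i in range(0, len(rows), vector_size)]

rows = list(range(10))                      # ten original data items, identified by position
segments = split_into_segments(rows, vector_size=4)
m = math.ceil(len(rows) / 4)                # number of data segments M
```

  • Here the first two segments hold four items each and the last holds the remaining two, matching the rule that only the M-th segment may be shorter.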
  • Dividing the original data whose hash values have the same value at the designated bit positions into the same group, and sorting the original data in the same group by its position in the target data group and saving it in that group, includes: finding, in the current hash grouping, the original data whose hash values have the same value at the designated bit positions and dividing it into the same group, where the bits required for the current hash grouping are designated according to the size of the database cache and the storage size of the page table buffer (TLB); traversing the subscripts of the original data divided into the same group, where the subscript of each original data item identifies its position in the target data group;
  • arranging the original data in ascending order of subscript;
  • writing the original data into the same group and saving it in that ascending order.
  • Dividing the original data whose hash values have the same value at the bit positions designated for the current hash grouping into the same group, and sorting the original data in the same group by its position in the target data group and saving it in that group, includes:
  • designating the bits required for the current hash grouping according to the size of the database cache and the storage size of the TLB;
  • traversing the subscripts of the original data divided into the same group, where the subscript of each original data item identifies its position in the target data group;
  • arranging the original data in ascending order of subscript;
  • writing the original data into the same group and saving it in that ascending order.
  • taking two groups as a pair for the Join operation on original data, and performing the Join operation on the original data in each pair of groups from the two target data groups;
  • the manner of taking two groups as a pair for the Join operation on original data includes:
  • A second aspect of the embodiments of the present invention provides a hash join device, applied to a database, including: a receiving unit configured to receive a Structured Query Language (SQL) statement that includes a Join operation, and to parse it to obtain at least two target data groups to be joined;
  • a dividing unit configured to divide each target data group into a plurality of data segments using a vector as the unit;
  • a grouping unit configured to perform N hash groupings in sequence on each data segment in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit representation of the hash value of the original data in the data segment calculated in the first hash grouping, the original data whose hash values have the same value at the bit positions designated for the current hash grouping is divided into the same group, and the original data divided into the same group is sorted by its position in the target data group and saved in that group, N being a positive integer greater than or equal to 1;
  • a sorting unit configured to, for the groups obtained from each target data group after the N hash groupings, sort the groups within that target data group in ascending order of the hash values corresponding to the original data contained in each group;
  • a joining unit configured to perform the Join operation, pairwise and in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined.
  • A third aspect of the embodiments of the present invention provides a database management system, applied to a database, including:
  • a memory having a storage medium, wherein the memory stores a program for performing a database query; and a processor connected to the memory via a bus, wherein, when a database query is executed, the processor invokes the database query program stored in the memory and executes it according to the hash join method provided by the first aspect of the embodiments of the present invention described above.
  • Compared with the prior art, the embodiments of the present invention disclose a hash join method, device, and database management system.
  • Each target data group to be joined is divided into multiple data segments using a vector as the unit of quantity.
  • In each hash grouping, the original data whose hash values have the same value at the designated bit positions is divided into the same group, and the original data in the same group is sorted by its position in the target data group and saved in that group.
  • By using a vector as the unit of quantity and hash grouping on designated bits, the embodiments of the present invention can hash group multiple original data items at once and need not recalculate the hash value of the original data across multiple hash groupings, which reduces cache misses and avoids wasting computing resources on repeated hash calculation.
  • In addition, because the original data within each group is locally ordered, joining it has lower sorting complexity than joining randomly distributed original data.
  • FIG. 1 is a flowchart of a hash join method according to Embodiment 1 of the present invention
  • FIG. 2 is a schematic diagram of a third-time hash packet disclosed in Example 4 of the third embodiment of the present invention
  • FIG. 3 is a schematic diagram of the same original data included in each data segment disclosed in Embodiment 4 of the present invention
  • FIG. 5 is a schematic diagram of a grouping of raw data in a data segment according to Embodiment 4 of the present invention
  • FIG. 6 is a flowchart of dividing a group in a second to Nth hash grouping process according to Embodiment 4 of the present invention.
  • FIG. 8 is a schematic structural diagram of a hash join device according to Embodiment 5 of the present invention.
  • FIG. 9 is a schematic structural diagram of a database management system according to Embodiment 5 of the present invention.
  • DETAILED DESCRIPTION For reference and clarity, the explanations and abbreviations of the technical terms used below are summarized as follows:
  • TLB: Translation Lookaside Buffer (page table buffer)
  • TLB entry: a page table entry cached in the TLB
  • Cache miss: the requested data is not in the memory level being accessed.
  • An embodiment of the present invention provides a hash join method, device, and database management system, which use a vector as the unit of quantity and, in each subsequent grouping pass, hash group on the bits designated for the current hash grouping.
  • In each hash grouping, the original data in each group is sorted, so that the original data in each group obtained after grouping multiple data segments is locally ordered. When this locally ordered original data is joined, the sorting complexity is lower than when randomly distributed original data is joined.
  • The first embodiment of the present invention discloses a hash join method applied to a database.
  • The process comprises steps S101 to S105, as shown in the flowchart.
  • Step S101: receive a Structured Query Language (SQL) statement that includes a Join operation, and parse it to obtain at least two target data groups to be joined;
  • When step S101 is executed, the database parses the received SQL query statement containing the Join operation and obtains at least two target data groups to be joined. That is, target data groups to be joined come in pairs: parsing yields at least two of them, and they are parsed out pairwise.
  • Step S102: divide each target data group into a plurality of data segments using a vector as the unit of quantity;
  • In step S102, the same operation is performed on each of the parsed target data groups to be joined; one target data group is taken as an example to describe the division into data segments.
  • The current target data group is divided using the vector.
  • As a unit of quantity, the vector fixes how many original data items one vector contains.
  • The target data group is divided into a plurality of data segments using the vector as the unit of quantity, that is, one data segment corresponds to one vector.
  • The maximum number of original data items in one data segment is one vector unit; when the target data group is divided into multiple data segments, the segments usually contain the same number of original data items.
  • The number of items per data segment is bounded by the vector unit.
  • In this way, each target data group to be joined can be divided into a plurality of data segments.
  • A vectorized method is used: with a vector as the unit of quantity, hash values are calculated for all the original data in the vector at once, and the original data items that belong to the same group are then written to that group in a single operation, which reduces cache misses and improves Join performance.
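  • The vectorized step can be illustrated with the sketch below. It is an assumption-laden illustration, not the disclosed implementation: `toy_hash` is a stand-in 32-bit multiplicative hash (the source names no hash function), and `group_segment` is a hypothetical helper that hashes one whole segment before staging items per destination group and writing each batch once.

```python
from collections import defaultdict

def toy_hash(value):
    # Illustrative 32-bit multiplicative hash; an assumption, not from the source.
    return (value * 2654435761) & 0xFFFFFFFF

def group_segment(segment, groups, bits=2):
    """Hash one data segment (one vector) and append its items to `groups`."""
    hashes = [toy_hash(v) for v in segment]        # hash the whole vector first
    staged = defaultdict(list)
    for value, h in zip(segment, hashes):
        staged[h >> (32 - bits)].append(value)     # bucket by the top `bits` bits
    for key, batch in staged.items():
        groups[key].extend(batch)                  # one write per destination group

groups = defaultdict(list)
for segment in ([1, 2, 3, 4], [5, 6, 7, 8]):
    group_segment(segment, groups)
```

  • Because segments are processed in order and items are staged in order, each group in this sketch ends up in position order without a separate sort.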
  • Step S103: perform N hash groupings in sequence on the data segments in each target data group according to the preset grouping rule, where, in each hash grouping, based on the bit representation of the hash value of the original data in the data segment calculated in the first hash grouping, the original data whose hash values have the same value at the bit positions designated for the current hash grouping is divided into the same group, and the original data divided into the same group is sorted by its position in the target data group and saved in that group, N being a positive integer greater than or equal to 1;
  • In step S103, the data segments in each target data group are hash grouped N times in sequence based on the preset grouping rule, and hash grouping ends once the last data segment has been processed.
  • In the first hash grouping, hash values are calculated at once for all the original data contained in the data segment, and each hash value is represented in bits. The number of bits is determined by the word size of the computer on which the database is installed, that is, by the maximum width supported by that computer's CPU.
  • If the computer on which the database is installed is 32-bit, the hash value calculated for the original data in the first hash grouping is represented with 32 bits. If the computer is 64-bit, the hash value is represented with 64 bits.
  • The designated bits of each hash value are then compared, traversed, or searched to find hash values with the same value at the designated bit positions, and the original data corresponding to those hash values is divided into the same group. For example, if the first hash grouping requires 2 bits, then the two highest bits of each hash value are compared or traversed.
  • Within the group, the original data is sorted by its position in the target data group; this position can also be regarded as its position in the data segment.
  • For example, original data A, B, and C are divided into the same group. If A is ranked 3rd in the target data group, B is ranked 1st, and C is ranked 6th, then after sorting, the actual storage order of A, B, and C in the group is: B, A, C.
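  • The A, B, C example can be reproduced in a few lines. The tuple layout `(item, position)` is an assumption made only for this illustration.

```python
# (item, position in the target data group), as given in the example above
group = [("A", 3), ("B", 1), ("C", 6)]

# Sort by position so that the stored order follows the target data group.
stored = [item for item, position in sorted(group, key=lambda entry: entry[1])]
print(stored)  # ['B', 'A', 'C']
```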
  • The first hash grouping is performed in the same way for each data segment from first to last, and from the first hash grouping onward the designated bits are taken in sequence starting from the highest bit not yet designated.
  • Only the first of the N hash groupings needs to calculate the hash value of the original data; subsequent hash groupings use only the not-yet-designated bits of the hash values corresponding to the original data.
  • The preset grouping rule mentioned in step S103 is either: a preset number of hash groupings N, or a preset total number of groups S, or both N and S; or: a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S.
  • The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1; m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.
  • Step S104: for the groups obtained from each target data group after the N hash groupings, sort the groups within the target data group in ascending order of the hash values corresponding to the original data contained in each group;
  • In step S104, the groups in the target data group obtained after N hash groupings under the preset grouping rule are reordered.
  • The groups are sorted by the hash value of the original data they contain. For example, after the target data group is grouped, Group 1, Group 2, and Group 3 are obtained, where the original data in Group 1 has hash value 3, the original data in Group 2 has hash value 5, and the original data in Group 3 has hash value 0. After sorting, the order of the groups in the target data group is: Group 3, Group 1, Group 2.
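  • The group ordering in this example can be checked with a short sketch; the dictionary `group_hash` is a hypothetical representation that maps each group to the hash value of its original data.

```python
# Hash value of the original data held by each group, from the example above
group_hash = {"Group 1": 3, "Group 2": 5, "Group 3": 0}

# Order the groups by ascending hash value; items inside a group are untouched.
ordered = sorted(group_hash, key=group_hash.get)
print(ordered)  # ['Group 3', 'Group 1', 'Group 2']
```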
  • In each group obtained after N hash groupings under the preset grouping rule, the original data finally divided into the same group usually corresponds to the same hash value.
  • Step S105: perform the Join operation, in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined.
  • In step S105, for each target data group to be joined, whose groups have been sorted after the hash grouping of steps S102 to S104, the sorted groups are taken in sequence.
  • A group in one target data group to be joined is paired with a group in the other target data group to be joined, and the Join operation is performed on the ordered original data in each pair of groups.
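  • A hedged sketch of the pairing follows. The source does not specify how rows inside a pair of groups are matched, so the nested loop here is an assumption, as are the name `join_paired_groups` and the `(join_key, payload)` row layout.

```python
def join_paired_groups(left_groups, right_groups):
    """Join two grouped inputs.

    Each argument maps a group key (the designated hash bits) to a list of
    (join_key, payload) rows. Only groups with the same key are paired.
    """
    results = []
    for key in sorted(left_groups.keys() & right_groups.keys()):
        for l_key, l_payload in left_groups[key]:
            for r_key, r_payload in right_groups[key]:
                if l_key == r_key:     # equal hash bits do not imply equal keys
                    results.append((l_key, l_payload, r_payload))
    return results

left = {0: [(4, "a")], 3: [(7, "b"), (9, "c")]}
right = {3: [(9, "x"), (7, "y")], 5: [(1, "z")]}
joined = join_paired_groups(left, right)
```

  • Groups 0 and 5 have no counterpart on the other side and are skipped, so only the rows in group 3 are compared.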
  • Hash values are calculated one vector at a time, and the original data items belonging to the same group are then written to that group in a single operation.
  • Hash grouping in units of a vector avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance.
  • The hash value of each original data item is calculated only in the first grouping pass, and the bits to be used later are recorded at the associated position of the corresponding original data for use in subsequent grouping passes, which eliminates the cost of recalculating the hash value and avoids wasting resources.
  • The original data in each group is sorted as it is written into its group during each hash grouping.
  • Because the original data has already been partially sorted during the multi-pass grouping disclosed in the embodiments of the present invention, the original data in each group is locally ordered by the time the final sort is performed, so only the groups themselves need to be sorted. Compared with the prior art, which sorts both the original data within each group and the groups themselves after grouping is complete, this greatly reduces sorting complexity and the time spent sorting. Moreover, joining this locally ordered original data has lower sorting complexity than joining randomly distributed original data.
  • Based on the hash join method disclosed in the first embodiment of the present invention, the second embodiment describes in detail the N hash groupings mentioned in step S103.
  • The process of performing the first hash grouping on each data segment in each target data group in sequence based on the preset grouping rule includes:
  • Step S1031: calculate the hash value of the original data contained in the current data segment, and represent the calculated hash value in bits.
  • The target data group has been divided using a vector as the unit of quantity in step S102. Taking any one target data group as an example, when step S1031 is performed, the hash values of all the original data in the same data segment are calculated at once, and each hash value is represented in bits. As described in the first embodiment, the number of bits is related to the word size of the computer on which the database is installed and is determined by the maximum width supported by that computer's CPU.
  • Step S1032: in the first hash grouping, the bits required for the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer (TLB),
  • and, among the hash values of the original data in the data segment, the original data whose hash values have the same value at the designated bit positions is divided into the same group.
  • For example, if two bits are designated from the highest bit toward the lowest bit of the hash value, then the original data whose hash values have the same first two bits
  • is divided into the same group.
  • When original data falls into the same group, its position in the target data group is used to sort it within the current group.
  • For example, the same group contains original data A, B, and C, where A is at the 6th position of the target data group, B is at the 1st position, and C is at a position between those of B and A.
  • The order of the original data saved in the group after step S1033 is executed is then: B, C, A, so that the original data in each group obtained by each division is ordered.
  • Step S1033: associate the undesignated bits of the hash value of each original data item with that original data, and save them at the associated position of the original data;
  • After step S1032, step S1033 is performed: once the groups have been divided, the bits of each hash value that were not used in this hash grouping, that is, the undesignated bits, are saved at the associated position of the original data.
  • The associated position may be a storage space adjacent to the original data, or another storage space associated with the original data.
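  • The split between designated and undesignated bits can be illustrated on a single value. This is a sketch under assumptions: an 8-bit hash is used only to keep the bit patterns readable, and `take_top_bits` and the `record` dictionary are hypothetical stand-ins for the "associated position" described above.

```python
def take_top_bits(field, width, bits):
    """Split a `width`-bit field into its top `bits` bits and the remainder."""
    shift = width - bits
    designated = field >> shift                 # bits used by this hash grouping
    remainder = field & ((1 << shift) - 1)      # undesignated bits, kept for later
    return designated, remainder, shift

h = 0b10110100                                  # 8-bit hash value chosen for illustration
key, rest, rest_width = take_top_bits(h, width=8, bits=2)

# Save the unused bits next to the row so later passes never rehash it.
record = {"value": "row-1", "saved_bits": rest, "saved_width": rest_width}
```

  • Here the row goes to group `0b10`, and the six remaining bits `0b110100` are stored with it for the next hash grouping.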
  • If the preset grouping rule is met, grouping stops. If the preset grouping rule is not met, the original data in each group obtained from the first hash grouping continues to be hash grouped.
  • In the second to n-th hash groupings, the original data in any one group obtained from the previous hash grouping is hash grouped, where n is a positive integer greater than 2 and is included in N.
  • The process of performing the second to n-th of the N hash groupings on the data segments in each target data group in sequence based on the preset grouping rule includes:
  • Step S1034: according to the bits left undesignated in the previous hash grouping and saved at the associated position of the original data in the current group, divide the original data whose hash values have the same value at the bit positions designated for the current hash grouping into the same group, and sort the original data in the same group by its position in the target data group and save it in that group;
  • In step S1034, the bits required for the current hash grouping are designated from the bits saved at the associated position of the original data, and the original data whose hash values have the same value at those bit positions is assigned to the same group.
  • At the same time, the original data placed in the same group is sorted within the current group by its position in the target data group.
  • Step S1035: save the remaining undesignated bits associated with each original data item at the associated position of the original data again;
  • In step S1035, the remaining undesignated bits are saved again at the associated position of the original data for use in subsequent hash groupings.
  • The bits currently held at the associated position of the original data are the unused bits remaining after step S1032. If the current hash grouping still uses two bits, those two bits are taken from the highest of the currently remaining bits toward the lowest.
  • After steps S1034 and S1035 are performed, if the current grouping does not yet satisfy the preset grouping rule, the process loops back to steps S1034 and S1035 until the rule is satisfied and hash grouping of the current target data group stops.
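  • The loop over steps S1034 and S1035 can be sketched end to end as follows. Everything here is illustrative: `toy_hash` is an assumed 32-bit hash, the stopping rule is reduced to a fixed number of passes, and tuples of `(position, value, saved_bits)` stand in for the association between original data and its unused bits.

```python
HASH_BITS = 32  # assumed width of the hash value

def toy_hash(value):
    # Illustrative 32-bit multiplicative hash; the source names no hash function.
    return (value * 2654435761) & 0xFFFFFFFF

def split(members, width, bits):
    """Split (position, value, field) records on the top `bits` of `field`."""
    shift = width - bits
    groups = {}
    for pos, value, field in members:
        rest = field & ((1 << shift) - 1)          # undesignated bits, saved again
        groups.setdefault(field >> shift, []).append((pos, value, rest))
    for group in groups.values():
        group.sort(key=lambda m: m[0])             # order by position in the input
    return groups

def multi_pass(values, passes, bits=2):
    records = [(pos, v, toy_hash(v)) for pos, v in enumerate(values)]  # hashed once
    groups, width = {0: records}, HASH_BITS
    for _ in range(passes):
        next_groups = {}
        for prefix, members in groups.items():
            for key, sub in split(members, width, bits).items():
                next_groups[(prefix << bits) | key] = sub
        groups, width = next_groups, width - bits
    return {k: [(pos, v) for pos, v, _ in g] for k, g in groups.items()}
```

  • With `passes=3` and `bits=2`, each final group key equals the top six bits of the hash, even though `toy_hash` is called only once per item.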
  • the target data group is grouped to satisfy the preset grouping rule, and the original data divided in the same group is sorted in each grouping process, so that each time the hash grouping process is obtained Although the grouping results are disordered as a whole, they are ordered in each group obtained.
  • the sorting complexity is lower than the randomly assigned raw data. Sorting complexity when joining.
  • Furthermore, the hash value of each original data item is calculated only in the first grouping pass, and the bits to be used later are recorded at the associated position of the corresponding original data item so that subsequent grouping passes can use them directly. This eliminates the cost of repeatedly calculating hash values and avoids wasting resources.
  • In addition, the original data in each group are sorted during every grouping pass, so that after the last hash grouping is completed the original data within each group are already locally ordered, and only the groups obtained by hash grouping the target data group remain to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping is complete, this reduces the sorting complexity and the time consumed by sorting.
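  • The loop over steps S1034 and S1035 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the 8-bit hash width, the two bits consumed per pass, the toy hash function, and the names `hash8`, `split_bits`, `first_pass`, and `next_pass` are all hypothetical. It shows the hash being computed once per item, the remaining bits being carried along with each item, and each group staying in position order.

```python
from collections import defaultdict

HASH_BITS = 8      # assumed hash width (illustrative only)
BITS_PER_PASS = 2  # assumed number of bits consumed by each grouping pass

hash_calls = 0     # counts hash computations to show there is one per item


def hash8(value):
    """Toy 8-bit hash, a stand-in for the real hash function."""
    global hash_calls
    hash_calls += 1
    return (value * 37) & 0xFF


def split_bits(bits, width):
    """Return (highest BITS_PER_PASS bits, remaining low bits, remaining width)."""
    rest_width = width - BITS_PER_PASS
    return bits >> rest_width, bits & ((1 << rest_width) - 1), rest_width


def first_pass(values):
    """Hash every item once, group by the top bits, keep the unused bits per item."""
    groups = defaultdict(list)
    for pos, value in enumerate(values):
        key, rest, width = split_bits(hash8(value), HASH_BITS)
        groups[(key,)].append((pos, value, rest, width))
    return dict(groups)


def next_pass(groups):
    """Regroup every group using only the saved bits; no hash is recomputed."""
    out = defaultdict(list)
    for prefix, items in groups.items():
        for pos, value, rest, width in items:  # items are already in position order
            key, rest2, width2 = split_bits(rest, width)
            out[prefix + (key,)].append((pos, value, rest2, width2))
    return dict(out)


values = [5, 12, 7, 5, 30, 12, 9, 1]
groups = first_pass(values)
for _ in range(2):  # two further passes, three passes in total
    groups = next_pass(groups)
```

  • Because each pass appends items in the order in which they are visited, the position order established in the first pass is preserved inside every later group without a separate sort.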
  • Building on the hash join method of the first and second embodiments of the present invention, the preset grouping rule mentioned in step S103 shown in FIG. 1 is now described in detail.
  • When the preset grouping rule is the preset number of hash groupings N, grouping of the target data group stops after N hash groupings have been completed.
  • The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1.
  • For example, the value of N is 4.
  • The second to Nth hash groupings are performed on the original data in each group obtained after the previous hash grouping, following the process disclosed in the first embodiment of the present invention.
  • Once the Nth hash grouping is complete, hash grouping of the target data group stops. The number of groups obtained at that point is the number of groups of the target data group.
  • When the preset grouping rule is the preset total number of groups S, the data segments in each target data group are hash-grouped in turn until the number of groups of each target data group equals S.
  • The value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.
  • Example 2: When the preset total number of groups into which the current target data group may be divided, as determined by the size of the database cache, is 10, the first hash grouping is performed on the current target data group. If the number of groups obtained after the first hash grouping is less than 10, hash grouping continues until the number of groups of the current target data group reaches 10, at which point hash grouping stops.
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the data segments in each target data group are hash-grouped in turn until N hash groupings are completed;
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the data segments in each target data group are hash-grouped in turn until the number of groups of each target data group equals the preset total number of groups S;
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N and S have the same priority, the data segments in each target data group are hash-grouped in turn until N hash groupings are completed and the number of groups of each target data group equals the preset total number of groups S;
  • The priorities of the preset number of hash groupings N and the preset total number of groups S are determined by the storage size of the TLB and the size of the cache.
  • Example 3: The preset number of hash groupings determined by the storage size of the page table buffer (TLB) is 3, and the preset total number of groups determined by the size of the database cache is 16.
  • When the priority of the preset number of hash groupings N is the same as the priority of the preset total number of groups S, the total number of groups obtained after the target data group has been grouped 3 times is exactly 16;
  • When the priority of N is higher than that of S, the total number of groups obtained after the target data group has been grouped 3 times may be less than 16, equal to 16, or greater than 16;
  • When the priority of S is higher than that of N, the number of hash groupings that have been performed on the target data group by the time the total number of groups reaches 16 may be greater than 3, less than 3, or equal to 3.
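  • The stopping conditions above can be summarised in a small Python sketch. The function name `should_stop`, its parameters, and the use of "greater than or equal to" in place of strict equality are assumptions made for illustration; the text itself only states that grouping stops when N passes are complete and/or the group count equals S, depending on priority.

```python
def should_stop(passes_done, group_count, n=None, s=None, priority="equal"):
    """Decide whether hash grouping of a target data group should stop.

    n        -- preset number of hash groupings N (None if not part of the rule)
    s        -- preset total number of groups S (None if not part of the rule)
    priority -- "n", "s", or "equal"; only consulted when both n and s are given
    """
    n_met = n is not None and passes_done >= n
    s_met = s is not None and group_count >= s
    if n is None or s is None:
        # Only one criterion is configured, so it alone decides.
        return n_met or s_met
    if priority == "n":
        return n_met
    if priority == "s":
        return s_met
    # Equal priority: both criteria must hold.
    return n_met and s_met


# Example 3 settings: N = 3, S = 16, with 8 groups after 3 passes.
stop_n_first = should_stop(3, 8, n=3, s=16, priority="n")      # True
stop_s_first = should_stop(3, 8, n=3, s=16, priority="s")      # False
stop_equal = should_stop(3, 8, n=3, s=16, priority="equal")    # False
```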
  • When the preset grouping rule includes a preset number of hash groupings N, a preset number of groups m per hash grouping, and a preset total number of groups S, where the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, m is less than N, and the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2: the data segments in each target data group are hash-grouped in turn, each hash grouping producing the preset number of groups m, so that the final number of hash groupings equals the preset number of hash groupings N and the total number of groups obtained equals the preset total number of groups S.
  • Example 4: As shown in FIG. 2, the preset number of hash groupings determined by the storage size of the page table buffer (TLB) is 3, the number of groups per hash grouping is 2, and the preset total number of groups determined by the size of the database cache is 16.
  • In the first hash grouping, each data segment is divided into two groups, and the original data are written into the corresponding groups;
  • In each subsequent hash grouping, every group produced by the previous grouping is again divided into two and written into the corresponding groups, and so on, until hash grouping of the target data group has been completed and 16 groups have been obtained.
  • The above mainly explains the preset grouping rule on which the hash grouping mentioned in step S103 shown in FIG. 1 is based.
  • The preset grouping rule is determined mainly by the storage size of the page table buffer (TLB) of the computer on which the database runs and by the size of the database cache. Grouping according to the preset grouping rule can avoid cache misses during grouping, which in turn improves the performance of the subsequent Join.
  • In the hash join method of the first to third embodiments of the present invention, step S102 shown in FIG. 1 divides the target data group into a plurality of data segments using a vector as the unit of quantity.
  • The specific process includes:
  • With a vector as the unit of quantity and one vector corresponding to one data segment, each target data group is divided in turn into M data segments, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);
  • The first to (M-1)th data segments contain the same number of original data items, and the Mth data segment contains a number of original data items less than or equal to the number contained in each of the first to (M-1)th data segments.
  • For example, the target data group to be hash-grouped contains 25 original data items in total. With a vector as the unit of quantity and each vector containing 5 original data items, every 5 original data items form one data segment.
  • Divided by the vector unit, the target data group containing 25 original data items yields five data segments.
  • The 1st to 5th data segments contain the same number of original data items; FIG. 3 shows this case, in which every data segment contains the same number of original data items.
  • As another example, the target data group to be hash-grouped contains 28 original data items in total. With a vector as the unit of quantity and each vector containing 5 original data items, every 5 original data items form one data segment.
  • Divided by the vector unit, the target data group containing 28 original data items yields six data segments.
  • The 1st to 5th data segments contain the same number of original data items, and the 6th data segment contains 3 original data items, fewer than the number contained in each of the 1st to 5th data segments.
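  • The two examples above can be reproduced with a short Python sketch. The function name `split_into_segments` and the use of a plain list for the target data group are assumptions made for illustration; the vector size of 5 comes from the examples.

```python
def split_into_segments(target_data, vector_size):
    """Divide a target data group into data segments of one vector each.

    Every segment holds vector_size items except possibly the last one,
    which holds whatever remains.
    """
    return [
        target_data[start:start + vector_size]
        for start in range(0, len(target_data), vector_size)
    ]


segments_25 = split_into_segments(list(range(25)), 5)  # five segments of 5 items
segments_28 = split_into_segments(list(range(28)), 5)  # five segments of 5, one of 3
```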
  • In the hash join method of Embodiment 2 of the present invention, the step disclosed above divides the original data into the same group and sorts and saves the original data divided into the same group according to the position of each original data item in the target data group.
  • The specific process is shown in FIG. 4, including: the hash value;
  • Step S202: Search for the original data corresponding to hash values that have the same value at the bit positions specified for the current hash grouping, and divide those original data into the same group;
  • The hash value is represented in bits.
  • The hash values are looked up at the specified bits.
  • The specified bits may be designated, before the current grouping is performed, according to the size of the database cache and the storage size of the page table buffer (TLB). Alternatively, when the request for hash grouping is received, the bits to be used in each subsequent grouping pass may be designated in advance according to the size of the database cache and the storage size of the page table buffer (TLB). In that case nothing needs to be re-specified when the current grouping is performed, and the lookup is made directly at the bit positions required for the current hash grouping.
  • Step S203: Traverse the subscripts of the original data to be divided into the same group, where the subscript of each original data item identifies its position in the target data group;
  • Step S204: Arrange the original data corresponding to the subscripts in ascending order of subscript;
  • Step S205: Write the original data into the same group in that ascending order and save them.
  • Steps S203 to S205 sort the original data divided into the same group and write them into that group during the grouping pass, so that the data are locally ordered while the target data group is being grouped.
  • As shown in FIG. 5, the hash values of the original data in one data segment, taken in units of a vector (the dashed box in FIG. 5), are calculated together.
  • In FIG. 5, value is the actual value that participates in the Join;
  • position represents the position of each original data item in the entire data segment;
  • position-1 represents the subscript of each original data item after it has been divided into a group and sorted within that group;
  • hash Value represents the hash value corresponding to the original data item.
  • The original data to be written into the current group are sorted at the same time as they are written into the current group.
  • The next adjacent vector is then processed in the same way, until all vectors in the target data group have completed the current hash grouping.
  • In this way, locally ordered groups are obtained after the first hash grouping of the target data group. This takes over part of the burden of sorting the original data when each group is finally sorted, and thus reduces the sorting complexity of the groups.
  • After the first hash grouping has been performed in the above manner on every data segment of the target data group being grouped, further grouping stops if the current grouping satisfies the preset grouping rule.
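  • A minimal Python sketch of the first hash grouping of one vector (steps S202 to S205) is given below. The names `group_vector` and `start_pos`, the 8-bit hash width, the use of the two highest bits as the specified bits, and the identity hash used in the example are assumptions made for illustration only.

```python
from collections import defaultdict


def group_vector(segment, start_pos, hash_fn, hash_bits=8, used_bits=2):
    """Group one vector of original data by the specified (highest) hash bits.

    segment   -- the original data items of one vector
    start_pos -- position of the first item within the target data group
    Returns {group key: [(position, value, unused bits), ...]}, with every
    group ordered by ascending position (subscript).
    """
    shift = hash_bits - used_bits
    buckets = defaultdict(list)
    for offset, value in enumerate(segment):
        h = hash_fn(value) & ((1 << hash_bits) - 1)
        unused = h & ((1 << shift) - 1)  # kept with the item for later passes
        # S202: items whose specified bits agree fall into the same group
        buckets[h >> shift].append((start_pos + offset, value, unused))
    # S203 to S205: order each group by subscript before it is written out
    return {
        key: sorted(items, key=lambda item: item[0])
        for key, items in buckets.items()
    }


# Identity hash on small integers, chosen only so the bits are easy to read.
result = group_vector([200, 3, 130, 66, 7], start_pos=10, hash_fn=lambda v: v)
```

  • In this example the values 3 and 7 share the top two bits `00`, so they land in the same group in position order 11, 14, each carrying its six unused bits for the next pass.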
  • In the hash join method of the second embodiment of the present invention, step S1034 disclosed above uses the bits left unspecified by the previous hash grouping, saved at the associated position of each original data item in the current group, to divide the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group, and sorts and saves the original data divided into the same group according to the position of each original data item in the target data group.
  • The specific process is shown in FIG. 6, including:
  • Step S301: Retrieve the bits left unspecified by the previous hash grouping, which are saved at the associated position of each original data item in the group currently being hash-grouped;
  • The current group is any one of the groups obtained after the previous hash grouping. The unspecified bits from the previous hash grouping, saved at the associated position of each original data item in the current group, are retrieved so that the current group can be hash-grouped again.
  • Step S302: Determine, from the retrieved unspecified bits, the bits required for the current hash grouping, where the bits required for the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer (TLB);
  • Step S303: Search for the original data corresponding to hash values that have the same value at the bit positions specified for the current hash grouping, and divide those original data into the same group;
  • Step S304: Traverse the subscripts of the original data to be divided into the same group, where the subscript of each original data item identifies its position in the target data group;
  • Step S305: Arrange the original data corresponding to the subscripts in ascending order of subscript;
  • Step S306: Write the original data into the same group in that ascending order and save them.
  • The sorting of the original data divided into the same group in steps S304 to S306 is the same as in steps S203 to S205 of FIG. 4 above and is not described again here.
  • Steps S301 to S306 above are performed on each group obtained by the previous hash grouping of the target data group, thereby obtaining new groups whose internal original data are ordered after the renewed hash grouping.
  • If the preset grouping rule is satisfied, hash grouping stops. If the preset grouping rule is not satisfied, step S301 to step S303 are performed to group the groups obtained by the previous hash grouping again, until the preset grouping rule is satisfied.
  • In the hash join method of the first to third embodiments of the present invention, step S105 disclosed above takes, in the sorted order, the original data in each group obtained after N hash groupings of the two target data groups to be joined and performs the Join. The specific process includes:
  • Step S501: Acquire, in order, each group obtained after N hash groupings of the two target data groups to be joined;
  • Step S501 is executed to obtain each group in the two target data groups to be joined.
  • Step S502: Perform the Join operation on the original data in each group of the two target data groups, pairing the groups two by two and joining the original data of each pair.
  • The way in which groups are paired two by two and the original data in each group of the two target data groups to be joined are joined is shown in FIG. 7 and includes:
  • Step S503: Using one group in one target data group, traverse each group in the other target data group in turn;
  • Step S504: Determine whether the current group has traversed to the same group in the other target data group; if yes, perform step S505, and if no, perform step S507;
  • Step S505: If the same group is reached, Join the original data in the group, in order, with the original data in the same group, where the same group means a group in which the hash value of the stored original data is the same as the hash value of the original data stored in the group used for traversal;
  • Step S506: Determine whether all the original data in either of the two groups currently being joined have completed the Join operation; if yes, perform step S507; if not, continue the Join operation on the original data in the two groups and return to step S506;
  • Step S507: Move to the next group and return to step S503;
  • The above is the process of performing grouping and the Join in the hash join.
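  • The group-by-group Join of steps S503 to S507 can be sketched in Python as follows. This is an illustration under assumptions, not the disclosed implementation: each target data group is modelled as a dictionary from a group's hash key to its list of join values, the name `join_groups` is hypothetical, and a dictionary lookup is swapped in for the explicit traversal of the other target data group's groups.

```python
def join_groups(groups_a, groups_b):
    """Join two partitioned target data groups, one pair of matching groups at a time.

    groups_a, groups_b -- {hash key of the group: [join values in the group]}
    Returns the list of matching (value from a, value from b) pairs.
    """
    result = []
    for key in sorted(groups_a):       # S503: take the groups in sorted order
        other = groups_b.get(key)      # S504: is there a same group on the other side?
        if other is None:
            continue                   # S507: no same group, move to the next group
        for a in groups_a[key]:        # S505/S506: join the data of the two groups
            for b in other:
                if a == b:             # equal hash bits do not imply equal values
                    result.append((a, b))
    return result


groups_a = {0: [4, 8], 1: [5], 3: [7]}
groups_b = {0: [8, 12], 1: [5, 5], 2: [6]}
pairs = join_groups(groups_a, groups_b)
```

  • Only groups with the same key are ever compared, so original data in group 3 of the first target data group and group 2 of the second are never touched by the inner loops.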
  • In the first grouping pass, the hash values of the original data in each vector unit are calculated together, and the several original data items that belong to the same group are then written into the corresponding group in a single operation.
  • The bits to be used later are recorded at the associated position of the corresponding original data item for use in subsequent grouping passes, which eliminates the cost of repeatedly calculating hash values and avoids wasting resources.
  • In addition, the original data in each group are sorted before they are written into the corresponding group, and the groups themselves are sorted after hash grouping is complete. The final sorting after grouping therefore no longer has to sort both the groups and the data inside them, which lightens the sorting burden and reduces the time consumed by sorting.
  • The hash join apparatus is applied to a database and mainly includes: a receiving unit 101, a dividing unit 102, a grouping unit 103, a sorting unit 104, and a joining unit 105.
  • The receiving unit 101 is configured to receive a structured query language (SQL) statement containing a Join operation, and to parse it to obtain at least two target data groups to be joined;
  • The target data groups then pass through the dividing unit 102, the grouping unit 103, and the sorting unit 104 for division, grouping, and sorting, and then enter the joining unit 105, where the Join operation is performed on the two grouped target data groups to be joined.
  • The dividing unit 102 is configured to divide each target data group into a plurality of data segments using a vector as the unit of quantity;
  • The grouping unit 103 is configured to perform N hash groupings in turn on the data segments in each target data group according to a preset grouping rule, where in each hash grouping, based on the hash values, represented in bits, that are calculated from the original data in the data segment during the first hash grouping, the original data whose hash values have the same value at the bit positions specified for the current hash grouping are divided into the same group, and the original data divided into the same group are sorted;
  • N is a positive integer greater than or equal to 1;
  • The sorting unit 104 is configured to take the groups obtained after N hash groupings of each target data group and, within the target data group, sort the groups in ascending order of the hash values corresponding to the original data contained in each group;
  • The joining unit 105 is configured to take, in the sorted order, the original data in each group obtained after N hash groupings of the two target data groups to be joined, and to perform the Join operation.
  • The grouping unit 103 includes: a first hash grouping module 1031, configured to hash-group the data segments in the target data group from top to bottom; and a multiple hash grouping module 1032, configured to hash-group the original data in any group obtained after the previous hash grouping.
  • The first hash grouping module 1031 is configured to: calculate the hash values of the original data contained in the current data segment and represent the calculated hash values in bits; divide the original data whose hash values have the same value at the specified bit positions into the same group, and sort and save the original data divided into the same group according to the position of each original data item in the target data group; and associate the bits left unspecified in the hash value of each original data item with that original data item and save them;
  • The multiple hash grouping module 1032 is configured to: using the bits left unspecified by the previous hash grouping, which are associated with and saved for the original data in the current group, divide the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group, and sort and save the original data divided into the same group according to the position of each original data item in the target data group; and save again the remaining unspecified bits associated with each original data item.
  • When the preset grouping rule is a preset number of hash groupings N, the grouping unit is configured to hash-group the data segments in each target data group in turn until N hash groupings are completed;
  • When the preset grouping rule is a preset total number of groups S, the grouping unit is configured to hash-group the data segments in each target data group in turn until the number of groups of each target data group equals the preset total number of groups S;
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the grouping unit is configured to hash-group the data segments in each target data group in turn until N hash groupings are completed;
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the grouping unit is configured to hash-group the data segments in each target data group in turn until the number of groups of each target data group equals the preset total number of groups S;
  • When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N and S have the same priority, the grouping unit is configured to hash-group the data segments in each target data group in turn until N hash groupings are completed and the number of groups of each target data group equals the preset total number of groups S;
  • When the preset grouping rule includes the preset number of hash groupings N, a preset number of groups m per hash grouping, and the preset total number of groups S, the grouping unit is configured to perform each hash grouping according to the preset number of groups per hash grouping, so that the final number of hash groupings equals the preset number of hash groupings and the total number of groups obtained equals the preset total number of groups;
  • Here, the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, N contains n, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2; the priorities of the preset number of hash groupings N and the preset total number of groups S are determined by the storage size of the TLB and the size of the cache.
  • The execution process and principle of the dividing unit 102 shown in FIG. 8 are the same as in the description above of dividing each target data group into a plurality of data segments with a vector as the unit of quantity, and are not repeated here. It mainly includes:
  • a first dividing module, configured to use a vector as the unit of quantity, with one vector corresponding to one data segment, and to divide each target data group in turn into M data segments, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);
  • The first to (M-1)th data segments contain the same number of original data items, and the Mth data segment contains a number of original data items less than or equal to the number contained in each of the first to (M-1)th data segments.
  • For the first hash grouping module 1031, which divides the original data corresponding to hash values with the same value at the specified bit positions into the same group and sorts and saves the original data divided into the same group according to the position of each original data item in the target data group, the specific execution process and principle can be found in the detailed description of the first hash grouping disclosed in the third embodiment of the present invention and are not repeated here. It mainly includes: the hash value represented in bits;
  • a first search sub-module, configured to search for the original data corresponding to hash values that have the same value at the bit positions specified for the current hash grouping and to divide those original data into the same group, where the bits required for the current hash grouping are specified according to the size of the database cache and the storage size of the page table buffer (TLB);
  • a first traversal sub-module, configured to traverse the subscripts of the original data divided into the same group, where the subscript of each original data item identifies its position in the target data group;
  • a module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;
  • a first sorting sub-module, configured to write the original data into the same group in that ascending order and save them.
  • For the multiple hash grouping module 1032, which divides the original data corresponding to hash values with the same value at the bit positions specified for the current hash grouping into the same group and sorts and saves the original data divided into the same group according to the position of each original data item in the target data group, the specific execution process and principle can be found in the detailed description of the multiple hash groupings disclosed in the first to fourth embodiments of the present invention and are not repeated here.
  • a determining sub-module, configured to determine, from the retrieved unspecified bits, the bits to be used in the current hash grouping, where the bits used in the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer (TLB);
  • a second search sub-module, configured to search for the original data corresponding to hash values that have the same value at the bit positions specified for the current hash grouping and to divide those original data into the same group;
  • a second traversal sub-module, configured to traverse the subscripts of the original data divided into the same group, where the subscript of each original data item identifies its position in the target data group; a module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;
  • a second sorting sub-module, configured to write the original data into the same group in that ascending order and save them.
  • The execution process and principle of the joining unit 105 can be found in the detailed description of the Join operation in the fourth embodiment of the present invention and are not repeated here. It mainly includes:
  • an obtaining module, configured to acquire, in order, each group obtained after N hash groupings of the two target data groups to be joined;
  • a Join module, configured to perform the Join operation on the original data in each group of the two target data groups, pairing the groups two by two and joining the original data of each pair;
  • The Join module includes:
  • a third traversal sub-module, configured to use one group in one target data group to traverse each group in the other target data group in turn; if the same group is reached, the first Join sub-module is executed; if the same group is not reached, the module moves to the next group and returns to the second traversal sub-module, until all groups in the target data group have traversed each group in the other target data group;
  • a first Join sub-module, configured to Join the original data in the traversing group, in order, with the original data in the same group, where the same group means a group in which the hash value of the stored original data is the same as the hash value of the original data stored in the group used for traversal; after the original data in the group have completed the Join operation, the module moves to the next group and returns to the third traversal sub-module.
  • Embodiment 5 of the present invention discloses a hash join apparatus that executes the hash join method described above. With the units and modules disclosed above, when a target data group is hash-grouped, the hash values are calculated one vector at a time, and the several original data items that belong to the same group are then written into the corresponding group in a single operation. Grouping by vector avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance. Moreover, the hash value of each original data item is calculated only in the first grouping pass, and the bits to be used later are recorded at the associated position of the corresponding original data item for use in subsequent grouping passes, which eliminates the cost of repeatedly calculating hash values and avoids wasting resources.
  • The hash join method described in connection with the embodiments disclosed herein can be implemented in a data management system directly in hardware, in memory executed by a processor, or in a combination of the two. Accordingly, the present invention also discloses a data management system corresponding to the method and apparatus disclosed in the above embodiments. Specific embodiments are described in detail below.
  • the data management system 1 includes a memory 11 and a processor 13 connected to the memory 11 via a bus 12.
  • the memory 11 has a storage medium in which a program for performing a database query is stored.
  • the memory 11 may contain high speed RAM memory and may also include non-volatile memory such as at least one disk memory.
  • the processor 13 is connected to the memory 11 via the bus 12, and the processor 13 calls the database query program stored in the memory 11 when performing a database query.
  • the database query program may include program code, and the program code includes a series of operation instructions arranged in a certain order.
  • The processor 13 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • the program for performing data scheduling invoked by the processor 13 may specifically include:
  • performing N hash groupings in turn on the data segments in each target data group based on a preset grouping rule, where in each hash grouping, based on the hash values, represented in bits, that are calculated from the original data in the data segment during the first hash grouping, the original data whose hash values have the same value at the bit positions specified for the current hash grouping are divided into the same group, and the original data divided into the same group are sorted;
  • sorting the groups in ascending order of the hash values corresponding to the original data contained in each group;
  • taking, in the sorted order, the original data in each group obtained after N hash groupings of the target data groups to be joined, and performing the Join operation.
  • In summary, the embodiments of the present invention perform hash grouping with a vector as the unit of quantity and, in subsequent grouping passes, use the bits left unspecified by the previous hash grouping. Several original data items can thus be hash-grouped at the same time, and the hash values of the original data need not be recalculated over multiple grouping passes. That is, cache misses are reduced, and the waste of computing resources caused by repeatedly calculating hash values is avoided.
  • In addition, the original data divided into the same group are sorted during grouping, which reduces the complexity of subsequently sorting each group.
  • The resulting sorting complexity is lower than the sorting complexity incurred when randomly distributed original data are joined.

Abstract

A hash join method, device and database management system. The method comprises: when dividing a target data group during a database query, using a vector as the unit of quantity to divide the group into data segments and calculate the hash values of the original data in each data segment, the hash values being represented in bits; during hash grouping, placing the original data whose hash values agree on the specified bits in the same subgroup based on a preset grouping rule, and carrying out each subsequent hash grouping with the bits left unspecified in the previous hash grouping; during grouping, sorting the original data in the same subgroup according to their positions in the target data group; and performing a join operation on the grouped and sorted original data in the corresponding subgroups of the target data groups to be joined, thereby reducing the complexity of the subsequent sorting of each subgroup.

Description

Hash join method, device and database management system
TECHNICAL FIELD
The present invention relates to the field of database technologies, and more particularly to a hash join method, device and database management system.
BACKGROUND
With the development and application of database technology, the amount of data stored in a database has grown from megabytes (M) and gigabytes (G) to today's terabytes (T) and petabytes (P). Given the amount of data that current databases can store, users querying a database face data volumes at the G, T or even P level. Querying such a large amount of data while still responding quickly poses a great challenge to the processing performance of the database, and what matters most to database performance is the response time for processing the Join operations contained in a query.
The basic methods for implementing the Join operation in a database are mainly Hash Join, Merge Join, and the Radix Join (clustered join) algorithm, which is an improvement on Grace Join. A query mainly consists of two parts, grouping and Join. When the number of groups is greater than the number of TLB entries of the CPU (TLB: Translation Lookaside Buffer, a page table buffer; a TLB entry is a page table entry cached in the TLB), grouping causes severe TLB misses (the required page table entry is not in the TLB). To avoid this problem, most existing queries use multi-pass grouping in the grouping phase to reduce TLB misses. The most common query process at present is as follows: first, grouping is carried out in multiple passes, with a hash calculation performed on the original data in every pass; then, once the multi-pass groups have been obtained, the Join operation is performed.
As can be seen from the above, in existing database query processing, the multi-pass grouping used in the grouping phase has to calculate hash values multiple times, which may cause a large number of cache misses (the requested data is not in the memory level being accessed) and wastes computing resources.
SUMMARY
In view of this, an object of the embodiments of the present invention is to provide a hash join method, device and database management system that overcome the waste of computing resources in existing database query processing.
To achieve the above object, the embodiments of the present invention provide the following technical solutions:
A first aspect of the embodiments of the present invention provides a hash join method, applied to a database, including:
receiving a Structured Query Language (SQL) statement that contains a Join operation, and parsing it to obtain at least two target data groups to be joined;
dividing each target data group into multiple data segments, using a vector as the unit of quantity;
performing N hash groupings in sequence on the data segments in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit-represented hash values calculated for the original data in the data segment during the first hash grouping, the original data whose hash values agree on the bits specified for the current hash grouping are placed in the same subgroup, and the original data within each subgroup are sorted according to their positions in the target data group and saved, N being a positive integer greater than or equal to 1;
sorting, within each target data group, the subgroups obtained after the N hash groupings in ascending order of the hash values corresponding to the original data contained in each subgroup;
taking, in the sorted order, the original data in the subgroups obtained after the N hash groupings of the two target data groups to be joined, and performing the Join operation on them.
In a first implementation of the first aspect of the embodiments of the present invention, the first of the N hash groupings performed in sequence on the data segments in each target data group based on the preset grouping rule includes:
calculating the hash values of the original data contained in the current data segment, and representing the calculated hash values in bits;
placing the original data whose hash values agree on the specified bits in the same subgroup, and sorting and saving the original data within each subgroup according to their positions in the target data group;
associating the unspecified bits of the hash value of each original data item with that item, and saving them.
The second to n-th of the N hash groupings performed in sequence on the data segments in each target data group based on the preset grouping rule include, for the original data in any subgroup obtained from the previous hash grouping, where n does not exceed N and is a positive integer greater than 2:
based on the bits that were left unspecified in the previous hash grouping and that are associated with and saved for the original data in the current subgroup, placing the original data whose hash values agree on the bits specified for the current hash grouping in the same subgroup, and sorting and saving the original data within each subgroup according to their positions in the target data group;
saving again the remaining unspecified bits associated with each original data item.
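The first and subsequent hash groupings described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `hash_key`, `first_grouping` and `next_grouping`, the 32-bit multiplicative hash, and the choice of the low-order bits as the "specified" bits are all hypothetical.

```python
from collections import defaultdict

HASH_BITS = 32  # assumed hash width; the source does not fix one


def hash_key(key: int) -> int:
    """Illustrative multiplicative hash (an assumption, not the patent's function)."""
    return (key * 2654435761) & ((1 << HASH_BITS) - 1)


def first_grouping(items, bits):
    """First hash grouping over (position, key) pairs.

    The hash is computed exactly once per item; the low `bits` bits select
    the subgroup and the unused bits are saved next to the item.
    """
    mask = (1 << bits) - 1
    subgroups = defaultdict(list)
    for position, key in items:
        h = hash_key(key)
        subgroups[h & mask].append((position, key, h >> bits))
    for members in subgroups.values():
        members.sort(key=lambda m: m[0])  # order by position in the target data group
    return dict(subgroups)


def next_grouping(members, bits):
    """Later hash grouping of one subgroup, using only the saved unused bits."""
    mask = (1 << bits) - 1
    subgroups = defaultdict(list)
    for position, key, remaining in members:
        # No hash recomputation: consume the next `bits` of the saved remainder.
        subgroups[remaining & mask].append((position, key, remaining >> bits))
    for sub in subgroups.values():
        sub.sort(key=lambda m: m[0])
    return dict(subgroups)
```

Calling `first_grouping(items, 2)` and then `next_grouping(members, 2)` on each resulting subgroup gives two grouping passes while `hash_key` runs only once per original data item.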
The first kind of preset grouping rule involved in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, or a preset total number of subgroups S, or both a preset number of hash groupings N and a preset total number of subgroups S;
when the preset grouping rule is the preset number of hash groupings N, hash grouping is performed in sequence on the data segments in each target data group until N hash groupings have been completed;
when the preset grouping rule is the preset total number of subgroups S, hash grouping is performed in sequence on the data segments in each target data group until the number of subgroups of each target data group equals the preset total number of subgroups S;
when the preset grouping rule is the preset number of hash groupings N and the preset total number of subgroups S, and N has a higher priority than S, hash grouping is performed in sequence on the data segments in each target data group until N hash groupings have been completed;
when the preset grouping rule is the preset number of hash groupings N and the preset total number of subgroups S, and S has a higher priority than N, hash grouping is performed in sequence on the data segments in each target data group until the number of subgroups of each target data group equals the preset total number of subgroups S;
when the preset grouping rule is the preset number of hash groupings N and the preset total number of subgroups S, and N and S have the same priority, hash grouping is performed in sequence on the data segments in each target data group until N hash groupings have been completed and the number of subgroups of each target data group equals the preset total number of subgroups S;
where the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, and n does not exceed N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;
the priorities of the preset number of hash groupings N and the preset total number of subgroups S are determined by the storage size of the TLB and the size of the cache.
The second kind of preset grouping rule involved in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, a preset number of subgroups m for each hash grouping, and a preset total number of subgroups S; where the value of N is determined by the storage size of the TLB and is a positive integer greater than or equal to 1, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;
when hash grouping is performed in sequence on the data segments in each target data group, grouping follows the preset number of subgroups for each hash grouping, so that the final number of groupings equals the preset number of hash groupings and the total number of subgroups equals the preset total number of subgroups.
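The stopping conditions of the first kind of preset grouping rule can be summarized in a small predicate. This is a sketch only: the function name `grouping_finished`, its parameters, and the `priority` labels are hypothetical, and "at least" is used where the text says "equals" so that the check also terminates if a count is overshot.

```python
def grouping_finished(passes_done, subgroup_count, n=None, s=None, priority="equal"):
    """Return True when hash grouping should stop under the preset rule.

    n: preset number of hash groupings N (or None if not preset)
    s: preset total number of subgroups S (or None if not preset)
    priority: "N", "S" or "equal"; only consulted when both n and s are preset
    """
    if n is None and s is None:
        raise ValueError("at least one of N and S must be preset")
    n_met = n is not None and passes_done >= n
    s_met = s is not None and subgroup_count >= s
    if s is None:          # rule presets only N
        return n_met
    if n is None:          # rule presets only S
        return s_met
    if priority == "N":    # N outranks S
        return n_met
    if priority == "S":    # S outranks N
        return s_met
    return n_met and s_met  # equal priority: both conditions must hold
```

For example, with N = 3, S = 16 and equal priority, three completed groupings that produced only 8 subgroups do not yet satisfy the rule.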
In a second implementation of the first aspect of the embodiments of the present invention, dividing each target data group into multiple data segments using a vector as the unit of quantity includes:
using a vector as the unit of quantity, with one vector corresponding to one data segment, dividing each target data group in order into M data segments, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);
where the first to (M-1)-th data segments contain the same number of original data items, and the M-th data segment contains a number of original data items less than or equal to the number contained in each of the first to (M-1)-th data segments.
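The division into M data segments can be illustrated with a short sketch. The function name `split_into_segments` and the parameter `vector_size` (the number of original data items per vector) are hypothetical; how the vector size is derived from the cache and TLB sizes is not modeled here.

```python
def split_into_segments(target_data_group, vector_size):
    """Split a target data group, in order, into segments of `vector_size` items.

    Every segment except possibly the last holds exactly `vector_size` items;
    the last one holds the remainder, so it is never larger than the others.
    """
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")
    return [
        target_data_group[start:start + vector_size]
        for start in range(0, len(target_data_group), vector_size)
    ]
```

With ten original data items and a vector size of four, this yields M = 3 segments of sizes 4, 4 and 2.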
In a third implementation of the first aspect of the embodiments of the present invention, placing the original data whose hash values agree on the specified bits in the same subgroup, and sorting and saving the original data within each subgroup according to their positions in the target data group, includes:
finding the original data whose hash values agree on the bits specified for the current hash grouping, and placing these original data in the same subgroup, where the bits to be used by the current hash grouping are specified according to the size of the database cache and the storage size of the page table buffer (TLB);
traversing the subscripts of the original data placed in the same subgroup, where the subscript of each original data item identifies its position in the target data group;
arranging the original data corresponding to the subscripts in ascending order of subscript;
writing the original data into the same subgroup in that ascending order and saving them.
In the first implementation of the embodiments of the present invention, based on the bits that were left unspecified in the previous hash grouping and that are associated with and saved for the original data in the current subgroup, placing the original data whose hash values agree on the bits specified for the current hash grouping in the same subgroup, and sorting and saving the original data within each subgroup according to their positions in the target data group, includes:
retrieving the bits left unspecified in the previous hash grouping, which are saved at the location associated with each original data item in the subgroup currently being hash-grouped;
determining, from the retrieved unspecified bits, the bits to be used by the current hash grouping, where these bits are determined according to the size of the database cache and the storage size of the page table buffer (TLB);
finding the original data whose hash values agree on the bits specified for the current hash grouping, and placing these original data in the same subgroup;
traversing the subscripts of the original data placed in the same subgroup, where the subscript of each original data item identifies its position in the target data group;
arranging the original data corresponding to the subscripts in ascending order of subscript;
writing the original data into the same subgroup in that ascending order and saving them.
In the first implementation of the embodiments of the present invention, taking, in the sorted order, the original data in the subgroups obtained after the N hash groupings of the two target data groups to be joined, and performing the Join operation on them, includes:
obtaining, in order, the subgroups of each of the two target data groups to be joined after the N hash groupings;
performing the Join operation on the original data in the subgroups of the two target data groups, with the subgroups paired two by two for the Join operation;
pairing the subgroups two by two for the Join operation includes:
traversing, from one subgroup in one target data group, the subgroups in the other target data group in order;
if an identical subgroup is reached, joining the original data in the traversing subgroup, in order, with the original data in the identical subgroup, where an identical subgroup is one whose stored original data have the same hash value as the original data stored in the subgroup doing the traversal;
once all the original data in the subgroup have been joined, moving to the next subgroup and returning to the step of traversing the subgroups in the other target data group in order;
if no identical subgroup is reached, moving to the next subgroup and returning to the step of traversing the subgroups in the other target data group in order;
until every subgroup in the target data group has traversed the subgroups in the other target data group.
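The pairwise Join over subgroups can be sketched as below. This is an illustration under assumptions, not the patent's procedure verbatim: the name `join_subgroups` and the dictionary layout (subgroup identifier mapped to a list of `(position, key)` pairs) are hypothetical, and a dictionary lookup stands in for the sequential traversal of the other target data group's subgroups.

```python
def join_subgroups(left, right):
    """Join two partitioned target data groups subgroup by subgroup.

    left, right: dict mapping a subgroup id (the shared hash bits) to a list
    of (position, key) pairs. Returns (left_position, right_position, key)
    triples for every pair of items with equal keys.
    """
    result = []
    for gid in sorted(left):            # subgroups taken in ascending hash order
        partner = right.get(gid)        # the "identical subgroup", if any
        if partner is None:
            continue                    # no match: move on to the next subgroup
        for lpos, lkey in left[gid]:
            for rpos, rkey in partner:
                # Equal hash bits do not guarantee equal keys, so compare keys.
                if lkey == rkey:
                    result.append((lpos, rpos, lkey))
    return result
```

Only items that landed in subgroups with the same identifier are ever compared, which is what keeps each Join step small.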
A second aspect of the embodiments of the present invention provides a hash join device, applied to a database, including:
a receiving unit, configured to receive a Structured Query Language (SQL) statement that contains a Join operation, and to parse it to obtain at least two target data groups to be joined;
a dividing unit, configured to divide each target data group into multiple data segments, using a vector as the unit of quantity;
a grouping unit, configured to perform N hash groupings in sequence on the data segments in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit-represented hash values calculated for the original data in the data segment during the first hash grouping, the original data whose hash values agree on the bits specified for the current hash grouping are placed in the same subgroup, and the original data within each subgroup are sorted according to their positions in the target data group and saved, N being a positive integer greater than or equal to 1;
a sorting unit, configured to sort, within each target data group, the subgroups obtained after the N hash groupings in ascending order of the hash values corresponding to the original data contained in each subgroup;
a joining unit, configured to take, in the sorted order, the original data in the subgroups obtained after the N hash groupings of the two target data groups to be joined, and to perform the Join operation on them.
A third aspect of the embodiments of the present invention provides a database management system, applied to a database, including:
a memory having a storage medium, the memory storing a program for performing database queries;
a processor connected to the memory via a bus, where, when a database query is executed, the processor calls the database query program stored in the memory and executes it according to the hash join method provided by the first aspect of the embodiments of the present invention described above.
As can be seen from the above technical solutions, compared with the prior art, the embodiments of the present invention disclose a hash join method, device and database management system. When a database query is performed and the target data groups to be joined have been determined, the method groups the target data groups as follows: first, each target data group to be joined is divided into multiple data segments, using a vector as the unit of quantity; next, the hash values of the original data contained in each data segment are calculated and represented in bits; then, based on the preset grouping rule and on the bit-represented hash values calculated for each original data item during the first hash grouping, the original data whose hash values agree on the bits specified for the current hash grouping are placed in the same subgroup, and the original data within each subgroup are sorted according to their positions in the target data group and saved.
By using a vector as the unit of quantity and performing hash grouping on specified bits, the embodiments of the present invention can hash-group several original data items at the same time, and the hash values of the original data do not need to be recalculated across the multiple hash groupings, which both reduces cache misses and avoids wasting computing resources on repeated hash calculations.
Moreover, the original data placed in each subgroup by each grouping are ordered, so the original data in each subgroup obtained after grouping multiple data segments are locally ordered. When locally ordered original data are joined, the sorting complexity is lower than when randomly distributed original data are joined.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a hash join method disclosed in Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the three hash groupings disclosed in Example 4 of Embodiment 3 of the present invention;
FIG. 3 is a schematic diagram, disclosed in Embodiment 4 of the present invention, in which each data segment contains the same original data;
FIG. 4 is a flowchart of dividing subgroups in the first hash grouping, disclosed in Embodiment 4 of the present invention;
FIG. 5 is a schematic diagram of hash grouping the original data in one data segment, disclosed in Embodiment 4 of the present invention;
FIG. 6 is a flowchart of dividing subgroups in the second to N-th hash groupings, disclosed in Embodiment 4 of the present invention;
FIG. 7 is a flowchart of performing the Join operation on the original data in the subgroups of two target data groups to be joined, disclosed in Embodiment 4 of the present invention;
FIG. 8 is a schematic structural diagram of a hash join device disclosed in Embodiment 5 of the present invention;
FIG. 9 is a schematic structural diagram of a database management system disclosed in Embodiment 5 of the present invention.
DETAILED DESCRIPTION
For reference and clarity, the technical terms, shorthand and abbreviations used below are summarized as follows:
TLB: Translation Lookaside Buffer, a page table buffer; a TLB entry is a page table entry cached in the TLB;
Radix Join: clustered join;
cache miss: the requested data is not in the memory level being accessed.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As the background shows, in the query processing commonly used today, the multi-pass grouping used in the grouping phase has to process the original data one item at a time in every pass and has to calculate the hash values of the original data multiple times, which wastes computing resources. The embodiments of the present invention therefore provide a hash join method, device and database management system. By using a vector as the unit of quantity and by performing each hash grouping on the bits specified for that grouping, several original data items can be hash-grouped at the same time, and the hash values of the original data do not need to be recalculated across the multiple hash groupings, which both reduces cache misses and avoids wasting computing resources on repeated hash calculations. At the same time, the original data placed in each subgroup by each hash grouping are ordered, so the original data in each subgroup obtained after grouping multiple data segments are locally ordered; when locally ordered original data are joined, the sorting complexity is lower than when randomly distributed original data are joined. The specific process is described in detail through the following embodiments of the present invention.
Embodiment 1
Embodiment 1 of the present invention discloses a hash join method, applied to a database. Its flow is shown as steps S101 to S105 in FIG. 1, and the specific process includes:
Step S101: receive a Structured Query Language (SQL) statement that contains a Join operation, and parse it to obtain at least two target data groups to be joined;
In the course of executing a database query, step S101 is performed: the database parses the received SQL query statement containing a Join operation and obtains from it at least two target data groups to be joined. That is, the target data groups to be joined are parsed in pairs, so at least two of them appear during parsing.
Step S102: divide each target data group into a determined number of data segments, using a vector as the unit of quantity;
In step S102, the same operation is performed on both target data groups of the parsed pair; the division into data segments is described below for one target data group.
The current target data group is divided using a vector as the unit of quantity. Specifically, using the vector as the unit of quantity means taking the number of original data items contained in one vector as a fixed unit. The target data group is divided into multiple data segments with this unit, so that one data segment corresponds to one vector.
It should be noted that, in general, the maximum number of original data items that a data segment can hold is taken as one vector unit and the target data group is divided into multiple data segments accordingly, so the resulting data segments usually contain the same number of original data items. Alternatively, the number of original data items in one vector may be limited according to the preset grouping rule and the total number of original data items in the target data group, rather than by the maximum number the vector can hold.
Neither approach excludes the case in which the last data segment contains fewer original data items than the other data segments.
Based on the above, after step S102 is performed, each target data group to be joined is divided into multiple data segments. The embodiments of the present invention use a vectorized approach: in the subsequent hash grouping, with one vector as the unit of quantity, the hash values of the original data in that vector are calculated together, and the original data items belonging to the same subgroup are then written into that subgroup in one go, which reduces cache misses and improves Join performance.
Step S103: based on a preset grouping rule, perform N hash groupings in turn on the data segments in each target data group. In each hash grouping, based on the hash values, expressed in bits, that were calculated for the original data in the data segment during the first hash grouping, the original data whose hash values have the same value in the bits designated for the current hash grouping are placed in the same group; the original data placed in the same group are sorted within that group according to their positions in the target data group and then saved. N is a positive integer greater than or equal to 1;
In step S103, N hash groupings are performed in turn on the data segments in each target data group based on the preset grouping rule. In the first hash grouping, taking one target data group as an example, hash grouping proceeds from top to bottom, starting with the first data segment of the target data group and ending with the last. Taking one data segment as an example, during the first hash grouping the hash values of all original data in the data segment are calculated together, and each hash value is expressed in bits. The number of bits depends on the word size of the computer on which the database is installed, which is determined by the maximum addressing width of that computer's CPU.
For example, if the computer on which the database is installed is a 32-bit machine, the hash value calculated for each original data item in the first hash grouping is expressed in 32 bits. If the computer is a 64-bit machine, the hash value is expressed in 64 bits.
Then, according to the number of bits required by the current first hash grouping, that is, the designated bits, the hash values are compared, traversed, or searched on the designated bits to find hash values with the same value in those bits, and the original data corresponding to those hash values are placed in the same group. For example, if the first hash grouping requires 2 bits, the two bits starting from the most significant bit of each hash value are compared, traversed, or searched, and the original data whose hash values agree in those two bits are placed in the same group.
Finally, the original data placed in the same group are sorted within that group according to their positions in the target data group; this position can also be regarded as the position of each original data item in its data segment. For example, suppose original data A, B, and C are placed in the same group, where A is 3rd in the target data group, B is 1st, and C is 6th. After sorting, the actual storage order within the group is B, A, C.
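The first hash grouping described above can be sketched roughly as follows. This is a minimal illustration only: `HASH_BITS`, `toy_hash`, and `first_pass` are hypothetical names, and the hash function is a stand-in, since the text does not specify one.

```python
HASH_BITS = 32  # assumed hash width; the text ties it to the machine word size


def toy_hash(key: int) -> int:
    """Stand-in hash function (an assumption; the text does not name one)."""
    return (key * 2654435761) & ((1 << HASH_BITS) - 1)


def first_pass(segment, bits_used=2):
    """Group the (position, key) pairs of one data segment.

    Items whose hash values agree in the top `bits_used` bits share a group.
    Returns {top_bits: [(position, key, hash), ...]}, with every group ordered
    by the item's position in the target data group.
    """
    groups = {}
    for position, key in segment:
        h = toy_hash(key)
        top = h >> (HASH_BITS - bits_used)  # designated most significant bits
        groups.setdefault(top, []).append((position, key, h))
    for members in groups.values():
        members.sort(key=lambda item: item[0])  # order by original position
    return groups
```

Hashing the whole segment before writing anything to a group mirrors the vectorized processing the embodiment describes; a real implementation would presumably batch the writes per group as well.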
It should be noted that the first hash grouping is performed in the same way on every data segment from top to bottom, and that, beginning with the first hash grouping, each hash grouping designates its bits starting from the most significant bit that has not yet been designated. Across the N hash groupings, the hash values of the original data need to be calculated only in the first hash grouping. Each subsequent hash grouping uses only the not-yet-designated bits of each hash value: the original data whose hash values have the same value in the bits used by the current hash grouping are placed in the same group and, as in the first hash grouping, the original data in the same group are sorted within that group according to their positions in the target data group or data segment.
The preset grouping rule mentioned in step S103 is one of the following: a preset number of hash groupings N; a preset total number of groups S; a preset number of hash groupings N together with a preset total number of groups S; or a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S. The value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1; m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.
Step S104: for the groups obtained from each target data group after the N hash groupings, sort the groups within the target data group in ascending order of the hash values corresponding to the original data they contain;
In step S104, the groups obtained in the target data group after the N hash groupings under the preset grouping rule are sorted. The groups are ordered by the hash value of the original data each contains. For example, suppose grouping a target data group yields group 1, group 2, and group 3, where the original data in group 1 have hash value 3, those in group 2 have hash value 5, and those in group 3 have hash value 0. After sorting, the order of the groups in the target data group is group 3, group 1, group 2.
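The group ordering in step S104 can be illustrated with the example values given above. This is a small sketch; `order_groups` and the dictionary layout are hypothetical choices made for illustration.

```python
def order_groups(groups):
    """Return group names in ascending order of the hash value each group holds.

    `groups` maps a group name to the hash value shared by its original data.
    """
    return sorted(groups, key=groups.get)


groups = {"group 1": 3, "group 2": 5, "group 3": 0}
print(order_groups(groups))  # ['group 3', 'group 1', 'group 2']
```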
It should be noted that, among the groups obtained after the N hash groupings under the preset grouping rule, the original data that end up in the same group usually correspond to the same hash value.
Step S105: in the sorted order, take the original data in the groups obtained after the N hash groupings from the two target data groups to be joined, and perform the Join operation.
For the two target data groups to be joined, whose original data have been sorted within each group during the hash grouping of steps S102 to S104, step S105 is performed. Using the ordered groups of each target data group, a group from one target data group is joined, in order, with a group from the other target data group; that is, the Join operation is performed on the ordered original data in the groups. This completes the database query.
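One way to picture step S105 is a merge-style scan over the two ordered lists of groups. The sketch below assumes that groups are paired by equal hash value and that each group holds a single hash value; the text only says that groups are taken in order, so the pairing logic and the name `join_groups` are assumptions.

```python
def join_groups(left, right):
    """Join two lists of (hash_value, [keys]) groups, each sorted by hash_value.

    Returns (left_key, right_key) pairs for equal keys found in groups that
    share the same hash value.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lh, lkeys = left[i]
        rh, rkeys = right[j]
        if lh < rh:
            i += 1          # no partner group on the right for this hash value
        elif lh > rh:
            j += 1          # no partner group on the left for this hash value
        else:
            # Same hash value: compare the actual keys inside the two groups.
            out.extend((a, b) for a in lkeys for b in rkeys if a == b)
            i += 1
            j += 1
    return out
```

Because both sides are already ordered by hash value, each group is visited once and only groups with matching hash values are compared key by key.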
In the prior art, because the number of hardware TLB entries is greater than the number of cache ways, grouping with hash values calculated one at a time easily causes heavy cache thrashing and therefore many cache misses, which degrades join performance. With the hash join method disclosed in Embodiment 1, hash values are calculated a batch at a time with one vector as the unit, and the hash values corresponding to the original data belonging to the same group are then written to that group in one operation. Hash grouping in vector form avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance. In addition, the hash value of each original data item is calculated only in the first grouping; the bits to be used later are recorded at a location associated with the corresponding original data item and used directly in subsequent groupings, which avoids the cost of recalculating hash values and the associated waste of resources.
Furthermore, in the hash grouping of the hash join disclosed in Embodiment 1, after each hash grouping and before the original data are written to their groups, the original data within each group are sorted. As a result, when the groups receive their final sort after the last grouping, the original data have already been partially sorted during the multi-pass grouping disclosed in the embodiment, and the original data within each group are locally ordered, so only the groups themselves need to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping completes, this greatly reduces sorting complexity and the time spent sorting. Moreover, joining such locally ordered original data has lower sorting complexity than joining randomly distributed original data.
Embodiment 2
Based on the hash join method disclosed in Embodiment 1, Embodiment 2 describes in detail the N hash groupings mentioned in step S103 shown in FIG. 1.
The first of the N hash groupings performed in turn, based on the preset grouping rule, on the data segments in each target data group includes:
Step S1031: calculate the hash values of the original data contained in the current data segment, and express the calculated hash values in bits;
Given that step S102 has divided the target data group using one vector as the unit, take any data segment of the target data group as an example. In step S1031, the hash values of all original data in the same data segment are calculated together, and each hash value is expressed in bits. As described in Embodiment 1, the number of bits depends on the word size of the computer on which the database is installed, which is determined by the maximum addressing width of that computer's CPU.
Step S1032: place the original data whose hash values have the same value in the bits designated for the current hash grouping in the same group, sort the original data in the same group within that group according to their positions in the target data group, and save them;
In step S1032, during the current first hash grouping, the bits to be designated for the current hash grouping are determined according to the size of the data cache and the storage size of the translation lookaside buffer (TLB). For the hash values, expressed in bits, of the original data in the data segment, the original data whose hash values have the same value in the designated bits are placed in the same group.
For example, if the current grouping requires two bits, two bits are designated in each hash value, starting from the most significant bit and moving toward the least significant bit. When the groups are formed, the original data whose hash values have the same designated first two bits are placed in the same group.
At the same time, once it is known from the designated bits which original data belong in the same group, the original data are sorted within that group according to their positions in the target data group. For example, suppose a group contains original data A, B, and C, where A is 6th in the target data group, B is 1st, and C is 4th. The saved order of the original data in the group obtained after step S1033 is then B, C, A, so that the original data in every group produced by each grouping are ordered.
Step S1033: associate the not-yet-designated bits of the hash value of each original data item with that item, and save them at a location associated with the item;
Following step S1032, step S1033 is performed after the groups are formed: the bits of each original data item's hash value that were not used, that is, not designated, in this hash grouping are saved at a location associated with the item. The associated location may be storage space adjacent to the original data item, or other storage space associated with it.
After the first hash grouping has been performed on every data segment of the target data group, grouping stops if the preset grouping rule is satisfied. Otherwise, the original data in each group produced by the first hash grouping are grouped again.
In the 2nd to n-th hash groupings, the original data in any group obtained from the previous hash grouping are hash grouped, where n is a positive integer greater than 2 and falls within N. The 2nd to n-th of the N hash groupings performed in turn, based on the preset grouping rule, on the data segments in each target data group include:
Step S1034: based on the bits not designated in the previous hash grouping, which are saved at the locations associated with the original data in the current group, place the original data whose hash values have the same value in the bits designated for the current hash grouping in the same group, sort the original data in the same group according to their positions in the target data group, and save them;
In step S1034, from the bits saved at the location associated with each original data item, the bits needed by the current hash grouping are designated, and the original data whose hash values have the same value in the designated bits are placed in the same group. At the same time, once it is known from the designated bits which original data belong in the same group, the original data are sorted within that group according to their positions in the target data group.
Step S1035: save the remaining not-yet-designated bits associated with each original data item at the location associated with that item again;
In step S1035, the remaining not-yet-designated bits are saved again at the location associated with the original data item for use in subsequent groupings. Continuing the example in step S1032, the bits currently saved at the associated location are the unused bits left after step S1032. If the current hash grouping again requires two bits, the two designated bits are those starting from the most significant of the remaining bits and moving toward the least significant bit.
After steps S1034 and S1035, if the current grouping does not satisfy the preset grouping rule, steps S1034 and S1035 are repeated until the preset grouping rule is satisfied, at which point grouping of the current target data group stops.
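The loop over steps S1034 and S1035 can be sketched as repeated splitting on the leading bits that remain stored with each item. This is a simplified illustration: the tuple layout `(position, key, remaining_bits, remaining_width)` and the names `split_once` and `multi_pass` are assumptions, and the stopping condition is reduced to a fixed number of passes.

```python
def split_once(group, bits_used):
    """Split one group on the top `bits_used` of each item's remaining bits.

    Each item is (position, key, remaining_bits, remaining_width). The consumed
    bits are dropped and the rest is stored back with the item, so that no hash
    value has to be recomputed in later passes.
    """
    parts = {}
    for position, key, rem, width in group:
        shift = width - bits_used
        top = rem >> shift                 # bits designated for this pass
        rest = rem & ((1 << shift) - 1)    # bits kept for later passes
        parts.setdefault(top, []).append((position, key, rest, shift))
    # Tuples sort by position first, so each new group stays position-ordered.
    return [sorted(parts[t]) for t in sorted(parts)]


def multi_pass(items, passes, bits_used=1):
    """Apply `passes` rounds of splitting, starting from a single group."""
    groups = [list(items)]
    for _ in range(passes):
        groups = [part for g in groups for part in split_once(g, bits_used)]
    return groups
```

Since every pass consumes the most significant remaining bits and emits its parts in ascending order of those bits, the groups in this sketch come out ordered by hash prefix.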
Through steps S1031 to S1035, the target data group is grouped until the preset grouping rule is satisfied, and in every grouping the original data placed in the same group are sorted. The result of each hash grouping is therefore unordered as a whole but ordered within each group. Joining such locally ordered original data has lower sorting complexity than joining randomly distributed original data.
As specifically disclosed in Embodiment 2, the hash value of each original data item is calculated only in the first grouping, and the bits to be used later are recorded at a location associated with the item and used directly in subsequent groupings, which avoids the cost of recalculating hash values and the associated waste of resources. In addition, after each hash grouping and before the original data are written to their groups, the original data within each group are sorted, so that after the last hash grouping the original data within each group are locally ordered and only the groups obtained from the target data group need to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping completes, this greatly reduces sorting complexity and the time spent sorting.
Embodiment 3
Based on the hash join method disclosed in Embodiments 1 and 2, Embodiment 3 describes in detail the preset grouping rule mentioned in step S103 shown in FIG. 1.
When the preset grouping rule is a preset number of hash groupings N, the data segments in each target data group are hash grouped in turn, and grouping of the target data group stops once N hash groupings have been completed. The value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1.
Example 1: the storage size of the TLB determines that the target data group currently being hash grouped is to be grouped 4 times, that is, N is 4. After the first grouping, the 2nd to n-th hash groupings are performed as disclosed in Embodiment 1, each on the original data in any group obtained from the previous hash grouping, and hash grouping of the target data group stops after the 4th grouping. The number of groups obtained at that point is the number of groups of the target data group.
When the preset grouping rule is a preset total number of groups S, the data segments in each target data group are hash grouped in turn until the total number of groups of each target data group equals S, at which point grouping of that target data group stops. The value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.
Example 2: the size of the database cache determines that the preset total number of groups for the current target data group is 10. The first hash grouping is performed on the current target data group; if the number of groups obtained after it is less than 10, hash grouping continues until the number of groups of the current target data group reaches 10, and then stops.
When the preset grouping rule is a preset number of hash groupings N and a preset total number of groups S, and N has higher priority than S, the data segments in each target data group are hash grouped in turn until N hash groupings have been completed;
When the preset grouping rule is a preset number of hash groupings N and a preset total number of groups S, and S has higher priority than N, the data segments in each target data group are hash grouped in turn until the number of groups of each target data group equals S;
When the preset grouping rule is a preset number of hash groupings N and a preset total number of groups S, and N and S have the same priority, the data segments in each target data group are hash grouped in turn until N hash groupings have been completed and the number of groups of each target data group equals S;
The priorities of the preset number of hash groupings N and the preset total number of groups S are determined by the storage size of the TLB and the size of the cache.
Example 3: the storage size of the TLB determines a preset number of groupings of 3, and the size of the database cache determines a preset total number of groups of 16. When N and S have the same priority, the total number of groups obtained after grouping the target data group 3 times is exactly 16. When N has higher priority than S, the total number of groups obtained after grouping the target data group 3 times may be less than, equal to, or greater than 16. When S has higher priority than N, the number of groupings performed on the target data group by the time the total number of groups reaches 16 may be greater than, less than, or equal to 3.
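The stopping conditions for the rules involving N and S can be summarized as a small predicate. This is a sketch under assumptions: `should_stop` and its parameters are hypothetical names, and "reaching" S is modeled as `>=` because a grouping pass may overshoot the preset total, whereas the text states the condition as equality.

```python
def should_stop(passes_done, group_count, n=None, s=None, priority="equal"):
    """Decide whether grouping of a target data group may stop.

    n: preset number of hash groupings N (None if not preset).
    s: preset total number of groups S (None if not preset).
    priority: "n", "s", or "equal"; only consulted when both are preset.
    """
    n_met = n is not None and passes_done >= n
    s_met = s is not None and group_count >= s
    if n is None:
        return s_met            # only S is preset
    if s is None:
        return n_met            # only N is preset
    if priority == "n":
        return n_met            # N outranks S
    if priority == "s":
        return s_met            # S outranks N
    return n_met and s_met      # equal priority: both must hold
```

With the values of Example 3 (N = 3, S = 16), three passes that have produced only 12 groups would stop under N priority but continue under S priority.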
When the preset grouping rule includes a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S, where N is determined by the storage size of the TLB and is a positive integer greater than or equal to 1, m is less than N, and S is determined by the size of the database cache and is a positive integer greater than or equal to 2, the data segments in each target data group are hash grouped in turn with m groups per hash grouping, so that the final number of groupings equals N and the total number of groups equals S.
Example 4: as shown in FIG. 2, the storage size of the TLB determines a preset number of groupings of 3, the number of groups for each hash grouping is 2, and the size of the database cache determines a preset total number of groups of 16. The target data group is divided, with a vector as the unit, into 2 data segments. In the first hash grouping, each data segment is divided into 2 groups and written to the corresponding groups; in the second hash grouping, each group from the previous grouping is again divided into 2 and written to the corresponding groups; and so on, until 3 hash groupings have been performed on the target data group and 16 groups are obtained.
Embodiment 3 mainly describes the preset grouping rule on which the hash grouping mentioned in step S103 shown in FIG. 1 is based. The preset grouping rule is determined mainly by the storage size of the TLB of the computer on which the database is installed and by the size of the database cache. Grouping under this rule avoids cache misses during grouping and thereby improves the performance of the subsequent Join.
Embodiment 4
Based on the hash join method disclosed in Embodiments 1 to 3 of the present invention, for step S102 shown in FIG. 1, dividing each target data group into multiple data segments using the vector as the unit of quantity specifically includes:
Using the vector as the unit of quantity, with one vector corresponding to one data segment, dividing each target data group in order into M data segments, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the TLB;
The 1st to (M-1)th data segments contain the same number of original data items, and the Mth data segment contains a number of original data items less than or equal to that of each of the 1st to (M-1)th data segments.
Suppose the target data group to be hash-grouped contains 25 original data items in total, and the vector used as the unit of quantity holds 5 original data items, so that 5 original data items form one data segment. Dividing the 25-item target data group by this unit yields 5 data segments. The 1st to 5th data segments all contain the same number of original data items; FIG. 3 shows this case, in which every data segment contains the same number of items.
Suppose instead that the target data group to be hash-grouped contains 28 original data items in total, with the same vector of 5 original data items as the unit of quantity. Dividing the 28-item target data group by this unit yields 6 data segments. The 1st to 5th data segments contain the same number of original data items, and the 6th data segment contains 3 original data items, fewer than each of the 1st to 5th data segments.
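The two cases above can be reproduced with a minimal sketch. The function name `split_into_segments` and the explicit `vector_size` parameter are assumptions made for illustration; in the text, the segment count M follows from the number of items and from the cache and TLB sizes rather than from a parameter passed by the caller.

```python
def split_into_segments(target_group, vector_size):
    """Split a target data group into consecutive segments of `vector_size` items.

    The first M-1 segments hold exactly `vector_size` items; the last segment
    holds the remainder (or a full vector when the length divides evenly).
    """
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")
    return [target_group[i:i + vector_size]
            for i in range(0, len(target_group), vector_size)]


# 25 items with a vector of 5 -> five equally sized segments.
segments_25 = split_into_segments(list(range(25)), 5)
# 28 items with a vector of 5 -> five full segments and one segment of 3.
segments_28 = split_into_segments(list(range(28)), 5)
print([len(s) for s in segments_25])
print([len(s) for s in segments_28])
```

Slicing keeps the items in their original order, which matters later because the grouping steps rely on each item's position in the target data group.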
Based on the hash join method disclosed in Embodiment 2 of the present invention, the step disclosed above, in which original data items are placed in the same group and the items within each group are sorted and saved according to their positions in the target data group, proceeds as shown in FIG. 4 and includes:
Step S201: obtain the hash value of each original data item in the data segment currently being hash-grouped;
Step S202: find the original data items whose hash values have the same value in the bit positions designated for the current round of hash grouping, and place those items in the same group;
The hash value of each original data item in the data segment currently being hash-grouped is obtained in step S201 and is expressed in bits. Step S202 examines the hash values at the designated bit positions. The designated bit positions may be specified before the current round of grouping, according to the size of the database cache and the storage size of the TLB. Alternatively, when a request for hash grouping is received, the bit positions needed by all subsequent rounds may be specified at once according to the same two sizes; the current round then needs no further designation and simply looks up the bit positions assigned to it.
Step S203: traverse the subscripts of the original data items to be placed in the same group, where the subscript of an original data item identifies its position in the target data group;
Step S204: arrange the original data items corresponding to the subscripts in ascending order of subscript;
Step S205: write the original data items into the same group in that ascending order and save them.
Steps S203 to S205 sort the original data items destined for the same group during grouping and write them into that group, so that the target data group becomes locally ordered as it is grouped. For example, during hash grouping, the hash values of all the original data items in one segment of data, sized in units of the vector (the dashed box in FIG. 5), are computed together. In FIG. 5, value is the actual value that takes part in the join, position is the position of each original data item within the whole data segment, position-1 holds the subscripts of the original data items that have been sorted into the same group, and hash value is the hash value of the corresponding original data item.
During grouping, the hash values that have the same value in the designated bit positions are traversed and their subscripts are saved to the group corresponding to position-1; the subscripts saved in position-1 are then traversed in order, and the original data item corresponding to each subscript is written to the corresponding group.
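A minimal sketch of the first round of grouping described in steps S201 to S205 follows. Several details are assumptions made for illustration: `toy_hash` is a hypothetical hash function (the text does not specify one), the designated bit positions are taken to be the lowest `bits` bits, and each stored record is a `(subscript, value, remaining_bits)` tuple so that the undesignated bits stay associated with the item.

```python
def toy_hash(value):
    # Hypothetical hash function for illustration only.
    return (value * 2654435761) & 0xFFFFFFFF


def first_round_grouping(segments, bits, hash_fn=toy_hash):
    """Group items by the low `bits` bits of their hash, one vector at a time."""
    mask = (1 << bits) - 1
    groups = {g: [] for g in range(1 << bits)}
    position = 0  # subscript of the first item of the current segment
    for segment in segments:
        # S201: hash every item of the vector in one pass.
        hashes = [hash_fn(v) for v in segment]
        # S202/S203: per group, collect the subscripts of its items
        # (the role played by "position-1" in FIG. 5).
        subscripts = {}
        for offset, h in enumerate(hashes):
            subscripts.setdefault(h & mask, []).append(position + offset)
        # S204/S205: write items to their group in ascending subscript order.
        for group_id, subs in subscripts.items():
            for sub in sorted(subs):
                offset = sub - position
                remaining_bits = hashes[offset] >> bits  # kept for later rounds
                groups[group_id].append((sub, segment[offset], remaining_bits))
        position += len(segment)
    return groups


# Identity hash makes the grouping easy to follow by hand.
demo = first_round_grouping([[10, 7, 4, 13, 2], [9, 6, 3]], bits=1,
                            hash_fn=lambda v: v)
for group_id, records in demo.items():
    print(group_id, records)
```

Because the segments are processed in order and each segment's subscripts are written in ascending order, every group ends up ordered by position in the target data group without a separate sorting pass.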
Through steps S203 to S205, the original data items to be written into the current group are sorted at the same time as they are written. After one vector has been grouped in this way, the next adjacent vector is processed likewise, until every vector in the target data group has completed the current round of hash grouping. The result of the first round of hash grouping is a set of groups that are each locally ordered, which relieves the final sort over the groups of the burden of also sorting the original data items inside them and so reduces the complexity of grouping.
After every data segment in the target data group being grouped has completed the first round of hash grouping as described above, grouping stops if the current grouping satisfies the preset grouping rule. Otherwise, the original data items in each group produced by the first round are grouped again.
Based on the hash join method disclosed in Embodiment 2 of the present invention, for step S1034 disclosed above, the following is performed: based on the bits left undesignated by the previous round of hash grouping and saved at the location associated with each original data item in the current group, the original data items whose hash values have the same value in the bit positions designated for the current round are placed in the same group, and the items in each group are sorted and saved according to their positions in the target data group. The specific process, shown in FIG. 6, includes:
Step S301: retrieve the bits left undesignated by the previous round of hash grouping, which are saved at the location associated with each original data item in the group currently being hash-grouped;
In step S301, the current group is any one of the groups obtained from the previous round of hash grouping. The undesignated bits saved at the locations associated with its original data items are retrieved so that the current group can be hash-grouped again.
Step S302: from the retrieved undesignated bits, determine the bit positions needed by the current round of hash grouping, where those bit positions are determined by the size of the database cache and the storage size of the TLB;
Step S303: find the original data items whose hash values have the same value in the bit positions designated for the current round of hash grouping, and place those items in the same group;
Step S304: traverse the subscripts of the original data items to be placed in the same group, where the subscript of an original data item identifies its position in the target data group;
Step S305: arrange the original data items corresponding to the subscripts in ascending order of subscript;
Step S306: write the original data items into the same group in that ascending order and save them.
The sorting of the original data items to be placed in the same group in steps S304 to S306 is the same as in steps S203 to S205 of FIG. 4; refer to that description, which is not repeated here.
Steps S301 to S306 are performed on each group obtained from the previous round of hash grouping of the target data group, producing new groups whose original data items are internally ordered. As before, after each round of hash grouping, grouping stops if the current grouping satisfies the preset grouping rule. Otherwise, steps S301 to S303 are performed again on each group obtained from the previous round, until the preset grouping rule is satisfied.
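The later rounds (steps S301 to S306) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each record is a hypothetical `(subscript, value, remaining_bits)` tuple, where `remaining_bits` holds the hash bits that earlier rounds left undesignated, and each round consumes the lowest `bits` of them. No hash value is recomputed.

```python
def regroup(group, bits):
    """Split one group from the previous round using its saved hash bits.

    `group` is a list of (subscript, value, remaining_bits) records.
    """
    mask = (1 << bits) - 1
    subgroups = {g: [] for g in range(1 << bits)}
    # S304-S306: visit records in ascending subscript order so that every
    # subgroup stays ordered by position in the target data group.
    for subscript, value, remaining in sorted(group):
        # S301-S303: the saved bits decide the subgroup; the bits that are
        # still unused are stored again for any further round.
        subgroups[remaining & mask].append((subscript, value, remaining >> bits))
    return subgroups


def group_further(groups, bits, extra_rounds):
    """Apply `extra_rounds` more rounds to every group; returns a flat list."""
    current = list(groups)
    for _ in range(extra_rounds):
        next_round = []
        for g in current:
            next_round.extend(regroup(g, bits).values())
        current = next_round
    return current


first_round_group = [(0, 10, 5), (2, 4, 2), (4, 2, 1), (6, 6, 3)]
print(regroup(first_round_group, bits=1))
print(len(group_further([first_round_group], bits=1, extra_rounds=2)))
```

Storing the shifted remainder with each record is one way to realize "saving the undesignated bits at the associated location"; a real engine would more likely keep these bits in a parallel array next to the column data.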
Based on the hash join method disclosed in Embodiments 1 to 3 of the present invention, for step S105 disclosed above, taking in sorted order the original data items from the groups obtained after N rounds of hash grouping in the two target data groups to be joined, and performing the Join operation on them, specifically includes:
Step S501: obtain, in order, the groups of each of the two target data groups to be joined after N rounds of hash grouping;
After at least two target data groups to be joined have been hash-grouped according to steps S102 to S104, step S501 is performed to obtain the groups in the two target data groups to be joined.
Step S502: perform the Join operation on the original data items in the groups of the two target data groups, pairing the groups two by two;
For the two target data groups to be joined, after hash grouping, the pairwise Join of the original data items in their groups proceeds as shown in FIG. 7 and includes:
Step S503: starting from one group in one target data group, traverse the groups in the other target data group in order;
Step S504: determine whether the current group has found the same group while traversing the other target data group; if so, perform step S505; if not, perform step S507;
Step S505: when the same group is found, join the original data items in the current group, in order, with the original data items in that same group, where "the same group" means a group whose stored original data items have the same hash value as those stored in the group used for the traversal;
Step S506: determine whether all the original data items in either of the two groups currently being joined have undergone the Join operation; if so, perform step S507; if not, continue the Join operation on the original data items in the two groups and return to step S506;
Step S507: move to the next group and return to step S503;
Steps S503 to S507 are repeated until every group in the target data group has traversed the groups of the other target data group.
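The loop formed by steps S503 to S507 can be sketched as below. The data layout is an assumption made for illustration: each side is a list of `(hash_key, items)` pairs already sorted by `hash_key`, and each item is a `(join_value, payload)` tuple. Matching items inside a pair of groups by equality of `join_value` is also an assumption (an equi-join); the text only states that groups with the same hash value are joined.

```python
def join_groups(groups_a, groups_b):
    """Join two lists of (hash_key, items) pairs, group by group."""
    results = []
    for key_a, items_a in groups_a:               # S503: take one group of A
        for key_b, items_b in groups_b:           # and traverse the groups of B
            if key_b != key_a:                    # S504: not the same group
                continue
            for value_a, payload_a in items_a:    # S505/S506: join the items
                for value_b, payload_b in items_b:
                    if value_a == value_b:
                        results.append((value_a, payload_a, payload_b))
            break                                 # S507: next group of A
    return results


groups_a = [(0, [(4, "a4"), (2, "a2")]), (1, [(7, "a7")])]
groups_b = [(0, [(2, "b2"), (6, "b6")]), (1, [(7, "b7"), (7, "b7x")])]
print(join_groups(groups_a, groups_b))
```

Since both sides are sorted by hash key, the inner traversal could resume from the last matched group instead of restarting, in the manner of a merge join; the sketch keeps the plain traversal so that it maps one-to-one onto the numbered steps.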
The above is the process performed for grouping and joining in the hash join of the embodiments of the present invention. With the vector as the unit of quantity, hash values are computed only in the first round of grouping, for all the original data items in each vector at once, and the hash values corresponding to the original data items that belong to the same group are then written to that group in a single pass. The bits needed by later rounds are recorded at the location associated with each original data item so that later rounds can use them directly, which removes the cost of recomputing hash values and avoids wasting resources.
In addition, in the embodiments of the present invention, in each round of hash grouping the original data items within each group are sorted before they are written to that group, and the groups themselves are sorted once the round completes. When the final sort over the groups is performed after the last round, the burden of sorting the groups and the data inside them is therefore reduced, and less time is spent on sorting.
Embodiment 5
Corresponding to the hash join method disclosed and described in detail in Embodiments 1 to 4 of the present invention, a specific embodiment of the apparatus is described in detail below.
As shown in FIG. 8, the hash join apparatus is applied to a database and mainly includes a receiving unit 101, a dividing unit 102, a grouping unit 103, a sorting unit 104, and a joining unit 105.
The receiving unit 101 is configured to receive a Structured Query Language (SQL) statement that contains a Join operation, and to parse it to obtain at least two target data groups to be joined;
After the receiving unit 101 has run, each target data group obtained by parsing passes through the dividing unit 102, the grouping unit 103, and the sorting unit 104 for division, grouping, and sorting, and then enters the joining unit 105, where the two grouped target data groups to be joined undergo the Join operation.
The dividing unit 102 is configured to divide each target data group into multiple data segments using the vector as the unit of quantity;
The grouping unit 103 is configured to perform N rounds of hash grouping, based on a preset grouping rule, on the data segments in each target data group in turn. In each round, based on the bit-expressed hash values computed from the original data items of the data segment in the first round, the original data items whose hash values have the same value in the bit positions designated for the current round are placed in the same group, and the items in each group are sorted and saved according to their positions in the target data group; N is a positive integer greater than or equal to 1;
The sorting unit 104 is configured to sort, within each target data group, the groups obtained after N rounds of hash grouping in ascending order of the hash values of the original data items they contain;
The joining unit 105 is configured to take, in sorted order, the original data items from the groups obtained after N rounds of hash grouping in the two target data groups to be joined, and to perform the Join operation on them.
The grouping unit 103 includes a first-round hash grouping module 1031, which performs the first round of hash grouping on the data segments in the target data group from top to bottom, and a multi-round hash grouping module 1032, which performs the 2nd to nth rounds of hash grouping on the original data items in any group obtained from the previous round, where n is a positive integer greater than 2;
The first-round hash grouping module 1031 is configured to compute the hash values of the original data items in the current data segment and express them in bits; to place the original data items whose hash values have the same value in the designated bit positions in the same group, and to sort and save the items in each group according to their positions in the target data group; and to associate the undesignated bits of each original data item's hash value with that item and save them;
The multi-round hash grouping module 1032 is configured to place in the same group, based on the bits left undesignated by the previous round and saved in association with the original data items of the current group, the original data items whose hash values have the same value in the bit positions designated for the current round; to sort and save the items in each group according to their positions in the target data group; and to save again the bits that remain undesignated for each original data item.
For the specific process and the principle of operation, refer to Embodiments 1 and 2 of the present invention; they are not repeated here. Note that what the grouping unit 103 does differs according to the preset grouping rule in use.
When the preset grouping rule is the preset number of hash grouping rounds N, the grouping unit is configured to hash-group the data segments in each target data group in turn until N rounds of hash grouping are complete;
When the preset grouping rule is the preset total number of groups S, the grouping unit is configured to hash-group the data segments in each target data group in turn until the number of groups in each target data group equals the preset total;
When the preset grouping rule is both N and S, and N has higher priority than S, the grouping unit is configured to hash-group the data segments in each target data group in turn until N rounds of hash grouping are complete;
When the preset grouping rule is both N and S, and S has higher priority than N, the grouping unit is configured to hash-group the data segments in each target data group in turn until the number of groups in each target data group equals the preset total S;
When the preset grouping rule is both N and S, and N and S have the same priority, the grouping unit is configured to hash-group the data segments in each target data group in turn until N rounds of hash grouping are complete and the number of groups in each target data group equals the preset total S;
When the preset grouping rule includes the preset number of rounds N, the preset number of groups m produced by each round, and the preset total number of groups S, the grouping unit is configured to split according to the preset per-round group count, so that the final number of rounds equals the preset number of rounds and the total number of resulting groups equals the preset total;
Here the value of N is determined by the storage size of the TLB and is a positive integer greater than or equal to 1, N includes n, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2; the relative priority of the preset number of rounds N and the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.
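The stopping conditions listed above can be condensed into one predicate. This is a sketch under assumptions: the function name, the `priority` argument with values `"n"`, `"s"`, and `"equal"`, and the use of `>=` instead of strict equality (so the check stays safe if a round overshoots) are all illustrative choices not taken from the text.

```python
def should_stop(rounds_done, group_count, n=None, s=None, priority="equal"):
    """Return True when the preset grouping rule is satisfied.

    n: preset number of rounds N (None if the rule does not use it).
    s: preset total number of groups S (None if the rule does not use it).
    priority: "n", "s", or "equal"; consulted only when both n and s are set.
    """
    if n is None and s is None:
        raise ValueError("at least one of n or s is required")
    rounds_ok = n is not None and rounds_done >= n
    groups_ok = s is not None and group_count >= s
    if s is None or (n is not None and priority == "n"):
        return rounds_ok            # rule is N alone, or N outranks S
    if n is None or priority == "s":
        return groups_ok            # rule is S alone, or S outranks N
    return rounds_ok and groups_ok  # equal priority: both must hold


print(should_stop(3, 8, n=3))                       # rule is N only
print(should_stop(3, 8, n=3, s=16, priority="s"))   # S outranks N
```

A grouping loop would call this predicate after each round and run another round of regrouping whenever it returns False.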
For examples of the different preset grouping rules applied by the grouping unit 103, refer to the examples given in Embodiment 3 of the present invention; they are not repeated here.
Note that the operation and principle of the dividing unit 102 shown in FIG. 8 are the same as the description in Embodiment 4 of "dividing each target data group into multiple data segments using the vector as the unit of quantity" and are not repeated here. The dividing unit mainly includes:
A first dividing module, configured to divide each target data group in order into M data segments, using the vector as the unit of quantity with one vector corresponding to one data segment, where the value of M is determined by the number of original data items in the target data group, the size of the database cache, and the storage size of the TLB;
The 1st to (M-1)th data segments contain the same number of original data items, and the Mth data segment contains a number of original data items less than or equal to that of each of the 1st to (M-1)th data segments.
Note that the first-round hash grouping module 1031 places the original data items whose hash values have the same value in the designated bit positions in the same group, and sorts and saves the items in each group according to their positions in the target data group. For its specific operation and principle, refer to the detailed description of the first round of hash grouping in Embodiment 3 of the present invention, which is not repeated here. The module mainly includes:
A submodule for obtaining the hash value, expressed in bits, of each original data item;
A first search submodule, configured to find the original data items whose hash values have the same value in the bit positions designated for the current round of hash grouping and to place those items in the same group, where the bit positions needed by the current round are designated according to the size of the database cache and the storage size of the TLB;
A first traversal submodule, configured to traverse the subscripts of the original data items placed in the same group, where the subscript of an original data item identifies its position in the target data group;
A first arranging submodule, configured to arrange the original data items corresponding to the subscripts in ascending order of subscript;
A first sorting submodule, configured to write the original data items into the same group in that ascending order and save them.
Note that the multi-round hash grouping module 1032 places in the same group, based on the bits left undesignated by the previous round and saved in association with the original data items of the current group, the original data items whose hash values have the same value in the bit positions designated for the current round, and sorts and saves the items in each group according to their positions in the target data group. For its specific operation and principle, refer to the detailed description of multi-round hash grouping in Embodiments 1 to 4 of the present invention, which is not repeated here. The module mainly includes:
A retrieval submodule, configured to retrieve the bits left undesignated by the previous round of hash grouping, which are saved at the location associated with each original data item in the group currently being hash-grouped;
A determining submodule, configured to determine, from the retrieved undesignated bits, the bit positions needed by the current round of hash grouping, where those bit positions are determined by the size of the database cache and the storage size of the TLB;
A second search submodule, configured to find the original data items whose hash values have the same value in the bit positions designated for the current round of hash grouping and to place those items in the same group;
A second traversal submodule, configured to traverse the subscripts of the original data items placed in the same group, where the subscript of an original data item identifies its position in the target data group;
A second arranging submodule, configured to arrange the original data items corresponding to the subscripts in ascending order of subscript;
A second sorting submodule, configured to write the original data items into the same group in that ascending order and save them.
It should be noted that, for the specific execution process and principle of the join unit 105, refer to the detailed description of the Join operation in Embodiment 4 of the present invention; the details are not repeated here. The unit mainly includes:
an acquiring module, configured to acquire, in order, the subgroups obtained after N rounds of hash grouping of each of the two target data groups to be joined;
a Join module, configured to perform the Join operation on the original data in the subgroups of the two target data groups, taking the subgroups two at a time as pairs;
where the Join module includes:
a third traversal sub-module, configured to traverse in order, starting from one subgroup in one target data group, the subgroups in the other target data group; when an identical subgroup is reached, the first Join sub-module is executed; when no identical subgroup is reached, the process moves to the next subgroup and returns to the third traversal sub-module; this continues until every subgroup in the target data group has been traversed against the subgroups in the other target data group;
the first Join sub-module, configured to perform the Join operation, in order, between the original data in the traversing subgroup and the original data in the identical subgroup, where an identical subgroup is one whose stored original data has the same hash value as the original data stored in the traversing subgroup; once all original data in the subgroup has undergone the Join operation, the process moves to the next subgroup and returns to the third traversal sub-module.
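By way of non-limiting illustration only, the pairing of subgroups by the Join module can be sketched as follows. The name `join_subgroups` and the dictionary layout are assumptions for this sketch, and a dictionary lookup is used in place of the sequential traversal of the other target data group described above. The explicit key comparison reflects the general fact that items sharing hash bits need not share a join key.

```python
def join_subgroups(left, right):
    """Join two partitioned inputs subgroup by subgroup.

    left, right: dicts mapping a subgroup's hash bits to a list of
    (join_key, payload) records. Only subgroups with identical hash
    bits are paired.
    """
    matches = []
    for bits in sorted(left):  # visit subgroups in ascending hash order
        partner = right.get(bits)
        if partner is None:
            continue  # no identical subgroup: move to the next one
        for left_key, left_payload in left[bits]:
            for right_key, right_payload in partner:
                if left_key == right_key:
                    matches.append((left_key, left_payload, right_payload))
    return matches


left = {0: [(4, "a")], 1: [(5, "b"), (9, "c")]}
right = {1: [(5, "x"), (5, "y")], 2: [(6, "z")]}
print(join_subgroups(left, right))
```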
Embodiment 5 of the present invention discloses a hash join apparatus that performs the hash join method described above. Based on the units and modules disclosed above, when hash grouping is performed on a target data group, hash values are computed in batches, one vector at a time, and the hash values of the items of original data that belong to the same subgroup are then written into that subgroup in a single operation. Grouping in units of vectors avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance. In addition, the hash value of each item of original data is computed only in the first round of grouping; the bits needed later are recorded at the position associated with each item so that subsequent rounds can use them directly, which removes the cost of recomputing hash values and avoids wasting resources.
Meanwhile, during hash grouping, after each round and before the original data is written into its subgroup, the original data within each subgroup is sorted, so that the final sorting step only needs to order the subgroups themselves. Compared with the prior art, which sorts both the original data within each subgroup and the subgroups only after grouping is complete, this greatly reduces the sorting complexity and the time spent on sorting.
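By way of non-limiting illustration only, the first round of grouping can be sketched as follows. The multiplicative hash `hash32`, the name `first_pass`, and the use of the low-order `k` bits are assumptions for this sketch and are not taken from the disclosure. Because segments and the items inside them are visited in position order, each subgroup is filled in ascending subscript order, so only the subgroups themselves remain to be ordered at the end.

```python
def hash32(key):
    # Assumed stand-in hash (multiplicative); any hash function would do.
    return (key * 2654435761) & 0xFFFFFFFF


def first_pass(keys, vector_size, k):
    """First round: hash one vector at a time, group on the low k bits."""
    mask = (1 << k) - 1
    subgroups = {}
    for start in range(0, len(keys), vector_size):
        segment = keys[start:start + vector_size]
        hashes = [hash32(key) for key in segment]  # whole vector at once
        for offset, h in enumerate(hashes):
            # Record (subscript, unused bits, key); the unused bits are
            # saved so that later rounds never recompute the hash.
            record = (start + offset, h >> k, segment[offset])
            subgroups.setdefault(h & mask, []).append(record)
    return subgroups


subgroups = first_pass([10, 3, 7, 3, 8, 1, 10], vector_size=3, k=2)
# Final ordering only has to sort the subgroups, not their contents.
ordered = [subgroups[bits] for bits in sorted(subgroups)]
```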
The hash join method described in connection with the disclosed embodiments may be implemented in a data management system directly in hardware, in a memory whose program is executed by a processor, or in a combination of the two. Accordingly, the present invention also discloses a data management system corresponding to the method and apparatus disclosed in the foregoing embodiments. A specific embodiment is described in detail below.
As shown in FIG. 9, the data management system 1 includes a memory 11 and a processor 13 connected to the memory 11 through a bus 12.
The memory 11 has a storage medium, in which the program used for database queries is stored.
The memory 11 may include high-speed RAM and may further include non-volatile memory, for example at least one disk memory.
The processor 13 is connected to the memory 11 through the bus 12. When a database query is executed, the processor 13 invokes the database query program stored in the memory 11. The database query program may include program code, and the program code includes a series of operation instructions arranged in a certain order. The processor 13 may be a central processing unit (CPU), an application-specific integrated circuit, or one or more integrated circuits configured to implement the embodiments of the present invention.
The database query program invoked by the processor 13 may specifically include:
receiving a Structured Query Language (SQL) statement that contains a Join operation, and parsing it to obtain at least two target data groups to be joined;
dividing each target data group into multiple data segments, using a vector as the unit of quantity;
performing N rounds of hash grouping on the data segments of each target data group in turn, based on a preset grouping rule, where in each round, based on the hash values, expressed in bits, that were computed for the original data in the data segment during the first round, the original data whose hash values are identical on the bits specified for the current round is placed in the same subgroup, and the original data within each subgroup is sorted according to its position in the target data group and stored, N being a positive integer greater than or equal to 1;
for the subgroups obtained after N rounds of hash grouping of each target data group, sorting the subgroups within the target data group in ascending order of the hash values of the original data they contain;
taking, in the sorted order, the original data in the subgroups obtained after N rounds of hash grouping of the two target data groups to be joined, and performing the Join operation.
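By way of non-limiting illustration only, the steps above can be combined into one short sketch. The names `hash32`, `partition`, and `hash_join`, the multiplicative hash, the default vector size, and the fixed bit widths per round are all assumptions for this sketch; SQL parsing is omitted and each target data group is reduced to a list of integer join keys.

```python
def hash32(key):
    # Assumed stand-in hash (multiplicative); not from the disclosure.
    return (key * 2654435761) & 0xFFFFFFFF


def partition(keys, vector_size, bits_per_round):
    """Run len(bits_per_round) grouping rounds; hash only in round 1."""
    first, *later = bits_per_round
    subgroups = {}
    for start in range(0, len(keys), vector_size):  # one vector per segment
        segment = keys[start:start + vector_size]
        hashes = [hash32(key) for key in segment]
        for offset, h in enumerate(hashes):
            path = (h & ((1 << first) - 1),)
            record = (start + offset, h >> first, segment[offset])
            subgroups.setdefault(path, []).append(record)
    for width in later:  # rounds 2..N reuse the saved bits
        mask = (1 << width) - 1
        refined = {}
        for path, members in subgroups.items():
            for subscript, bits, key in members:  # already in position order
                record = (subscript, bits >> width, key)
                refined.setdefault(path + (bits & mask,), []).append(record)
        subgroups = refined
    return sorted(subgroups.items())  # subgroups ordered by their hash bits


def hash_join(left_keys, right_keys, vector_size=4, bits_per_round=(2, 2)):
    left = partition(left_keys, vector_size, bits_per_round)
    right = dict(partition(right_keys, vector_size, bits_per_round))
    matches = []
    for path, members in left:
        for _, _, left_key in members:
            for _, _, right_key in right.get(path, []):
                if left_key == right_key:  # shared bits do not imply equal keys
                    matches.append((left_key, right_key))
    return matches


print(sorted(hash_join([1, 2, 3, 4, 2], [2, 4, 6, 2])))
```

With these inputs the key 2 appears twice on each side and the key 4 once on each side, so the join yields four `(2, 2)` pairs and one `(4, 4)` pair.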
In summary:
By using a vector as the unit of quantity, and by continuing hash grouping in later rounds with the bits left unspecified in the previous round, the embodiments of the present invention can hash-group multiple items of original data at once without recomputing their hash values across rounds. This reduces cache misses and avoids the waste of computing resources caused by repeated hash computation. In addition, during grouping, the original data placed in the same subgroup is sorted according to its position in the target data group, which reduces the complexity of subsequently sorting the subgroups. Because the original data assigned to each subgroup in every round is ordered, the original data in each subgroup obtained after grouping multiple data segments is locally ordered, and joining locally ordered original data has a lower sorting complexity than joining randomly distributed original data.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be cross-referenced. Because the disclosed apparatus corresponds to the disclosed method, its description is relatively brief; for related details, refer to the description of the method. The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art.

Claims

1. A hash join method, applied to a database, comprising:
receiving a Structured Query Language (SQL) statement that contains a Join operation, and parsing it to obtain at least two target data groups to be joined;
dividing each target data group into multiple data segments, using a vector as the unit of quantity;
performing N rounds of hash grouping on the data segments of each target data group in turn, based on a preset grouping rule, wherein in each round, based on the hash values, expressed in bits, that were computed for the original data in the data segment during the first round, the original data whose hash values are identical on the bits specified for the current round is placed in the same subgroup, and the original data within each subgroup is sorted according to its position in the target data group and stored, N being a positive integer greater than or equal to 1;
for the subgroups obtained after N rounds of hash grouping of each target data group, sorting the subgroups within the target data group in ascending order of the hash values of the original data they contain;
taking, in the sorted order, the original data in the subgroups obtained after N rounds of hash grouping of the two target data groups to be joined, and performing the Join operation.
2. The method according to claim 1, wherein the first round of the N rounds of hash grouping performed on the data segments of each target data group in turn based on the preset grouping rule comprises:
computing the hash values of the original data contained in the current data segment, and expressing the computed hash values in bits;
placing the original data whose hash values are identical on the specified bits in the same subgroup, and sorting and storing the original data within each subgroup according to its position in the target data group;
associating the bits of each hash value that were not specified with the corresponding item of original data, and storing them;
and the second to n-th rounds of the N rounds of hash grouping performed on the data segments of each target data group in turn based on the preset grouping rule comprise:
performing hash grouping on the original data in any subgroup obtained from the previous round of hash grouping, n being included in N and being a positive integer greater than 2, comprising:
based on the bits not specified in the previous round of hash grouping, which are associated with and stored for the original data in the current subgroup, placing the original data whose hash values are identical on the bits specified for the current round in the same subgroup, and sorting and storing the original data within each subgroup according to its position in the target data group;
storing again the remaining unspecified bits associated with each item of original data.
3. The method according to claim 1 or 2, wherein the preset grouping rule comprises: a preset number of hash grouping rounds N, or a preset total number of subgroups S, or both a preset number of hash grouping rounds N and a preset total number of subgroups S;
when the preset grouping rule is the preset number of hash grouping rounds N, hash grouping is performed on the data segments of each target data group in turn until N rounds of hash grouping are completed;
when the preset grouping rule is the preset total number of subgroups S, hash grouping is performed on the data segments of each target data group in turn until the number of subgroups of each target data group equals the preset number of subgroups;
when the preset grouping rule is the preset number of hash grouping rounds N and the preset total number of subgroups S, and N has a higher priority than S, hash grouping is performed on the data segments of each target data group in turn until N rounds of hash grouping are completed;
when the preset grouping rule is the preset number of hash grouping rounds N and the preset total number of subgroups S, and S has a higher priority than N, hash grouping is performed on the data segments of each target data group in turn until the number of subgroups of each target data group equals the preset total number of subgroups S;
when the preset grouping rule is the preset number of hash grouping rounds N and the preset total number of subgroups S, and N and S have the same priority, hash grouping is performed on the data segments of each target data group in turn until N rounds of hash grouping are completed and the number of subgroups of each target data group equals the preset total number of subgroups S;
wherein the value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1, N including n; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;
the priorities of the preset number of hash grouping rounds N and the preset total number of subgroups S are determined by the storage size of the TLB and the size of the cache.
4. The method according to claim 1 or 2, wherein the preset grouping rule comprises: a preset number of hash grouping rounds N, a preset number of subgroups m for each round of hash grouping, and a preset total number of subgroups S; wherein the value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1, m is less than N, and the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;
when hash grouping is performed on the data segments of each target data group in turn, grouping follows the preset number of subgroups for each round, so that the final number of rounds equals the preset number of hash grouping rounds and the total number of subgroups produced equals the preset total number of subgroups.
5. The method according to any one of claims 1 to 4, wherein dividing each target data group into multiple data segments, using a vector as the unit of quantity, comprises:
dividing each target data group sequentially into M data segments, using a vector as the unit of quantity with one vector corresponding to one data segment, wherein the value of M is determined by the number of items of original data in the target data group, the size of the database cache, and the storage size of the translation lookaside buffer (TLB);
wherein the first to (M-1)-th data segments contain the same number of items of original data, and the M-th data segment contains a number of items of original data less than or equal to that contained in each of the first to (M-1)-th data segments.
6. The method according to any one of claims 2 to 4, wherein placing the original data whose hash values are identical on the specified bits in the same subgroup, and sorting and storing the original data within each subgroup according to its position in the target data group, comprises:
finding the original data whose hash values are identical on the bits specified for the current round of hash grouping, and placing that original data in the same subgroup, wherein the bits to be used in the current round are specified according to the size of the database cache and the storage size of the translation lookaside buffer (TLB);
traversing the subscripts of the original data placed in the same subgroup, wherein each subscript identifies the position of the corresponding item of original data in the target data group;
arranging the original data in ascending order of the subscripts;
writing the original data into the same subgroup in that ascending order and storing it.
7. The method according to any one of claims 2 to 4, wherein, based on the bits not specified in the previous round of hash grouping, which are associated with and stored for the original data in the current subgroup, placing the original data whose hash values are identical on the bits specified for the current round in the same subgroup, and sorting and storing the original data within each subgroup according to its position in the target data group, comprises:
retrieving the bits not specified in the previous round of hash grouping, which are stored at the position associated with each item of original data in the subgroup currently being hash-grouped;
determining, from the retrieved unspecified bits, the bits to be used in the current round of hash grouping, wherein the bits to be used are determined by the size of the database cache and the storage size of the translation lookaside buffer (TLB);
finding the original data whose hash values are identical on the bits specified for the current round of hash grouping, and placing that original data in the same subgroup;
traversing the subscripts of the original data placed in the same subgroup, wherein each subscript identifies the position of the corresponding item of original data in the target data group;
arranging the original data in ascending order of the subscripts;
writing the original data into the same subgroup in that ascending order and storing it.
8. The method according to any one of claims 1 to 7, wherein taking, in the sorted order, the original data in the subgroups obtained after N rounds of hash grouping of the two target data groups to be joined, and performing the Join operation, comprises:
acquiring, in order, the subgroups obtained after N rounds of hash grouping of each of the two target data groups to be joined;
performing the Join operation on the original data in the subgroups of the two target data groups, taking the subgroups two at a time as pairs;
wherein taking the subgroups two at a time as pairs for the Join operation comprises:
traversing in order, starting from one subgroup in one target data group, the subgroups in the other target data group;
when an identical subgroup is reached, performing the Join operation, in order, between the original data in the traversing subgroup and the original data in the identical subgroup, wherein an identical subgroup is one whose stored original data has the same hash value as the original data stored in the traversing subgroup;
once all original data in the subgroup has undergone the Join operation, moving to the next subgroup and returning to the step of traversing in order the subgroups in the other target data group;
when no identical subgroup is reached, moving to the next subgroup and returning to the step of traversing in order the subgroups in the other target data group;
continuing until every subgroup in the target data group has been traversed against the subgroups in the other target data group.
9. A hash join apparatus, applied to a database, comprising:
a receiving unit, configured to receive a Structured Query Language (SQL) statement that contains a Join operation, and to parse it to obtain at least two target data groups to be joined;
a dividing unit, configured to divide each target data group into multiple data segments, using a vector as the unit of quantity;
a grouping unit, configured to perform N rounds of hash grouping on the data segments of each target data group in turn, based on a preset grouping rule, wherein in each round, based on the hash values, expressed in bits, that were computed for the original data in the data segment during the first round, the original data whose hash values are identical on the bits specified for the current round is placed in the same subgroup, and the original data within each subgroup is sorted according to its position in the target data group and stored, N being a positive integer greater than or equal to 1;
a sorting unit, configured to sort, for the subgroups obtained after N rounds of hash grouping of each target data group, the subgroups within the target data group in ascending order of the hash values of the original data they contain;
a join unit, configured to take, in the sorted order, the original data in the subgroups obtained after N rounds of hash grouping of the two target data groups to be joined, and to perform the Join operation.
10. The apparatus according to claim 9, wherein the grouping unit comprises: a single-hash-grouping module that performs the first round of hash grouping on the data segments of each target data group; and a multiple-hash-grouping module that performs the second to n-th rounds of hash grouping on the original data in any subgroup obtained from the previous round of hash grouping, n being included in N and being a positive integer greater than 2;
the single-hash-grouping module is configured to compute the hash values of the original data contained in the current data segment and express the computed hash values in bits; to place the original data whose hash values are identical on the specified bits in the same subgroup, and to sort and store the original data within each subgroup according to its position in the target data group; and to associate the bits of each hash value that were not specified with the corresponding item of original data and store them;
the multiple-hash-grouping module is configured to, based on the bits not specified in the previous round of hash grouping, which are associated with and stored for the original data in the current subgroup, place the original data whose hash values are identical on the bits specified for the current round in the same subgroup, and sort and store the original data within each subgroup according to its position in the target data group; and to store again the remaining unspecified bits associated with each item of original data.
11、 根据权利要求 9或 10所述的装置, 其特征在于, 包括:  The device according to claim 9 or 10, comprising:
When the preset grouping rule is a preset number N of hash grouping rounds, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N rounds of hash grouping have been completed;
When the preset grouping rule is a preset total number S of groups, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups in each target data group equals the preset number of groups;
When the preset grouping rule is the preset number N of hash grouping rounds together with the preset total number S of groups, and N has a higher priority than S, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N rounds of hash grouping have been completed;
When the preset grouping rule is the preset number N of hash grouping rounds together with the preset total number S of groups, and S has a higher priority than N, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups in each target data group equals the preset total number S of groups;
When the preset grouping rule is the preset number N of hash grouping rounds together with the preset total number S of groups, and N and S have the same priority, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N rounds of hash grouping have been completed and the number of groups in each target data group equals the preset total number S of groups;
Wherein the value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1, and N includes n; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2; and the relative priority of the preset number N of hash grouping rounds and the preset total number S of groups is determined by the storage size of the TLB and the size of the cache.
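The stopping conditions recited above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `should_stop`, its parameters, and the string values used for the priority are hypothetical and do not appear in the claims, and "equals S" is relaxed to "has reached S" so that the check still terminates if a round overshoots.

```python
def should_stop(rounds_done, group_count, n=None, s=None, priority=None):
    """Decide whether hash grouping of one target data group should stop.

    rounds_done: hash grouping rounds completed so far
    group_count: number of groups the target data group currently has
    n: preset number N of hash grouping rounds (None if not configured)
    s: preset total number S of groups (None if not configured)
    priority: "N", "S" or "equal"; only consulted when both n and s are set
    """
    # Assumption: ">=" is used instead of strict equality.
    n_met = n is not None and rounds_done >= n
    s_met = s is not None and group_count >= s
    if n is not None and s is not None:
        if priority == "N":
            return n_met
        if priority == "S":
            return s_met
        # Equal priority: both conditions must hold.
        return n_met and s_met
    # Only one of the two limits (or neither) is configured.
    return n_met if n is not None else s_met
```

A caller would evaluate this after every round, for example `should_stop(1, 4, n=2, s=4, priority="S")`, which is true because the group count has reached S even though only one round has run.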
12. The apparatus according to claim 9 or 10, wherein:
When the preset grouping rule includes a preset number N of hash grouping rounds, a preset number m of groups for each hash grouping round, and a preset total number S of groups, the grouping unit is configured to group according to the preset number of groups for each hash grouping round, so that the final number of grouping rounds equals the preset number of hash grouping rounds and the total number of groups produced equals the preset total number of groups;
Wherein the value of N is determined by the storage size of the translation lookaside buffer (TLB) and is a positive integer greater than or equal to 1, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.
13. The apparatus according to any one of claims 9 to 12, wherein the dividing unit comprises:
A first dividing module, configured to use a vector as the unit of quantity, with one vector corresponding to one data segment, and to divide each target data group sequentially into M data segments, the value of M being determined by the number of items of original data in the target data group, the size of the database cache, and the storage size of the translation lookaside buffer (TLB);
Wherein the first to (M-1)th data segments each contain the same number of items of original data, and the number of items of original data in the Mth data segment is less than or equal to the number contained in each of the first to (M-1)th data segments.
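The segment layout described in this claim can be illustrated with a short sketch. It assumes a hypothetical `vector_size` parameter that has already been chosen from the cache and TLB sizes; how that choice is made is not shown here.

```python
def split_into_segments(group, vector_size):
    """Split a target data group into consecutive segments of vector_size items.

    Every segment except possibly the last holds exactly vector_size items;
    the last segment holds the remainder (at most vector_size items).
    """
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")
    return [group[i:i + vector_size] for i in range(0, len(group), vector_size)]
```

With ten items and a vector size of four, the result is M = 3 segments of sizes 4, 4 and 2, and concatenating the segments restores the original order.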
14. The apparatus according to any one of claims 10 to 12, wherein the single-pass hash grouping module, which is configured to place the original data corresponding to hash values that have the same value at the specified bit positions into the same group and, for the original data placed in the same group, to sort and store that data within the group according to the position of each item of original data in the target data group, comprises: a hash value represented in bits;
A first search sub-module, configured to find the items of original data whose hash values have the same value at the bit positions specified for the current hash grouping round, and to place those items of original data into the same group, wherein the bit positions required for the current hash grouping round are specified according to the size of the database cache and the storage size of the translation lookaside buffer (TLB);
A first traversal sub-module, configured to traverse the subscripts of the items of original data placed in the same group, the subscript of each item of original data identifying the position of that item in the target data group;
A first arrangement sub-module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;
A first sorting sub-module, configured to write the items of original data into the same group in that ascending order and store them.
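One way to picture the grouping by specified bit positions is the sketch below. The names `first_hash_grouping`, `bit_shift`, `bit_count` and `hash_fn` are hypothetical, and the hash function is left as a parameter because the claims do not name one. Scanning the input in subscript order means each group is filled in ascending subscript order without a separate sort.

```python
def first_hash_grouping(values, bit_shift, bit_count, hash_fn=hash):
    """Group values by bit_count hash bits starting at bit position bit_shift.

    Returns a dict mapping the selected bit pattern to a list of
    (subscript, value) pairs, kept in ascending subscript order.
    """
    mask = (1 << bit_count) - 1
    groups = {}
    for index, value in enumerate(values):  # ascending subscript order
        key = (hash_fn(value) >> bit_shift) & mask
        groups.setdefault(key, []).append((index, value))
    return groups
```

For example, grouping `[5, 2, 7, 4, 1]` on the lowest bit with an identity hash puts the odd values 5, 7, 1 (subscripts 0, 2, 4) in one group and the even values 2, 4 (subscripts 1, 3) in the other.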
15. The apparatus according to any one of claims 10 to 12, wherein the multi-pass hash grouping module, which is configured to perform division into the same group based on the bits that were not specified in the previous hash grouping round and that are associated with and stored for the original data in the current group and, for the original data placed in the same group, to sort and store that data within the group according to the position of each item of original data in the target data group, comprises:
A calling sub-module, configured to retrieve the bits that were not specified in the previous hash grouping round, as stored at the location associated with each item of original data in the group currently undergoing hash grouping;
A determining sub-module, configured to determine, from the retrieved unspecified bits, the bits required for the current hash grouping round, wherein the bits required for the current hash grouping round are determined according to the size of the database cache and the storage size of the translation lookaside buffer (TLB);
A second search sub-module, configured to find the items of original data whose hash values have the same value at the bit positions specified for the current hash grouping round, and to place those items of original data into the same group;
A second traversal sub-module, configured to traverse the subscripts of the items of original data placed in the same group, the subscript of each item of original data identifying the position of that item in the target data group;
A second arrangement sub-module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;
A second sorting sub-module, configured to write the items of original data into the same group in that ascending order and store them.
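A later grouping round that reuses the stored, not yet specified hash bits might look like the following sketch. The tuple layout `(subscript, value, remaining_bits)` and the function name `next_hash_grouping` are assumptions made for illustration; the claims only say that the unused bits are stored in association with each item of original data.

```python
def next_hash_grouping(group, bit_count):
    """Split one group further using the lowest bit_count unused hash bits.

    group: list of (subscript, value, remaining_bits) tuples, where
    remaining_bits holds the hash bits not consumed by earlier rounds.
    Returns a dict of subgroups, each in ascending subscript order, with the
    consumed bits shifted out so a further round can use what is left.
    """
    mask = (1 << bit_count) - 1
    subgroups = {}
    for index, value, remaining in sorted(group, key=lambda entry: entry[0]):
        key = remaining & mask
        subgroups.setdefault(key, []).append((index, value, remaining >> bit_count))
    return subgroups
```

Because each entry carries its leftover bits forward, the hash value does not have to be recomputed in later rounds.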
16. The apparatus according to any one of claims 9 to 15, wherein the joining unit comprises:
An obtaining module, configured to obtain, in order, the groups of each of the two target data groups to be joined after N rounds of hash grouping;
A Join module, configured to perform the Join operation on the original data in the groups of the two target data groups, taking the groups two at a time as pairs for the Join operation on original data;
Wherein the Join module comprises:
A third traversal sub-module, configured to traverse, in order, the groups in one target data group using a group from the other target data group; if an identical group is reached, to execute the first Join sub-module; if no identical group is reached, to move to the next group and return to the second traversal sub-module; until every group in the target data group has performed the traversal operation on the groups in the other target data group;
The first Join sub-module, configured to perform the Join operation, in order, between the original data in the traversing group and the original data in the identical group, wherein the identical group is a group in which the hash value of the stored original data is the same as the hash value of the original data stored in the traversing group; and, once all the original data in the group has undergone the Join operation, to move to the next group and return to the third traversal sub-module.
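The group-against-group traversal can be sketched as nested loops. This is a simplified illustration, assuming each side is a dict from group key (the shared hash bits) to a list of join-key values and that the Join is an equality join; the function name `join_groups` is hypothetical.

```python
def join_groups(left_groups, right_groups):
    """Join two grouped data sets by pairing groups that share the same key.

    left_groups, right_groups: dict mapping group key -> list of values.
    Returns a list of (left_value, right_value) pairs whose values are equal.
    """
    results = []
    for left_key in sorted(left_groups):
        # Traverse the groups of the other data set looking for the same key.
        for right_key in sorted(right_groups):
            if right_key != left_key:
                continue  # not the identical group, move on
            for left_value in left_groups[left_key]:
                for right_value in right_groups[right_key]:
                    if left_value == right_value:
                        results.append((left_value, right_value))
    return results
```

Only groups with matching keys are ever compared item by item, which is what keeps each Join step confined to a small working set.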
17. A database management system, applied to a database, comprising: a memory having a storage medium, the memory storing a program for performing database queries; and a processor connected to the memory via a bus, wherein, when a database query is executed, the processor invokes the database query program stored in the memory and executes the database query program according to the method of any one of claims 1 to 8.
PCT/CN2014/078304 2014-05-23 2014-05-23 Hash join method, device and database management system WO2015176315A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/078304 WO2015176315A1 (en) 2014-05-23 2014-05-23 Hash join method, device and database management system
CN201480037464.8A CN105359142B (en) 2014-05-23 2014-05-23 Hash connecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/078304 WO2015176315A1 (en) 2014-05-23 2014-05-23 Hash join method, device and database management system

Publications (1)

Publication Number Publication Date
WO2015176315A1 (en) 2015-11-26

Family

ID=54553263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/078304 WO2015176315A1 (en) 2014-05-23 2014-05-23 Hash join method, device and database management system

Country Status (2)

Country Link
CN (1) CN105359142B (en)
WO (1) WO2015176315A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326475B (en) * 2016-08-31 2019-12-27 中国科学院信息工程研究所 Efficient static hash table implementation method and system
CN108549666B (en) * 2018-03-22 2021-05-04 上海达梦数据库有限公司 Data table sorting method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162410A1 (en) * 2006-12-27 2008-07-03 Motorola, Inc. Method and apparatus for augmenting the dynamic hash table with home subscriber server functionality for peer-to-peer communications
CN101593202A (en) * 2009-01-14 2009-12-02 中国人民解放军国防科学技术大学 Based on the hash connecting method for database of sharing the Cache polycaryon processor
CN102508924A (en) * 2011-11-22 2012-06-20 上海达梦数据库有限公司 Method for realizing grace hash joint by using merge join
US20130173589A1 (en) * 2011-12-29 2013-07-04 Yu Xu Techniques for optimizing outer joins

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019070340A1 (en) * 2017-10-06 2019-04-11 Microsoft Technology Licensing, Llc Join operation and interface for wildcards
US11010387B2 (en) 2017-10-06 2021-05-18 Microsoft Technology Licensing, Llc Join operation and interface for wildcards
CN111026720A (en) * 2019-12-20 2020-04-17 深信服科技股份有限公司 File processing method, system and related equipment
CN111125011A (en) * 2019-12-20 2020-05-08 深信服科技股份有限公司 File processing method, system and related equipment
CN111026720B (en) * 2019-12-20 2023-05-12 深信服科技股份有限公司 File processing method, system and related equipment
CN111125011B (en) * 2019-12-20 2024-02-23 深信服科技股份有限公司 File processing method, system and related equipment

Also Published As

Publication number Publication date
CN105359142A (en) 2016-02-24
CN105359142B (en) 2019-04-05

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase (Ref document number: 201480037464.8; Country of ref document: CN)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14892265; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 14892265; Country of ref document: EP; Kind code of ref document: A1)