WO2015176315A1

WO2015176315A1 - Hash join method, device and database management system

Info

Publication number: WO2015176315A1
Application number: PCT/CN2014/078304
Authority: WO
Inventors: 桑永嘉; 李俊; 施会华
Original assignee: 华为技术有限公司
Priority date: 2014-05-23
Filing date: 2014-05-23
Publication date: 2015-11-26
Also published as: CN105359142A; CN105359142B

Abstract

A Hash join method, device and database management system, the method comprising: when dividing a target data group during database query, using vector as a unit of quantity to divide and calculate the Hash value of the original data in a data segment, and representing the Hash value in bits; dividing the original data corresponding to the same Hash value of specified bits into the same group based on a preset grouping rule in Hash grouping, continuing to execute Hash grouping in subsequent grouping by utilizing the unspecified bits in the previous Hash grouping, and in the grouping process, according to the positions of the original data in the target data group, ranking the original data in the same group; and conducting a join operation on the grouped and ranked original data to be joined in the corresponding groups in the target data group, thus reducing the complexity of subsequent ranking of each group.

Description

Hash join method, device and database management system

TECHNICAL FIELD

The present invention relates to the field of database technologies, and more particularly to a hash join method, apparatus, and database management system.

BACKGROUND

With the development and application of database technology, the amount of data stored in a database has grown from megabytes (M) and gigabytes (G) to the current terabytes (T) and petabytes (P). Given how much data a database can now store, the amount of data a user faces when querying the database is on the order of G, T, or even P. Queries over such large amounts of data must still respond quickly, which poses a great challenge to the processing performance of the database. In the query process, the factor most critical to database performance is the processing response time of the Join operation.

The basic methods for implementing the Join operation in a database are mainly Hash Join, Merge Join, and the Radix Join algorithm, an improvement on Grace Join. A query mainly consists of a grouping phase and a Join phase. In the grouping phase, when the number of groups is larger than the number of TLB entries of the CPU, severe TLB misses occur (TLB: Translation Lookaside Buffer, the page table buffer; a TLB entry is a page table entry cached in the TLB; a TLB miss means that the required page table entry is not in the TLB). To avoid this, existing queries use multi-pass grouping to reduce TLB misses in the grouping phase. At present, the most common query process is as follows: first, grouping is performed in multiple passes, with the raw data hashed in each grouping pass; then, after the multi-pass groups are obtained, the Join operation is performed.

It can be seen from the above that, in the existing database query process, the multi-pass grouping used in the grouping phase needs to calculate the hash value multiple times. This may generate a large number of cache misses (a cache miss means that the requested data is not in the memory level being accessed) and wastes computing resources.

Summary of the invention

In view of this, an object of the embodiments of the present invention is to provide a hash join method, apparatus, and database management system that overcome the waste of computing resources in the existing database query process.

To achieve the above objective, the embodiment of the present invention provides the following technical solutions:

A first aspect of the embodiments of the present invention provides a hash join method, which is applied to a database and includes: receiving a Structured Query Language (SQL) statement that includes a Join operation, and parsing it to obtain at least two target data groups to be joined;

Dividing each target data group into multiple data segments, using a vector as the unit of quantity;

Performing N hash groupings in sequence on each data segment in each target data group based on a preset grouping rule, wherein each hash grouping uses the bit-represented hash values calculated for the original data in the data segment during the first hash grouping, divides the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and sorts and saves the original data divided into the same group according to the position of each piece of original data in the target data group, N being a positive integer greater than or equal to 1;

For the groups obtained after N hash groupings of each target data group, sorting the groups within the target data group in ascending order of the hash value corresponding to the original data contained in each group;

Performing the Join operation, in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined.

In a first implementation manner of the first aspect of the embodiments of the present invention, performing the first of the N hash groupings on each data segment in each target data group based on the preset grouping rule includes: calculating the hash value of the original data contained in the data segment, and representing the calculated hash value in bits;

Dividing the original data whose hash values have the same value at the specified bits into the same group, and sorting and saving the original data divided into the same group according to the position of each piece of original data in the target data group;

Saving the unspecified bits of the hash value corresponding to each piece of original data in association with that original data;

Performing the second to the nth of the N hash groupings on the data segments in each target data group in sequence based on the preset grouping rule includes, when hash-grouping the original data in any group obtained from the previous hash grouping, where n is contained in N and is a positive integer greater than 2:

Based on the bits left unspecified by the previous hash grouping and saved in association with the original data in the current group, dividing the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and sorting and saving the original data divided into the same group according to the position of each piece of original data in the target data group;

Saving again the remaining unspecified bits associated with each piece of original data.
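The first and subsequent hash groupings described above can be illustrated with a minimal sketch. It assumes a 32-bit hash and a hypothetical `toy_hash` function; the names `first_pass` and `next_pass` and the tuple layout are illustrative and do not come from the source. The hash is computed once in `first_pass`, and `next_pass` regroups using only the bits saved alongside each row. Because rows are visited in position order and appended, each group stays ordered by position.

```python
HASH_BITS = 32  # assumed hash width for this sketch


def toy_hash(key):
    """Hypothetical 32-bit hash standing in for the database's hash function."""
    return (key * 2654435761) & 0xFFFFFFFF


def first_pass(segment, start_pos, bits):
    """Hash each row once, group on the top `bits` bits, save the unused bits."""
    groups = {}
    rest_width = HASH_BITS - bits
    for offset, key in enumerate(segment):
        h = toy_hash(key)
        prefix = h >> rest_width                 # specified bits for this pass
        remaining = h & ((1 << rest_width) - 1)  # unspecified bits, saved with the row
        groups.setdefault(prefix, []).append((start_pos + offset, key, remaining))
    return groups


def next_pass(group, remaining_width, bits):
    """Regroup on the next `bits` bits of the saved remainder, without rehashing."""
    groups = {}
    rest_width = remaining_width - bits
    for pos, key, remaining in group:
        prefix = remaining >> rest_width
        rest = remaining & ((1 << rest_width) - 1)
        groups.setdefault(prefix, []).append((pos, key, rest))
    return groups


# Two passes of 2 bits each over one data segment.
keys = [7, 42, 7, 13, 99, 42, 5, 13]
final = {}
for pre1, grp in first_pass(keys, 0, 2).items():
    for pre2, sub in next_pass(grp, HASH_BITS - 2, 2).items():
        final[(pre1 << 2) | pre2] = sub
```

After both passes, every row in `final[p]` has a hash whose top four bits equal `p`, which is the same partition a single 4-bit pass would produce.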

The first preset grouping rule involved in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, or a preset total number of groups S, or both a preset number of hash groupings N and a preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N, the data segments in each target data group are hash-grouped in turn until the Nth hash grouping is completed;

When the preset grouping rule is the preset total number of groups S, the data segments in each target data group are hash-grouped in turn until the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the data segments in each target data group are hash-grouped in turn until N hash groupings are completed;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the data segments in each target data group are hash-grouped in turn until the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N and S have the same priority, the data segments in each target data group are hash-grouped in turn until N hash groupings are completed and the number of groups of each target data group equals the preset total number of groups S;

The value of N is determined by the storage size of the page table buffer (TLB), is a positive integer greater than or equal to 1, and N contains n; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;

The priority of the preset number of hash groupings N and the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.

The second preset grouping rule involved in the first aspect of the embodiments of the present invention includes: a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S; wherein the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, m is less than N, and the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2. When the data segments in each target data group are hash-grouped, grouping follows the preset number of groups of each hash grouping, so that the final number of hash groupings equals the preset number of hash groupings and the total number of groups obtained equals the preset total number of groups.
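The stopping conditions of the first preset grouping rule can be summarized in a small sketch. The function name `should_stop`, its parameters, and the `priority` labels are hypothetical. The sketch also assumes the group count is compared with `>=` rather than strict equality, so that a pass which overshoots S still terminates; the source only states "equal to".

```python
def should_stop(passes_done, group_count, n=None, s=None, priority="equal"):
    """Return True once the first preset grouping rule is satisfied.

    n: preset number of hash groupings N (or None if not preset)
    s: preset total number of groups S (or None if not preset)
    priority: "n", "s", or "equal"; only consulted when both n and s are set
    """
    if n is None and s is None:
        raise ValueError("at least one of n and s must be preset")
    n_done = n is not None and passes_done >= n
    s_done = s is not None and group_count >= s  # assumption: >= instead of ==
    if s is None:
        return n_done
    if n is None:
        return s_done
    if priority == "n":
        return n_done
    if priority == "s":
        return s_done
    return n_done and s_done  # same priority: both conditions must hold
```

For example, with N = 3 and S = 16 at equal priority, grouping continues after the third pass if only 8 groups exist.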

In a second implementation manner of the first aspect of the embodiments of the present invention, dividing each target data group into multiple data segments using a vector as the unit of quantity includes:

Taking the vector as the unit of quantity, with one vector corresponding to one data segment, dividing each target data group in sequence into M data segments, wherein the value of M is determined by the number of pieces of original data in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);

The first to the (M-1)th data segments contain the same number of pieces of original data, and the Mth data segment contains a number of pieces of original data less than or equal to that contained in each of the first to the (M-1)th data segments.
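The division into M data segments can be sketched as follows. The function name `split_into_segments` and the parameter `vector_size` (the number of pieces of original data in one vector) are illustrative; how the vector size is derived from the cache and TLB sizes is not specified here and is left to the caller.

```python
def split_into_segments(rows, vector_size):
    """Split rows into consecutive segments of at most vector_size items each."""
    if vector_size < 1:
        raise ValueError("vector_size must be at least 1")
    return [rows[i:i + vector_size] for i in range(0, len(rows), vector_size)]


# Ten rows with a vector size of four give M = 3 segments;
# only the last segment may be shorter than the others.
segments = split_into_segments(list(range(10)), 4)
```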

In a third implementation manner of the first aspect of the embodiments of the present invention, dividing the original data whose hash values have the same value at the specified bits into the same group, and sorting and saving the original data divided into the same group according to the position of each piece of original data in the target data group, includes: searching for the original data whose hash values have the same value at the bits specified for the current hash grouping, and dividing that original data into the same group, wherein the bits required for the current hash grouping are specified according to the size of the database cache and the storage size of the page table buffer (TLB); traversing the subscripts of the original data divided into the same group, the subscript of each piece of original data identifying its position in the target data group;

Arranging the original data corresponding to the subscripts in ascending order of subscript;

Writing the original data into the same group and saving it in that ascending order.

In the first implementation manner of the embodiments of the present invention, dividing, based on the bits left unspecified by the previous hash grouping and saved in association with the original data in the current group, the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and sorting and saving the original data divided into the same group according to the position of each piece of original data in the target data group, includes:

Retrieving the bits left unspecified by the previous hash grouping, saved at the position associated with each piece of original data in the group currently being hash-grouped;

Determining, from the retrieved unspecified bits, the bits required for the current hash grouping, wherein the bits required for the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer (TLB);

Searching for the original data whose hash values have the same value at the bits specified for the current hash grouping, and dividing that original data into the same group;

Traversing the subscripts of the original data divided into the same group, the subscript of each piece of original data identifying its position in the target data group;

In the first implementation manner of the embodiments of the present invention, performing the Join operation, in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined includes:

Obtaining, in order, the groups obtained after N hash groupings of each of the two target data groups to be joined;

Performing the Join operation on the original data of the two target data groups by joining the original data of two groups as a pair;

The manner in which the original data of two groups is joined as a pair includes:

Taking a group in one target data group and traversing the groups in the other target data group in sequence; if an identical group is found, joining the original data in the group, in order, with the original data in the identical group, wherein an identical group is a group whose stored original data has the same hash value as the original data stored in the group performing the traversal;

After the Join operation has been performed on the original data in the group, moving to the next group and returning to the step of traversing the groups in the other target data group in sequence;

If no identical group is found, moving to the next group and returning to the step of traversing the groups in the other target data group in sequence;

Until every group in the one target data group has performed the traversal of the groups in the other target data group.
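The traversal described above can be sketched as a nested loop over two sorted group lists. The function name `join_groups` and the `(hash_value, rows)` layout are illustrative. Two points are assumptions not stated in the source: each hash value labels at most one group per target data group, and rows are matched on key equality within a pair of identical groups.

```python
def join_groups(left_groups, right_groups):
    """Join two lists of (hash_value, [(position, key), ...]) groups.

    Each list is assumed to be sorted by hash_value, with at most one
    group per hash value. Returns pairs of (left_position, right_position).
    """
    result = []
    for left_hash, left_rows in left_groups:
        for right_hash, right_rows in right_groups:
            if right_hash != left_hash:
                continue  # not the identical group; keep traversing
            for left_pos, left_key in left_rows:
                for right_pos, right_key in right_rows:
                    if left_key == right_key:  # assumption: equi-join on key
                        result.append((left_pos, right_pos))
            break  # identical group handled; move to the next left group
    return result


left = [(0, [(1, "a")]), (3, [(0, "b"), (2, "b")])]
right = [(3, [(5, "b")]), (4, [(0, "c")])]
pairs = join_groups(left, right)
```

Because both lists are sorted by hash value, the inner traversal could be replaced by a merge-style scan; the nested form is kept here because it follows the description step by step.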

A second aspect of the embodiments of the present invention provides a hash join apparatus, which is applied to a database and includes: a receiving unit, configured to receive a Structured Query Language (SQL) statement that includes a Join operation, and to parse it to obtain at least two target data groups to be joined;

a dividing unit, configured to divide each target data group into multiple data segments, using a vector as the unit of quantity;

a grouping unit, configured to perform N hash groupings in sequence on each data segment in each target data group based on a preset grouping rule, wherein each hash grouping uses the bit-represented hash values calculated for the original data in the data segment during the first hash grouping, divides the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and sorts and saves the original data divided into the same group according to the position of each piece of original data in the target data group, N being a positive integer greater than or equal to 1;

a sorting unit, configured to sort, for the groups obtained after N hash groupings of each target data group, the groups within the target data group in ascending order of the hash value corresponding to the original data contained in each group;

a joining unit, configured to perform the Join operation, in the sorted order, on the original data in the groups obtained after the N hash groupings in the target data groups to be joined.

A third aspect of the embodiments of the present invention provides a database management system, which is applied to a database, and includes:

a memory having a storage medium, wherein the memory stores a program for performing a database query; and a processor connected to the memory via a bus, wherein, when a database query is executed, the processor invokes the database query program stored in the memory and executes it according to the hash join method provided by the first aspect of the embodiments of the present invention described above.

According to the above technical solutions, compared with the prior art, the embodiments of the present invention disclose a hash join method, apparatus, and database management system. In performing a database query, after the target data groups to be joined are determined, each target data group is divided into multiple data segments using a vector as the unit of quantity; the hash value of the original data contained in each data segment is calculated and represented in bits. Then, based on the preset grouping rule and the bit-represented hash values calculated in the first hash grouping, each hash grouping divides the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and the original data divided into the same group is sorted and saved in that group according to the position of each piece of original data in the target data group.

By using a vector as the unit of quantity and grouping on the specified bits in each hash grouping, the embodiments of the present invention can hash-group multiple pieces of original data at the same time and do not need to recalculate the hash value of the original data across the multiple hash groupings. This reduces cache misses and, because the hash value is not calculated repeatedly, avoids wasting computing resources.

In addition, the original data in each group is sorted during every grouping, so that the original data in each group obtained after grouping the data segments is locally ordered. When locally ordered original data is joined, the sorting complexity is lower than when randomly distributed original data is joined.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly described below. The drawings in the following description are merely embodiments of the present invention. Those skilled in the art may obtain other drawings from the provided drawings without creative work.

FIG. 1 is a flowchart of a hash join method according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a third-time hash grouping disclosed in Example 4 of the third embodiment of the present invention;

FIG. 3 is a schematic diagram of the same original data included in each data segment disclosed in Embodiment 4 of the present invention;

FIG. 5 is a schematic diagram of a grouping of original data in a data segment according to Embodiment 4 of the present invention;

FIG. 6 is a flowchart of dividing groups in the second to Nth hash grouping process according to Embodiment 4 of the present invention;

FIG. 7 is a flowchart of performing a Join operation on original data in each group of two target data groups to be joined according to Embodiment 4 of the present invention;

FIG. 8 is a schematic structural diagram of a hash join apparatus according to Embodiment 5 of the present invention;

FIG. 9 is a schematic structural diagram of a database management system according to Embodiment 5 of the present invention.

Detailed description

For reference and clarity, the technical terms and abbreviations used below are summarized as follows:

TLB, Translation Lookaside Buffer, page table buffer; a TLB entry is a page table entry cached in the TLB;

Radix Join, radix join;

Cache miss, meaning that the requested data is not in the memory level being accessed.

The technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the scope of the present invention.

As the background shows, in the commonly used query process, the multi-pass grouping used in the grouping phase processes the original data in the same way in every grouping pass and must calculate the hash value of the original data multiple times, which wastes computing resources. Therefore, the embodiments of the present invention provide a hash join method, apparatus, and database management system that use a vector as the unit of quantity and, in subsequent grouping passes, group on the bits specified for the current hash grouping. In this way, multiple pieces of original data are hash-grouped at the same time, and the hash value of the original data does not need to be recalculated across the multiple hash groupings; this reduces cache misses and avoids the waste of computing resources caused by repeated hash calculation. At the same time, the original data in each group is sorted during every hash grouping, so that the original data in each group obtained after grouping the data segments is locally ordered. When locally ordered original data is joined, the sorting complexity is lower than when randomly distributed original data is joined. The specific process is described in detail in the following embodiments of the present invention.

Embodiment 1

The first embodiment of the present invention discloses a hash join method, and the method is applied to a database. The process is shown in step S101 to step S105 in FIG. 1.

Step S101: Receive a structured query language SQL statement including a connection Join operation, and parse and obtain at least two target data groups to be connected;

In the process of executing the database query, step S101 is executed: the database parses the received SQL query statement containing the Join operation and obtains at least two target data groups to be joined. The target data groups to be joined are parsed in pairs, so at least two target data groups to be joined are obtained in the parsing process.

Step S102: dividing each target data group into a plurality of data segments, using a vector as the unit of quantity;

In step S102, the same operation is performed on the parsed pair of target data groups to be connected, and a target data group is taken as an example in the process of dividing the data segments.

The current target data group is divided using the vector as the unit. Specifically, the vector unit refers to how many pieces of original data one vector contains, taken as a fixed unit. The target data group is divided into a plurality of data segments using this unit of quantity, that is, one data segment corresponds to one vector.

It should be noted that, normally, the maximum number of pieces of original data that one data segment can contain is one vector unit; the target data group is divided into multiple data segments, and the divided data segments usually contain the same number of pieces of original data. Alternatively, the number of pieces of original data contained in one vector may be limited according to the preset grouping rule and the total number of pieces of original data in the target data group, rather than by the maximum number of pieces of original data that the vector can contain.

Neither of the above two methods excludes the case where the last data segment contains fewer pieces of original data than the other data segments.

Based on the above manner, after step S102 is performed, each target data group to be joined is divided into a plurality of data segments. The embodiments of the present invention use a vectorized method: in the subsequent hash grouping, a vector is used as the unit of quantity, hash values are calculated for the original data in the vector at the same time, and the pieces of original data that belong to the same group are then written to the corresponding group at one time, which reduces cache misses and improves Join performance.

Step S103: Perform N hash groupings in sequence on the data segments in each target data group according to the preset grouping rule, wherein each hash grouping uses the bit-represented hash values calculated for the original data in the data segment during the first hash grouping, divides the original data whose hash values have the same value at the bits specified for the current hash grouping into the same group, and sorts and saves the original data divided into the same group according to the position of each piece of original data in the target data group, N being a positive integer greater than or equal to 1;

In performing step S103, the data segments in each target data group are hash-grouped N times in sequence based on the preset grouping rule. In the first hash grouping, taking one target data group as an example, hash grouping starts at the first data segment of the target data group and proceeds from top to bottom until the last data segment has been processed. Taking one data segment as an example, in the first hash grouping the hash value is calculated for all the original data contained in the data segment at the same time, and the hash value of each piece of original data is represented in bits. The number of bits is determined by the word size of the CPU of the computer on which the database is installed.

For example, if the computer currently installing the database is 32 bits, the hash value corresponding to the original data calculated during the first hash grouping process is represented by a 32-bit bit. If the computer on which the database is currently installed is 64-bit, the hash value corresponding to the original data calculated during the first hash grouping is represented by a 64-bit bit.

Then, according to the number of bits used in the current first hash grouping, that is, the specified bits, the specified bits of each bit-represented hash value are compared, traversed, or searched to find the hash values that have the same value at the specified bits, and the original data corresponding to those hash values is divided into the same group. For example, if the first hash grouping requires 2 bits, then the two bits starting from the highest bit of each bit-represented hash value are compared, traversed, or searched.
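Extracting the specified bits from the high end of a hash value can be sketched as below. The function name `specified_bits`, its parameters, and the default 32-bit width are assumptions made for illustration only.

```python
def specified_bits(hash_value, used_bits, take_bits, width=32):
    """Return the next take_bits bits of hash_value, counted from the highest
    bit and skipping the used_bits bits consumed by earlier groupings."""
    shift = width - used_bits - take_bits
    if shift < 0:
        raise ValueError("not enough unused bits left in the hash value")
    return (hash_value >> shift) & ((1 << take_bits) - 1)


h = 0b1011 << 28                     # hash whose four highest bits are 1011
first = specified_bits(h, 0, 2)      # highest two bits: 10
second = specified_bits(h, 2, 2)     # next two bits: 11
```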

Finally, for each raw data divided into the same group, the position of each raw data in the target data group is sorted within the group, and the position can also be considered as the position of each original data in the data segment. For example, the original data A, B, and C are divided into the same group. If A is ranked 3rd in the target data group, B is ranked 1st in the target data group, and C is ranked 6th in the target data group. After sorting, the actual storage order of A, B, and C in the group is: B, A, C.
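The in-group ordering from the example above can be reproduced in a few lines. The variable names are illustrative; the positions 3, 1, and 6 are the ones given in the text.

```python
# (name, position in the target data group) for the three rows in one group
group = [("A", 3), ("B", 1), ("C", 6)]

# Sort by position; sorted() is stable, so equal positions keep their order.
ordered = sorted(group, key=lambda item: item[1])
names = [name for name, _ in ordered]
```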

It should be noted that the first hash grouping is performed in the same way for each data segment from top to bottom and that, starting with the first hash grouping, the specified bits are taken in order beginning from the highest bit not yet used. Across the N hash groupings, the hash value of the original data is calculated only in the first hash grouping. In each subsequent hash grouping, only the not-yet-specified bits of the hash values corresponding to the original data are used: the original data whose hash values have the same value at the bits used in the current hash grouping is divided into the same group and, in the same manner as in the first hash grouping, the original data divided into the same group is sorted within the group according to the position of each piece of original data in the target data group or data segment.

The preset grouping rule mentioned in step S103 is: a preset number of hash groupings N, or a preset total number of groups S, or both a preset number of hash groupings N and a preset total number of groups S; or, alternatively, a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S. The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1; m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.

Step S104: For the groups obtained after each target data group has been hash-grouped N times, sort the groups within the target data group in ascending order of the hash value corresponding to the original data contained in each group;

In step S104, the groups in the target data group obtained after performing N hash groupings according to the preset grouping rule are reordered. The groups are sorted according to the hash value of the original data they contain. For example, after the target data group is grouped, group 1, group 2, and group 3 are obtained, where the original data contained in group 1 has a hash value of 3, the original data contained in group 2 has a hash value of 5, and the original data contained in group 3 has a hash value of 0. After sorting, the order of the groups in the target data group is: group 3, group 1, group 2.
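The group ordering in this example can be checked with a short sketch. The dictionary layout is illustrative; the hash values 3, 5, and 0 are the ones given in the text.

```python
# Hash value of the original data held by each group after N hash groupings
group_hash = {"group 1": 3, "group 2": 5, "group 3": 0}

# Sort the group names in ascending order of their hash value.
order = sorted(group_hash, key=group_hash.get)
```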

It should be noted that each group obtained after performing N-time hash grouping according to a preset grouping rule, the original data that is finally divided into the same group usually corresponds to the same hash value.

Step S105: Perform the Join operation on the original data in each group obtained after the N times hash group in the target data groups to be connected according to the ranking.

Step S105 is performed on the target data groups to be joined after steps S102 to S104 have hash-grouped them, sorted the original data within each group, and sorted the groups. In sequence, a group in one target data group to be joined is paired with a group in the other target data group to be joined, and the Join operation is performed on the ordered original data in the paired groups. The current database query task is thereby completed.

In the prior art, because the number of hardware TLB entries is larger than the number of cache ways, grouping hash values that are calculated one at a time easily leads to heavy cache thrashing and therefore to a large number of cache misses, which degrades Join performance. In the hash connection method disclosed in Embodiment 1 of the present invention, hash values are calculated in batches, one vector at a time, and the hash values corresponding to the several original data that belong to the same group are then written into the corresponding group in one pass. Hash grouping in units of a vector avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance. Moreover, the hash value of each original data is calculated only in the first grouping, and the bits to be used later are recorded at the associated position of the corresponding original data for use in subsequent groupings, thereby eliminating the cost of repeatedly calculating the hash value and avoiding waste of resources.

Meanwhile, in the hash grouping performed by the hash connection method disclosed in Embodiment 1 of the present invention, the original data in each group is sorted as it is written into the corresponding group during each hash grouping. When the final sorting is performed after the last grouping, the original data has already been partially sorted during the multiple groupings, so the original data within each group is locally ordered and only the groups need to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping is completed, this greatly reduces the sorting complexity and the time consumed by sorting. In addition, when this locally ordered original data is joined, the sorting complexity is lower than when randomly distributed original data is joined.

Embodiment 2

Based on the hash connection method disclosed in Embodiment 1 of the present invention, Embodiment 2 describes in detail the N hash groupings mentioned in step S103 shown in FIG. 1.

The process of performing the first hash grouping on each data segment in each target data group in turn, based on the preset grouping rule, includes:

Step S1031: Calculate the hash value of the original data contained in the current data segment, and use bits to represent the calculated hash value.

Each target data group has been divided in step S102 with a vector as the quantity unit. Taking any one of the target data groups as an example, when step S1031 is performed, the hash values of all the original data contained in the same data segment are calculated together, and bits are used to represent the hash value calculated for each original data. As described in Embodiment 1 of the present application, the number of bits is related to the word size of the computer on which the database is installed and is determined by the maximum number of bits of that computer's current CPU.

Step S1032: Divide the original data corresponding to hash values having the same value on the specified bits into the same group, and sort and save the original data divided into the same group in that group according to the position of each original data in the target data group;

When step S1032 is performed in the first hash grouping, the bits required for the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer TLB. Based on the hash value, represented by bits, of each original data in the data segment, the original data whose hash values have the same value on those specified bits is divided into the same group.

For example, if two bits are needed in the current grouping, the bits are specified from the highest bit of the hash value toward the lowest bit, and during grouping the original data whose hash values have the same first two bits is divided into the same group.

At the same time, when original data is divided into the same group according to the specified bits, it is sorted within the current group by its position in the target data group. For example, suppose the same group contains the original data A, B and C, where A is at the 6th position of the target data group, B is at the 1st position and C is at the 4th position. The order of the original data in the saved group obtained after step S1032 is then B, C, A, so that the original data in every group obtained by each division is ordered.

Step S1033: Associate the unspecified bits of the hash value corresponding to each original data with that original data, and save them at the associated position of the original data;

After step S1032 has been performed and the groups have been divided, step S1033 saves the bits of each hash value that were not used in the current hash grouping, that is, the unspecified bits, at the associated position of the corresponding original data. The associated position may be a storage space adjacent to the original data, or another storage space associated with the original data.
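Steps S1031 to S1033 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 8-bit hash width, the stand-in function `toy_hash`, and the `(position, value, rest)` tuple used as the "associated position" are all hypothetical choices, not taken from the disclosure.

```python
HASH_BITS = 8  # assumed hash width; the text ties the real width to the CPU

def toy_hash(value):
    """Stand-in hash function (an assumption, not the patent's hash)."""
    return (value * 37 + 11) % (1 << HASH_BITS)

def first_grouping(segment, k):
    """First hash grouping of one data segment.

    segment: list of (position, value) pairs, where position is the place of
             the original data in the target data group
    k:       number of specified bits, taken from the highest bit downward
    """
    rest_bits = HASH_BITS - k
    groups = {}
    for position, value in segment:
        h = toy_hash(value)                 # S1031: hash value as a bit pattern
        key = h >> rest_bits                # S1032: top k bits select the group
        rest = h & ((1 << rest_bits) - 1)   # S1033: unspecified bits to save
        groups.setdefault(key, []).append((position, value, rest))
    for members in groups.values():
        members.sort(key=lambda m: m[0])    # S1032: order by original position
    return groups

segment = [(6, 10), (1, 12), (4, 9), (2, 200)]
groups = first_grouping(segment, k=2)
for key, members in sorted(groups.items()):
    print(key, members)
```

Storing the leftover bits next to each entry is what lets later grouping rounds proceed without calling the hash function again.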

After the first hash grouping has been performed on each data segment in the target data group, grouping stops if the preset grouping rule is satisfied. If the preset grouping rule is not satisfied, the original data in each group obtained from the first hash grouping continues to be grouped again.

In the second to nth hash groupings, the original data in any group obtained from the previous hash grouping is hash grouped again, where n is a positive integer greater than 2 and does not exceed N. The process of performing the second through nth of the N hash groupings on the data segments in each target data group in turn, based on the preset grouping rule, includes:

Step S1034: Using the bits left unspecified by the previous hash grouping and saved at the associated position of the original data in the current group, divide the original data corresponding to hash values having the same value on the bits specified for the current hash grouping into the same group, and sort and save the original data divided into the same group in that group according to the position of each original data in the target data group;

When step S1034 is performed, the bits required for the current hash grouping are specified from among the bits saved at the associated position of the original data, and the original data whose hash values have the same value on those specified bits is divided into the same group. At the same time, when original data is divided into the same group according to the same bits, it is sorted within the current group by its position in the target data group.

Step S1035: Save the remaining unspecified bits associated with each original data again at the associated position of the original data;

In step S1035, the remaining unspecified bits are saved again at the associated position of the original data for use in subsequent groupings. Continuing the example given for step S1032, the bits currently held at the associated position of the original data are the unused bits left over after step S1032. If the current hash grouping again uses two bits, those two bits are taken from the highest bit of the currently remaining bits toward the lowest bit.

After steps S1034 and S1035 have been performed, if the current grouping does not satisfy the preset grouping rule, the process loops back through steps S1034 and S1035 until the preset grouping rule is satisfied and grouping of the target data group stops.
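A later grouping round (steps S1034 and S1035) can be sketched in the same spirit. The function name `regroup` and the `(position, value, rest)` entries are hypothetical; `rest` stands for the bits saved at the associated position after the previous round.

```python
def regroup(members, rest_bits, k):
    """One later grouping round over a single group.

    members:   list of (position, value, rest); rest holds the bits left over
               from the previous round
    rest_bits: how many bits rest currently contains
    k:         number of bits specified for this round
    """
    new_rest_bits = rest_bits - k
    groups = {}
    for position, value, rest in members:
        key = rest >> new_rest_bits                    # S1034: top k saved bits
        new_rest = rest & ((1 << new_rest_bits) - 1)   # S1035: save what is left
        groups.setdefault(key, []).append((position, value, new_rest))
    for g in groups.values():
        g.sort(key=lambda m: m[0])                     # S1034: order by position
    return groups, new_rest_bits

members = [(6, "A", 0b010101), (1, "B", 0b110101), (4, "C", 0b110010)]
subgroups, remaining = regroup(members, rest_bits=6, k=2)
print(subgroups, remaining)
```

Note that no hash function is called here: the round works only on the saved bits, which is the saving the text attributes to recording the unused bits.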

By performing steps S1031 to S1035, the target data group is grouped until the preset grouping rule is satisfied, and the original data divided into the same group is sorted in each grouping. Although the grouping result of each hash grouping is unordered as a whole, the original data within each resulting group is ordered. When this locally ordered original data is joined, the sorting complexity is lower than when randomly distributed original data is joined.

According to Embodiment 2 of the present invention, the hash value of each original data is calculated only in the first grouping, and the bits to be used later are recorded at the associated position of the corresponding original data so that subsequent groupings can use them directly, thereby eliminating the cost of repeatedly calculating the hash value and avoiding waste of resources. At the same time, in each hash grouping the original data in each group is sorted before it is written into the corresponding group, so that after the last hash grouping the original data in each group is already locally ordered and only the groups obtained by hash grouping the target data group need to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping is completed, this reduces the sorting complexity and the time consumed by sorting.

Embodiment 3

Based on the hash connection method according to Embodiments 1 and 2 of the present invention, Embodiment 3 describes in detail the preset grouping rule mentioned in step S103 shown in FIG. 1.

When the preset grouping rule is the preset number of hash groupings N, the data segments in each target data group are hash grouped in turn, and grouping of the target data group stops once N hash groupings have been completed. The value of N is determined by the storage size of the page table buffer TLB and is a positive integer greater than or equal to 1.

Example 1: The storage size of the page table buffer TLB determines that the target data group currently being hash grouped needs to be grouped four times, that is, N is 4. After the first grouping, the process described above for the second to nth hash groupings, in which the original data in any group obtained from the previous hash grouping is hash grouped again, is performed. After the 4th grouping, hash grouping of the target data group stops. The number of groups obtained at that point is the number of groups of the target data group.

When the preset grouping rule is the preset total number of groups S, the data segments in each target data group are hash grouped in turn, and grouping of the target data group stops once the total number of groups of each target data group equals the preset total number S. The value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.

Example 2: The size of the database cache determines that the preset total number of groups into which the current target data group can be divided is 10. The first hash grouping is performed on the current target data group; if the number of groups obtained after the first hash grouping is less than 10, hash grouping continues until the number of groups of the current target data group reaches 10, at which point hash grouping stops.

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the data segments in each target data group are hash grouped in turn until N hash groupings have been completed;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the data segments in each target data group are hash grouped in turn until the number of groups of each target data group equals the preset total number S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N and S have the same priority, the data segments in each target data group are hash grouped in turn until N hash groupings have been completed and the number of groups of each target data group equals the preset total number S;

The priority of the preset number of hash groupings N relative to the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.

Example 3: The preset number of hash groupings determined by the storage size of the page table buffer TLB is 3, and the preset total number of groups determined by the size of the database cache is 16. When N and S have the same priority, the total number of groups obtained after the target data group has been grouped 3 times is exactly 16. When N has a higher priority than S, the total number of groups obtained after the target data group has been grouped 3 times may be less than 16, equal to 16, or greater than 16. When S has a higher priority than N, the number of groupings performed by the time the total number of groups reaches 16 may be greater than 3, less than 3, or equal to 3.
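The stopping conditions described above can be summarized in a small Python predicate. This is a sketch under assumptions: the function name `should_stop`, the `priority` strings, and the use of `>=` rather than strict equality for the group count are illustrative choices, not taken from the disclosure.

```python
def should_stop(rounds_done, group_count, n=None, s=None, priority="equal"):
    """Return True when hash grouping should stop under a preset rule.

    n:        preset number of hash groupings N (None if the rule has no N)
    s:        preset total number of groups S (None if the rule has no S)
    priority: "n", "s" or "equal"; only consulted when both n and s are set
    """
    n_met = n is not None and rounds_done >= n
    # Assumption: reaching or passing S counts as meeting the total.
    s_met = s is not None and group_count >= s
    if n is not None and s is not None:
        if priority == "n":
            return n_met            # N takes precedence over S
        if priority == "s":
            return s_met            # S takes precedence over N
        return n_met and s_met      # same priority: both must hold
    return n_met or s_met           # rule with only N or only S

# Example 3 values: N = 3, S = 16, with 12 groups after 3 rounds.
print(should_stop(3, 12, n=3, s=16, priority="n"))      # True
print(should_stop(3, 12, n=3, s=16, priority="s"))      # False
print(should_stop(3, 12, n=3, s=16, priority="equal"))  # False
```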

When the preset grouping rule includes a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S (where the value of N is determined by the storage size of the page table buffer TLB and is a positive integer greater than or equal to 1, m is less than N, and the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2), the data segments in each target data group are hash grouped in turn according to the preset number of groups m for each hash grouping, so that the final number of hash groupings equals the preset number N and the total number of groups obtained equals the preset total number S.

Example 4: As shown in FIG. 2, the preset number of hash groupings determined by the storage size of the page table buffer TLB is 3, the number of groups for each hash grouping is 2, and the preset total number of groups determined by the size of the database cache is 16. The target data group is divided into two data segments in units of a vector. In the first hash grouping, each data segment is divided into two groups and written into the corresponding groups; in the second hash grouping, each group obtained from the previous grouping is again divided into two groups and written into the corresponding groups; and so on, until hash grouping of the target data group is complete and 16 groups are obtained.

Embodiment 3 of the present invention mainly explains the preset grouping rule on which the hash grouping mentioned in step S103 shown in FIG. 1 is based. The preset grouping rule is determined mainly by the storage size of the page table buffer TLB of the computer on which the database runs and by the size of the database cache. Grouping under the preset grouping rule helps avoid cache misses during grouping, thereby improving the performance of the subsequent Join.

Embodiment 4

In the hash connection method according to Embodiments 1 to 3 of the present invention, the specific process of step S102 shown in FIG. 1, in which each target data group is divided into a plurality of data segments with a vector as the quantity unit, includes:

Suppose that the target data group to be hash grouped contains 25 original data in total and that, with a vector as the quantity unit, one vector holds 5 original data, so that 5 original data constitute one data segment. Dividing the target data group by this vector unit yields five data segments. The 1st to 5th data segments each contain the same number of original data; FIG. 3 shows this case, in which every data segment contains the same number of original data.

Suppose instead that the target data group to be hash grouped contains 28 original data in total and that, with a vector as the quantity unit, one vector holds 5 original data, so that 5 original data constitute one data segment. Dividing the target data group by this vector unit yields six data segments. The 1st to 5th data segments each contain the same number of original data, and the 6th data segment contains 3 original data, fewer than each of the 1st to 5th data segments.
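Both cases can be checked with a short Python sketch. The helper name `split_into_segments` is hypothetical, and a plain list stands in for the target data group.

```python
def split_into_segments(target, vector_size):
    """Divide a target data group into data segments of one vector each.

    The last segment may be shorter when the number of original data is not
    a multiple of the vector size.
    """
    return [target[i:i + vector_size]
            for i in range(0, len(target), vector_size)]

even = split_into_segments(list(range(25)), 5)
uneven = split_into_segments(list(range(28)), 5)
print([len(s) for s in even])    # [5, 5, 5, 5, 5]
print([len(s) for s in uneven])  # [5, 5, 5, 5, 5, 3]
```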

In the hash connection method according to Embodiment 2 of the present invention, the specific process of the step disclosed above, in which original data is divided into the same group and the original data divided into the same group is sorted and saved in that group according to the position of each original data in the target data group, is shown in FIG. 4 and includes:

Step S201: Obtain the hash value of each original data in the data segment of the current hash grouping, the hash value being represented by bits;

Step S202: Find the original data corresponding to hash values having the same value on the bits specified for the current hash grouping, and divide that original data into the same group;

Step S202 looks up the value on the specified bits of the hash value, represented by bits, that was obtained in step S201 for each original data in the data segment of the current hash grouping. The specified bits may be specified, according to the size of the database cache and the storage size of the page table buffer TLB, before the current grouping is performed. Alternatively, the bits to be used in every subsequent grouping may be specified, according to the same two sizes, when the request for hash grouping is received; in that case there is no need to specify them again for the current grouping, and the lookup is made directly on the bits required for this hash grouping.

Step S203, traversing subscripts of each original data to be divided into the same group, and the subscripts of the respective original data are used to identify the location of each original data in the target data group;

Step S204: Arrange the original data corresponding to each subscript from small to large according to the size of each subscript;

Step S205: Write each original data into the same group and save according to the sequence from small to large.

Steps S203 to S205 sort the original data divided into the same group and write it into that group during grouping, so that the target data group becomes locally ordered as grouping proceeds. For example, in the process of hash grouping, hash values are calculated together for all the original data in one data segment, taken in units of a vector (shown by the dashed box in FIG. 5). In FIG. 5, value is the actual value participating in the join, position is the position of each original data in the entire data segment, position-1 holds the subscripts of the original data that are divided into the same group and sorted, and hash value is the hash value corresponding to the original data.

During grouping, the hash values having the same value on the specified bits are traversed and the subscripts of the corresponding original data are saved to position-1 for the corresponding group; the subscripts saved in position-1 are then traversed in turn, and the original data corresponding to each subscript is written into the corresponding group.
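The subscript-based write described above can be sketched in Python, one vector at a time. The per-group `subscripts` buffer is this sketch's reading of position-1 in FIG. 5, and the function name, the 4-bit hash values and the dictionary of groups are all illustrative assumptions.

```python
def group_one_vector(data, hashes, start, size, k, hash_bits, groups):
    """Group data[start:start + size] by the top k bits of each hash value.

    data, hashes: the target data group and its precomputed hash values
    groups:       dict mapping group key -> list of (subscript, original data)
    """
    shift = hash_bits - k
    subscripts = {}  # group key -> subscripts saved for this vector
    for i in range(start, min(start + size, len(data))):
        subscripts.setdefault(hashes[i] >> shift, []).append(i)
    for key, subs in subscripts.items():
        # S203-S204: traverse the saved subscripts from small to large
        # (they are already ascending here; sorted() makes the step explicit).
        for i in sorted(subs):
            groups.setdefault(key, []).append((i, data[i]))  # S205: write

data = ["a", "b", "c", "d", "e", "f", "g"]
hashes = [0b1010, 0b0011, 0b1100, 0b0101, 0b1111, 0b0000, 0b1001]
groups = {}
for start in range(0, len(data), 3):      # vectors of 3 original data each
    group_one_vector(data, hashes, start, 3, k=1, hash_bits=4, groups=groups)
print(groups)
```

Because vectors are processed in order and subscripts rise within each vector, every group ends up ordered by position without a separate sort over the whole group.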

By performing steps S203 to S205, the original data that needs to be written into the current group is sorted as it is written. After one vector has been grouped in this way, the next adjacent vector is processed in the same manner, until all the vectors in the target data group have completed the current hash grouping. The locally ordered groups of the target data group after the first hash grouping are thereby obtained, which takes over part of the work of sorting the original data in the final sorting of the groups and so reduces the sorting complexity.

After the first hash grouping has been performed in this manner on each data segment in the target data group currently being grouped, grouping stops if the current grouping satisfies the preset grouping rule. If the preset grouping rule is not satisfied, the original data in each group obtained from the first hash grouping continues to be grouped again.

In the hash connection method according to Embodiment 2 of the present invention, the specific process of step S1034 disclosed above, in which, using the bits left unspecified by the previous hash grouping and saved at the associated position of the original data in the current group, the original data corresponding to hash values having the same value on the bits specified for the current hash grouping is divided into the same group and is sorted and saved in that group according to the position of each original data in the target data group, is shown in FIG. 6 and includes:

Step S301: Retrieve the bits left unspecified by the previous hash grouping and saved at the associated position of each original data in the group currently being hash grouped;

When step S301 is performed, the current group is any one of the groups obtained from the previous hash grouping. The bits left unspecified by the previous hash grouping and saved at the associated position of the original data in the current group are retrieved so that the current group can be hash grouped again.

Step S302: Determine, from the retrieved unspecified bits, the bits required for the current hash grouping, where the bits required for the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer TLB;

Step S303: Find each original data corresponding to the hash value with the same value in the specified bit position in the current hash grouping process, and divide each original data into the same group.

Step S304, traversing subscripts of each original data to be divided into the same group, and the subscripts of the respective original data are used to identify the location of each original data in the target data group;

Step S305: Arrange, according to the size of each subscript, each original data corresponding to each subscript from small to large;

Step S306: Write each original data into the same group and save according to the sequence from small to large.

The process of sorting the original data divided into the same group in steps S304 to S306 is the same as in steps S203 to S205 shown in FIG. 4 and is not described again here.

Steps S301 to S306 are performed on each group obtained from the previous hash grouping of the target data group, yielding new groups whose internal original data is ordered after the further hash grouping. Likewise, after each hash grouping, hash grouping stops if the current grouping satisfies the preset grouping rule. If the preset grouping rule is not satisfied, steps S301 to S306 are performed again on the groups obtained from the previous hash grouping until the preset grouping rule is satisfied.

In the hash connection method according to Embodiments 1 to 3 of the present invention, the specific process of step S105 disclosed above, in which the Join operation is performed in sequence, according to the sorting, on the original data in each group obtained after N hash groupings in the two target data groups to be connected, includes:

Step S501: Acquire, in sequence, each group obtained after N hash groupings in the two target data groups to be connected;

After the at least two target data groups to be connected have been hash grouped according to steps S102 to S104, step S501 is performed to obtain each group in the two target data groups to be connected.

Step S502: Perform the Join operation on the original data in each group of the two target data groups, pairing the groups two by two.

For the two target data groups to be connected after hash grouping, the Join operation is performed on the original data by pairing groups two by two, so that the original data in each group of the two target data groups to be connected is joined. As shown in FIG. 7, the process includes:

Step S503: Using a group in one target data group, traverse each group in the other target data group in sequence;

Step S504: Determine whether the current group has reached a same group in the other target data group; if yes, perform step S505; if no, perform step S507;

Step S505: If a same group is reached, join the original data in the current group with the original data in that same group in sequence, where a same group is a group whose stored original data has the same hash value as the original data stored in the group used for the traversal;

Step S506: Determine whether all the original data in either of the two groups currently being joined has been joined; if yes, perform step S507; if not, continue the Join operation on the original data in the two groups and return to step S506;

Step S507: Move to the next group and return to step S503;

Steps S503 to S507 are repeated until every group in one target data group has traversed each group in the other target data group.
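The traversal in steps S503 to S507 can be sketched in Python as nested loops. This is an illustration under assumptions: each group is modeled as `(hash value, rows)`, each row as `(join key, payload)`, and matching rows inside two same groups are found by comparing join keys, a detail the text leaves open.

```python
def join_target_groups(groups_a, groups_b):
    """Join two hash-grouped target data groups.

    groups_a, groups_b: lists of (hash_value, rows), with rows being
    (join_key, payload) pairs.
    """
    results = []
    for hash_a, rows_a in groups_a:          # S503/S507: take each group in turn
        for hash_b, rows_b in groups_b:      # S503: traverse the other side
            if hash_a != hash_b:             # S504: not a same group, move on
                continue
            for key_a, payload_a in rows_a:  # S505/S506: join until exhausted
                for key_b, payload_b in rows_b:
                    if key_a == key_b:       # assumed match test on join keys
                        results.append((key_a, payload_a, payload_b))
    return results

groups_a = [(0, [(10, "x")]), (3, [(7, "y"), (23, "z")])]
groups_b = [(0, [(10, "p"), (26, "q")]), (3, [(23, "s")]), (5, [(9, "r")])]
print(join_target_groups(groups_a, groups_b))
# [(10, 'x', 'p'), (23, 'z', 's')]
```

Since both group lists are sorted by hash value after step S104, an implementation could advance through both lists in a single merge-style pass instead of rescanning; the nested loop is kept here because it mirrors the steps as written.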

In the embodiments of the present invention, both the grouping and the Join process of the hash connection are performed in units of a vector. In the first grouping, hash values are calculated together for the original data in each vector, and the hash values corresponding to the several original data that belong to the same group are then written into the corresponding group in one pass. The bits to be used later are recorded at the associated position of the corresponding original data for use in subsequent groupings, thereby eliminating the cost of repeatedly calculating the hash value and avoiding waste of resources.

Meanwhile, in the embodiments of the present invention, in each hash grouping the original data in each group is sorted before it is written into the corresponding group, and the groups are sorted after hash grouping is complete. As a result, the final sorting after the last grouping no longer has to sort both the groups and the data inside each group, which reduces the time consumed by sorting.

Embodiment 5

Corresponding to the hash connection method disclosed in Embodiments 1 to 4 of the present invention, a specific embodiment of a hash connection apparatus is described in detail below.

As shown in FIG. 8, the hash connection apparatus is applied to a database, and mainly includes: a receiving unit 101, a dividing unit 102, a grouping unit 103, a sorting unit 104, and a connecting unit 105.

The receiving unit 101 is configured to receive a structured query language SQL statement including a connection Join operation, and parse and obtain at least two target data groups to be connected;

After the receiving unit 101 has run, each target data group obtained by parsing passes through the dividing unit 102, the grouping unit 103 and the sorting unit 104 for division, grouping and sorting, and then enters the connecting unit 105, which performs the Join operation on the two grouped target data groups to be connected.

The dividing unit 102 is configured to divide each target data group into multiple data segments by using a vector vector as a quantity unit;

The grouping unit 103 is configured to perform N hash groupings on the data segments in each target data group in turn according to a preset grouping rule, where, in each hash grouping, based on the hash values, represented by bits, that were calculated for the original data in the data segment during the first hash grouping, the original data corresponding to hash values having the same value on the bits specified for the current hash grouping is divided into the same group, and the original data divided into the same group is sorted and saved in that group according to the position of each original data in the target data group; N is a positive integer greater than or equal to 1;

The sorting unit 104 is configured to sort, within each target data group, the groups obtained after the N hash groupings in ascending order of the hash values corresponding to the original data contained in each group;

The connecting unit 105 is configured to perform, according to the sorting, the Join operation in sequence on the original data in each group obtained after the N hash groupings in the two target data groups to be connected.

The grouping unit 103 includes: a first hash grouping module 1031, which performs the first hash grouping on the data segments in the target data group from top to bottom; and a multiple hash grouping module 1032, which performs the second to nth hash groupings on the original data in any group obtained from the previous hash grouping, where n is a positive integer greater than 2;

The first hash grouping module 1031 is configured to: calculate the hash value of the original data contained in the current data segment and use bits to represent the calculated hash value; divide the original data corresponding to hash values having the same value on the specified bits into the same group, and sort and save the original data divided into the same group in that group according to the position of each original data in the target data group; and associate the unspecified bits of the hash value corresponding to each original data with that original data and save them;

The multiple hash grouping module 1032 is configured to: using the bits left unspecified by the previous hash grouping that are associated with and saved for the original data in the current group, divide the original data corresponding to hash values having the same value on the bits specified for the current hash grouping into the same group, and sort and save the original data divided into the same group in that group according to the position of each original data in the target data group; and save again the remaining unspecified bits associated with each original data.

For the specific processes and principles involved, refer to the disclosure of Embodiments 1 and 2 of the present invention; they are not described again here. It should be noted that the operations performed by the grouping unit 103 differ according to the preset grouping rule.

When the preset grouping rule is a preset number of hash groupings N, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N hash groupings are completed;

When the preset grouping rule is a preset total number of groups S, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N hash groupings are completed;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N and S have the same priority, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N hash groupings are completed and the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule includes a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S, the grouping unit is configured to perform each hash grouping according to the preset number of groups m for that hash grouping, so that the number of hash groupings finally performed equals the preset number of hash groupings N and the total number of groups obtained equals the preset total number of groups S;

The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1; n is contained in N, and m is less than N. The value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2. The priority of the preset number of hash groupings N relative to the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.
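The termination conditions described above for the different preset grouping rules can be illustrated with a minimal sketch. The function name `grouping_finished`, its parameters, and the string values used for the priority are hypothetical and do not appear in the embodiments; the sketch also treats "equal to S" as "at least S", which is an assumption made so that the check cannot be skipped over.

```python
from typing import Optional


def grouping_finished(passes_done: int,
                      group_count: int,
                      n: Optional[int] = None,
                      s: Optional[int] = None,
                      priority: str = "equal") -> bool:
    """Return True when hash grouping may stop under the preset rule.

    n        -- preset number of hash groupings N (None if not part of the rule)
    s        -- preset total number of groups S (None if not part of the rule)
    priority -- "N", "S" or "equal"; only consulted when both n and s are set
    """
    if n is None and s is None:
        raise ValueError("the preset grouping rule must contain N, S, or both")

    n_done = n is not None and passes_done >= n
    # Assumption: reaching or exceeding S counts as satisfying the rule.
    s_done = s is not None and group_count >= s

    if s is None:          # rule contains only N
        return n_done
    if n is None:          # rule contains only S
        return s_done
    if priority == "N":    # N has the higher priority
        return n_done
    if priority == "S":    # S has the higher priority
        return s_done
    return n_done and s_done  # same priority: both must hold
```

For example, with N = 2 and S = 8 at equal priority, one completed grouping that already yields 8 groups is not yet sufficient, whereas it is sufficient when S has the higher priority.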

For an example of the different preset grouping rules corresponding to the grouping unit 103, refer to the example given in the third embodiment of the present invention, and no further description is made here.

It should be noted that the execution process and principle of the dividing unit 102 shown in FIG. 8 are the same as those described above for dividing each target data group into multiple data segments with a vector as the unit of quantity, and are not described again here. The dividing unit mainly includes:

a first dividing module, configured to use a vector as the unit of quantity, with one vector corresponding to one data segment, and to divide each target data group in sequence into M data segments, wherein the value of M is determined by the number of original data in the target data group, the size of the database cache, and the storage size of the page table buffer (TLB);

wherein the first to (M-1)th data segments contain the same number of original data, and the number of original data contained in the Mth data segment is less than or equal to the number contained in each of the first to (M-1)th data segments.
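The division into M data segments can be sketched as follows. The name `split_into_segments` and the parameter `vector_size` are hypothetical; in the embodiments the segment length follows from the cache and TLB sizes, whereas here it is simply passed in as an assumed value.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(target_group: Sequence[T], vector_size: int) -> List[Sequence[T]]:
    """Divide a target data group into consecutive data segments.

    Every segment except possibly the last holds exactly `vector_size`
    original data; the last segment holds the remainder, so it is never
    longer than the others.
    """
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")
    return [target_group[start:start + vector_size]
            for start in range(0, len(target_group), vector_size)]
```

With 10 original data and an assumed vector size of 4, this yields M = 3 segments of lengths 4, 4 and 2.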

It should be noted that, for the first hash grouping module 1031, which divides the original data whose hash values have the same value at the specified bit positions into the same group and sorts and saves the original data in the same group according to the position of each original data in the target data group, the specific execution process and principle can be found in the detailed description of the first hash grouping disclosed in the third embodiment of the present invention and are not repeated here. The module mainly includes: the hash value represented by bits;

a first search sub-module, configured to search for the original data whose hash values have the same value at the bit positions specified for the current hash grouping, and to divide those original data into the same group, wherein the bit positions required for the current hash grouping are specified according to the size of the database cache and the storage size of the page table buffer (TLB);

a first traversal sub-module, configured to traverse the subscripts of the original data divided into the same group, the subscript of each original data identifying the position of that original data in the target data group;

a module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;

The first sorting sub-module is configured to write each original data into the same group and save according to the order from small to large.
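The search, traversal and sorting sub-modules above can be sketched together in a few lines. The function name `first_hash_grouping`, the record layout `(subscript, key, remaining_bits)`, the use of the lowest `num_bits` bits as the specified bits, and the 32-bit truncation of the hash value are all assumptions made for illustration; the embodiments do not fix which bit positions are specified.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Record = Tuple[int, Hashable, int]  # (subscript, key, remaining hash bits)


def first_hash_grouping(segment: List[Tuple[int, Hashable]],
                        num_bits: int,
                        hash_fn: Callable[[Hashable], int] = hash) -> Dict[int, List[Record]]:
    """Group one data segment by the specified bits of each hash value.

    `segment` holds (subscript, key) pairs, where the subscript is the
    position of the original data in the target data group.
    """
    mask = (1 << num_bits) - 1
    groups: Dict[int, List[Record]] = defaultdict(list)
    for subscript, key in segment:
        h = hash_fn(key) & 0xFFFFFFFF      # hash value represented in bits
        group_id = h & mask                # bits specified for this grouping
        remaining = h >> num_bits          # unspecified bits, saved for later
        groups[group_id].append((subscript, key, remaining))
    for members in groups.values():
        members.sort(key=lambda rec: rec[0])  # ascending order of subscript
    return dict(groups)
```

Because the unspecified bits are stored with each record, a later grouping can work on them directly without recomputing the hash value.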

It should be noted that, for the multiple hash grouping module 1032, which, based on the bits left unspecified in the previous hash grouping and saved in association with the original data in the current group, divides the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group, and sorts and saves the original data divided into the same group according to the position of each original data in the target data group, the specific execution process and principle can be found in the detailed description of the multiple hash groupings disclosed in the first to fourth embodiments of the present invention and are not repeated here.

a calling sub-module, configured to invoke the unspecified bits from the previous hash grouping that are saved at the location associated with each original data in the group currently undergoing hash grouping;

a determining sub-module, configured to determine, from the invoked unspecified bits, the bits to be used in the current hash grouping, wherein the number of bits used in the current hash grouping is determined according to the size of the database cache and the storage size of the page table buffer (TLB);

a second search sub-module, configured to search for the original data whose hash values have the same value at the bit positions specified for the current hash grouping, and to divide those original data into the same group;

a second traversal sub-module, configured to traverse the subscripts of the original data divided into the same group, the subscript of each original data identifying the position of that original data in the target data group;

a module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;

The second sorting sub-module is configured to write each original data into the same group and save according to the order from small to large.
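A subsequent hash grouping that reuses the saved bits, instead of recomputing hash values, might look like the sketch below. The name `regroup_with_saved_bits` and the record layout `(subscript, key, remaining_bits)` are hypothetical, and taking the lowest `num_bits` of the saved bits as the bits for the current grouping is an assumption.

```python
from typing import Dict, Hashable, List, Tuple

Record = Tuple[int, Hashable, int]  # (subscript, key, remaining hash bits)


def regroup_with_saved_bits(group: List[Record], num_bits: int) -> Dict[int, List[Record]]:
    """Split one existing group further using bits saved by the previous grouping."""
    mask = (1 << num_bits) - 1
    subgroups: Dict[int, List[Record]] = {}
    for subscript, key, remaining in group:
        sub_id = remaining & mask            # bits determined for this grouping
        still_unused = remaining >> num_bits  # saved again for the next grouping
        subgroups.setdefault(sub_id, []).append((subscript, key, still_unused))
    for members in subgroups.values():
        members.sort(key=lambda rec: rec[0])  # ascending order of subscript
    return subgroups
```

No hash function is called here: each pass consumes some of the stored bits and writes back the rest, which is the saving the embodiments attribute to the multiple hash grouping module.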

It should be noted that, the specific execution process and the principle of the connection unit 105 can be referred to the detailed description of the Join operation in the fourth embodiment of the present invention, and details are not described herein.

an obtaining module, configured to acquire in sequence, from the two target data groups to be connected, the groups obtained after the N hash groupings;

a Join module, configured to perform a Join operation on the original data in corresponding pairs of groups from the two target data groups;

The Join module includes:

a third traversal sub-module, configured to take a group in one target data group and traverse each group in the other target data group in sequence; if the same group is found, to execute the first Join sub-module; if the same group is not found, to move to the next group and return to the third traversal sub-module; until every group in the one target data group has been traversed against each group of the other target data group;

a first Join sub-module, configured to perform a Join operation on the original data in the group used for traversing and the original data in the same group found by the traversal, wherein the same group refers to a group in which the hash value of the stored original data is the same as the hash value of the original data stored in the group used for traversing; and, after the Join operation on the original data in the group is completed, to move to the next group and return to the third traversal sub-module.
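The traversal and Join sub-modules can be sketched as a nested loop over the groups of the two target data groups. The function name `join_groups` and the representation of each target data group as a dictionary from a group identifier (the shared hash bits) to a list of `(subscript, key)` pairs are assumptions; the final comparison of keys inside matching groups is also an assumption, added because equal hash bits do not by themselves guarantee equal keys.

```python
from typing import Dict, Hashable, List, Tuple

Row = Tuple[int, Hashable]  # (subscript, key)


def join_groups(left_groups: Dict[int, List[Row]],
                right_groups: Dict[int, List[Row]]) -> List[Tuple[int, int, Hashable]]:
    """Join two grouped target data groups, group by group.

    Returns (left subscript, right subscript, key) for every matching pair.
    """
    results: List[Tuple[int, int, Hashable]] = []
    for left_id, left_rows in left_groups.items():
        # Traverse each group of the other target data group in sequence.
        for right_id, right_rows in right_groups.items():
            if left_id != right_id:
                continue  # not the same group: move to the next group
            for l_sub, l_key in left_rows:
                for r_sub, r_key in right_rows:
                    if l_key == r_key:
                        results.append((l_sub, r_sub, l_key))
    return results
```

The inner traversal mirrors the description above; looking up `right_groups.get(left_id)` directly would give the same result with less work when the groups are held in a dictionary.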

Embodiment 5 of the present invention discloses a hash connection apparatus that carries out the hash connection method described above. Based on the units and modules disclosed above, when hash grouping is performed on a target data group, hash values are calculated one vector at a time, and the hash values corresponding to the several original data belonging to the same group are then written into that group in a single operation. Grouping by vector avoids unnecessary cache thrashing, which reduces cache misses and improves Join performance. Moreover, the hash value of each original data is calculated only in the first grouping, and the bits not yet used are recorded at the location associated with the corresponding original data for use in subsequent groupings, which eliminates the cost of recalculating hash values and avoids wasting resources.

At the same time, during each hash grouping the original data are sorted before being written into their corresponding groups, so that when the final sorting is performed only the groups themselves need to be sorted. Compared with the prior art, in which both the original data within each group and the groups themselves are sorted after grouping is completed, the sorting complexity is greatly reduced and less time is spent on sorting.

The hash connection method described in connection with the embodiments of the present disclosure can be implemented in a data management system directly in hardware, in a program stored in a memory and executed by a processor, or in a combination of both. Accordingly, the present invention also discloses a data management system corresponding to the method and apparatus disclosed in the above embodiments. Specific embodiments are given below for detailed description.

As shown in FIG. 9, the data management system 1 includes a memory 11 and a processor 13 connected to the memory 11 via a bus 12.

The memory 11 has a storage medium in which a program for performing a database query is stored.

The memory 11 may contain high speed RAM memory and may also include non-volatile memory such as at least one disk memory.

The processor 13 is connected to the memory 11 via the bus 12, and the processor 13 calls the database query program stored in the memory 11 when performing a database query. The database query program may include program code, and the program code includes a series of operation instructions arranged in a certain order. The processor 13 may be a central processing unit (CPU), an application-specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention.

The database query program invoked by the processor 13 may specifically include:

Receiving a structured query language SQL statement including a join operation, parsing and acquiring at least two target data groups to be connected;

Dividing each target data group into multiple data segments with a vector as the unit of quantity;

Performing N hash groupings in sequence on the data segments in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit-represented hash values calculated from the original data in the data segment during the first hash grouping, the original data whose hash values have the same value at the bit positions specified for the current hash grouping are divided into the same group, and the original data divided into the same group are sorted and saved in that group according to the position of each original data in the target data group, N being a positive integer greater than or equal to 1;

In summary:

The embodiments of the present invention disclose that, by performing hash grouping with a vector as the unit of quantity and by using, in each subsequent grouping, the bits left unspecified in the previous hash grouping, several original data can be hash grouped at the same time, and the hash values of the original data need not be recalculated across the multiple hash groupings. That is, cache misses are reduced and repeated calculation of hash values is avoided, which prevents waste of computing resources. At the same time, during grouping, the original data divided into the same group are sorted according to their positions in the target data group, which reduces the complexity of the subsequent sorting of the groups. Moreover, because the original data divided into each group are sorted at every grouping, the original data in each group obtained after grouping the multiple data segments are locally ordered; when locally ordered original data are joined, the sorting complexity is lower than when randomly distributed original data are joined.

The various embodiments in the specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the various embodiments may be referred to each other. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant parts can be found in the description of the method. The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of both. The software module can be placed in random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art.

Claims

A hash connection method, which is characterized by being applied to a database, comprising:

Dividing each target data group into multiple data segments with a vector as the unit of quantity;

The method according to claim 1, wherein performing the first hash grouping of the N hash groupings on the data segments in each target data group according to the preset grouping rule comprises: calculating the hash value of the original data contained in the current data segment, and representing the calculated hash value in bits;

Performing the second to nth hash groupings of the N hash groupings on the data segments in each target data group according to the preset grouping rule comprises:

Performing hash grouping on the original data in any group obtained after the previous hash grouping, wherein n is contained in N and is a positive integer greater than 2, comprising:

Based on the bits left unspecified in the previous hash grouping, which are associated with and saved alongside the original data in the current group, dividing the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group, and sorting and saving the original data divided into the same group according to the position of each original data in the target data group;

The method according to claim 1 or 2, wherein the preset grouping rule comprises: a preset number of hash groupings N, or a preset total number of groups S, or a preset number of hash groupings N and a preset total number of groups S;

The priority of the preset number of hash groupings N relative to the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.

The method according to claim 1 or 2, wherein the preset grouping rule comprises: a preset number of hash groupings N, a preset number of groups m for each hash grouping, and a preset total number of groups S; the value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2;

When hash grouping is performed on the data segments in each target data group, each hash grouping is performed according to the preset number of groups for that hash grouping, so that the number of hash groupings finally performed equals the preset number of hash groupings and the total number of groups obtained equals the preset total number of groups.

The method according to any one of claims 1 to 4, wherein the dividing each target data group into a plurality of data segments by using a vector vector is:

The method according to any one of claims 2 to 4, wherein dividing the original data whose hash values have the same value at the specified bit positions into the same group, and sorting and saving the original data divided into the same group according to the position of each original data in the target data group, comprises: searching for the original data whose hash values have the same value at the bit positions specified for the current hash grouping, and dividing those original data into the same group, wherein the bit positions required for the current hash grouping are specified according to the size of the database cache and the storage size of the page table buffer (TLB); traversing the subscripts of the original data divided into the same group, the subscript of each original data identifying the position of that original data in the target data group;

The method according to any one of claims 2 to 4, wherein, based on the bits left unspecified in the previous hash grouping and saved in association with the original data in the current group, dividing the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group, and sorting and saving the original data divided into the same group according to the position of each original data in the target data group, comprises: invoking the unspecified bits from the previous hash grouping that are saved at the location associated with each original data in the group currently undergoing hash grouping;

The method according to any one of claims 1 to 7, wherein performing the Join operation on the original data in the groups obtained, after N hash groupings, from the target data groups to be connected comprises:

Until every group in the one target data group has been traversed against each group of the other target data group.

9. A hash connection device, characterized by being applied to a database, comprising: a receiving unit, configured to receive a structured query language SQL statement including a connection Join operation, and parse and obtain at least two target data groups to be connected;

a grouping unit, configured to perform N hash groupings in sequence on the data segments in each target data group based on a preset grouping rule, wherein, in each hash grouping, based on the bit-represented hash values calculated from the original data in the data segment during the first hash grouping, the original data whose hash values have the same value at the bit positions specified for the current hash grouping are divided into the same group, and the original data divided into the same group are sorted and saved in that group according to the position of each original data in the target data group, N being a positive integer greater than or equal to 1;

a sorting unit, configured to obtain, for each target data group, the groups produced by the N hash groupings, and to sort the groups in the target data group in ascending order of the hash values corresponding to the original data contained in each group;

The device according to claim 9, wherein the grouping unit comprises: a one-time hash grouping module for performing the first hash grouping on the data segments in each target data group; and a multiple hash grouping module for performing the second to nth hash groupings on the original data in any group obtained after the previous hash grouping, where n is contained in N and is a positive integer greater than 2;

The one-time hash grouping module is configured to calculate the hash value of the original data contained in the current data segment and represent the calculated hash value in bits; to divide the original data whose hash values have the same value at the specified bit positions into the same group, and to sort and save the original data divided into the same group according to the position of each original data in the target data group; and to save the unspecified bits of the hash value corresponding to each original data in association with that original data;

The multiple hash grouping module is configured to: based on the bits left unspecified in the previous hash grouping, which are associated with and saved alongside the original data in the current group, divide the original data whose hash values have the same value at the bit positions specified for the current hash grouping into the same group; sort and save the original data divided into the same group according to the position of each original data in the target data group; and save again, in association with each original data, the bits that remain unspecified.

The device according to claim 9 or 10, comprising:

When the preset grouping rule is a preset number of hash groupings N, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N hash groupings are completed; when the preset grouping rule is a preset total number of groups S, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups of each target data group equals the preset total number of groups S;

When the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and N has a higher priority than S, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until N hash groupings are completed; when the preset grouping rule is the preset number of hash groupings N and the preset total number of groups S, and S has a higher priority than N, the grouping unit is configured to perform hash grouping on the data segments in each target data group in sequence until the number of groups of each target data group equals the preset total number of groups S;

The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, and n is contained in N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2; the priority of the preset number of hash groupings N relative to the preset total number of groups S is determined by the storage size of the TLB and the size of the cache.

The device according to claim 9 or 10, comprising:

The value of N is determined by the storage size of the page table buffer (TLB) and is a positive integer greater than or equal to 1, and m is less than N; the value of S is determined by the size of the database cache and is a positive integer greater than or equal to 2.

The device according to any one of claims 9 to 12, wherein the dividing unit comprises:

The device according to any one of claims 10 to 12, wherein the one-time hash grouping module, which divides the original data whose hash values have the same value at the specified bit positions into the same group and sorts and saves the original data divided into the same group according to the position of each original data in the target data group, comprises: a hash value represented by bits;

The device according to any one of claims 10 to 12, wherein the multiple hash grouping module, which, based on the bits left unspecified in the previous hash grouping and saved in association with the original data in the current group, divides the original data into the same group and sorts and saves the original data divided into the same group according to the position of each original data in the target data group, comprises:

a determining sub-module, configured to determine, from the invoked unspecified bits, the bits to be used in the current hash grouping, wherein the bits required in the current hash grouping are determined according to the size of the database cache and the storage size of the page table buffer (TLB);

a second traversal sub-module, configured to traverse the subscripts of the original data divided into the same group, the subscript of each original data identifying the position of that original data in the target data group;

a module, configured to arrange the original data corresponding to the subscripts in ascending order of subscript;

The device according to any one of claims 9 to 15, wherein the connecting unit comprises:

The Join module includes:

A database management system, characterized by being applied to a database, comprising: a memory having a storage medium, wherein the memory stores a program for performing a database query; and a processor connected to the memory through a bus, wherein, when a database query is performed, the processor invokes the database query program stored in the memory and executes the database query program according to the method of any one of claims 1-8.