CN115062027A

CN115062027A - Hash connection method, computing node, storage medium, and program product

Info

Publication number: CN115062027A
Application number: CN202210825987.5A
Authority: CN
Inventors: 吴宇昊; 李飞飞; 魏闯先
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-09-16

Abstract

The application provides a hash join method, a computing node, a storage medium, and a program product. The hash join method comprises: performing a hash operation on a first data table based on a first hash function to obtain a hash table; generating a bloom filter based on the hash table; filtering a second data table according to the bloom filter; and performing a hash join on the hash table and the filtered second data table to obtain and output a hash join table. Because the second data table is filtered through the dynamically generated bloom filter before the hash join, the amount of data that must be transmitted and probed is reduced, which improves hash join efficiency and reduces hash join cost.

Description

Hash connection method, computing node, storage medium, and program product

Technical Field

The present application relates to the field of database technologies, and in particular, to a hash join method, a computing node, a storage medium, and a program product.

Background

Hash join is a common operation for establishing relationships between multiple tables in a database. A hash join builds a hash table for one table, then scans the tuples of the other table and compares them against the hash table, thereby detecting the join relationship between the two tables.

In distributed databases, especially large-scale ones such as data warehouses, data is distributed across multiple nodes. To perform a hash join, the tuples of the data table stored on each node must be transmitted over the network, so that each tuple can be joined with the corresponding tuple in the hash table and output when the two satisfy the equi-join condition. The network overhead of the hash join multiplies as the cluster scale increases, so hash join efficiency is low and cannot meet requirements.

Disclosure of Invention

The application provides a hash join method, a computing node, a storage medium, and a program product, in which tuples in a data table are filtered based on a dynamic bloom filter, so that the amount of data that must be transmitted and probed is reduced and hash join efficiency is improved.

In a first aspect, the present application provides a hash join method, including:

performing a hash operation on a first data table based on a first hash function to obtain a hash table; generating a bloom filter based on the hash table; filtering a second data table according to the bloom filter; and performing a hash join on the hash table and the filtered second data table to obtain and output a hash join table.

Optionally, generating a bloom filter based on the hash table includes:

generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node; acquiring a sub-bloom filter corresponding to each other computing node; and generating the bloom filter according to the sub bloom filter corresponding to each computing node.

Optionally, generating a bloom filter based on the hash table includes:

generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node; acquiring setting information of the sub-bloom filters corresponding to other computing nodes; generating the bloom filter according to the setting information and the sub bloom filter corresponding to the computing node; the setting information is used for describing the set bit of the corresponding sub-bloom filter.

Optionally, after generating the sub bloom filter corresponding to the computing node according to the hash table corresponding to the computing node, the method further includes:

calculating the amount of data required to transmit the sub-bloom filter in a first transmission mode and in a second transmission mode, respectively; determining a target transmission mode for the sub-bloom filter from the first transmission mode and the second transmission mode according to the amount of data required for transmission; and broadcasting the sub-bloom filter corresponding to the computing node to the other computing nodes based on the target transmission mode. Broadcasting based on the first transmission mode comprises broadcasting the sub-bloom filter itself to the other computing nodes. Broadcasting based on the second transmission mode comprises broadcasting the setting information of the sub-bloom filter to the other computing nodes, where the setting information describes the positions at which the corresponding sub-bloom filter is set.
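As a non-limiting illustration, the choice between the two transmission modes can be sketched as follows. The encoding is an assumption made only for this sketch: the first mode is taken to send the whole bit array packed into bytes, and the second mode is taken to send each set position as a fixed-width integer. The function names `transmission_sizes` and `choose_transmission_mode` are hypothetical and do not come from the application.

```python
import math


def transmission_sizes(bits):
    """Return (full_bytes, set_info_bytes) for a sub-bloom filter bit list."""
    num_bits = len(bits)
    # First mode (assumed encoding): the whole bit array, packed 8 bits per byte.
    full_bytes = math.ceil(num_bits / 8)
    # Second mode (assumed encoding): one fixed-width index per set bit.
    index_bytes = max(1, math.ceil((num_bits - 1).bit_length() / 8))
    set_info_bytes = sum(bits) * index_bytes
    return full_bytes, set_info_bytes


def choose_transmission_mode(bits):
    """Pick the mode that needs less data; ties go to the full mode."""
    full_bytes, set_info_bytes = transmission_sizes(bits)
    return "full" if full_bytes <= set_info_bytes else "set_info"


sparse = [0] * 1024
sparse[5] = sparse[900] = 1      # only two bits set
dense = [1] * 1024               # every bit set
```

Under these assumptions a sparsely set filter is cheaper to send as setting information, while a densely set filter is cheaper to send in full.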

Optionally, generating a sub-bloom filter corresponding to the computing node according to the hash table includes:

obtaining at least one second hash function; calculating a second hash value of each tuple in the first data table according to the at least one second hash function; and constructing the sub-bloom filter corresponding to the computing node according to the first hash value and the second hash value, where the first hash value is a hash value in the hash table.

Optionally, the method further includes:

determining a hash cost according to the first data table; and determining the number of hash functions of the sub-bloom filter corresponding to the computing node according to the hash cost and the join cost, so as to obtain a corresponding number of second hash functions.

Optionally, filtering the second data table according to the bloom filter includes:

when the hash distribution of the second data table differs from that of the first data table, broadcasting the bloom filter to the process corresponding to the second data table; the process corresponding to the second data table filters the scanned tuples of the second data table according to the bloom filter to obtain the filtered second data table, and sends the filtered second data table to the process corresponding to the first data table, so that the process corresponding to the first data table performs the hash join on the hash table and the filtered second data table.

In a second aspect, the present application provides a hash join apparatus, including:

a hash operation module for performing a hash operation on a first data table based on a first hash function to obtain a hash table; a filter generation module for generating a bloom filter based on the hash table; a filtering module for filtering a second data table according to the bloom filter; and a hash join module for performing a hash join on the hash table and the filtered second data table to obtain and output a hash join table.

In a third aspect, the present application provides a computing node, comprising:

a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored by the memory to implement the method provided by the first aspect of the present application.

In a fourth aspect, the present application provides a distributed database comprising a plurality of computing nodes as provided in the third aspect of the present application.

In a fifth aspect, the present application provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method provided by the first aspect of the present application when executed by a processor.

In a sixth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, performs the method provided in the first aspect of the present application.

The application provides a hash join method, a computing node, a storage medium, and a program product for application scenarios in which the join relationship between a first data table and a second data table is determined through a hash join. A hash operation is performed on the first data table based on a first hash function to obtain the corresponding hash table; a bloom filter is dynamically constructed based on the hash table; the second data table is filtered based on the constructed bloom filter; and a hash join is performed on the hash table and the filtered second data table, thereby determining the join relationship between the first data table and the second data table and obtaining a hash join table on which subsequent data processing, such as data statistics, can be based. Filtering the second data table with the dynamic bloom filter reduces the data volume of the second data table, reduces its transmission and probing overhead, and improves hash join efficiency.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

Fig. 2 is a schematic flowchart of a hash join method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of step S203 in the embodiment shown in Fig. 2;

Fig. 4 is a flowchart of one implementation of step S202 in the embodiment shown in Fig. 2;

Fig. 5 is a schematic diagram of the full transmission mode of the sub-bloom filter in the embodiment shown in Fig. 4;

Fig. 6 is a flowchart of another implementation of step S202 in the embodiment shown in Fig. 2;

Fig. 7 is a schematic diagram of the set-bit transmission mode of the sub-bloom filter in the embodiment shown in Fig. 6;

Fig. 8 is a schematic flowchart of another hash join method according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a computing node according to an embodiment of the present application.

The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to illustrate the concept of the application to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.

First, the terms used in the present application are explained:

Tuple: a basic concept in relational databases; each row in a data table is a tuple.

Hash join: a database operator that joins tables using a hash algorithm, combining data from two or more tables according to a relationship between certain columns.

Bloom filter: a data structure used to test whether an element exists in a set. The hash values of the element are calculated by K hash functions, and the K hash values are mapped to K bits of the binary vector (bit array) of the bloom filter. If at least one of the K bits is 0, the element does not exist in the set; if all K bits are 1, the element may exist in the set.
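As a non-limiting illustration of the definition above, a bloom filter with K hash functions can be sketched as follows. The class name `BloomFilter`, the bit-array size, and the use of salted SHA-256 digests as the K hash functions are assumptions made only for this sketch; the application does not prescribe them.

```python
import hashlib


class BloomFilter:
    """Bit-array bloom filter with k hash functions (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # every bit starts at 0

    def _positions(self, item: str):
        # Assumed hash family: salt one digest with the hash-function index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit proves absence; all 1 bits only mean "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(num_bits=64, num_hashes=3)
for member_no in ["m001", "m002", "m003"]:
    bf.add(member_no)
```

Every inserted element is reported as possibly present, while an element that was never inserted is usually, but not always, reported as absent.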

A distributed database is a logically unified database formed by connecting multiple physically dispersed database units through a computer network. Each connected database unit is called a node or computing node. A distributed database includes at least two nodes, which may be physical nodes located in different places or logical nodes within the same physical database.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in Fig. 1, a distributed database is composed of a plurality of physical nodes, for example nodes 1 to n, each of which stores one or more data tables, with different data tables storing different data. Fig. 1 takes the case in which each node stores two data tables as an example: data table 1-1 and data table 2-1 are stored in node 1, data table 1-2 and data table 2-2 are stored in node 2, and so on, with data table 1-n and data table 2-n stored in node n. Data tables 1-1 through 1-n are parts of data table 1, and data tables 2-1 through 2-n are parts of data table 2.

In some embodiments, some nodes may not store any data, or one node further stores the data table 3 or a part of the data table 3, and the application does not limit the data storage of each node.

For example, the data table 1 may be an order table including information of a member number, an order time, and order details, and the data table 2 may be a member table including information of a member number, a member name, and a member time.

A hash join mainly comprises a build process and a probe process. In the build process, the table with the smaller data volume (for example, data table 1) is designated the right table, and a hash operation is applied to the join attribute of each of its tuples to compute a hash value, thereby building the corresponding hash table. In the probe process, the other table of the hash join, i.e., the one with the larger data volume (for example, data table 2), is designated the left table. Each row of the left table is scanned, the hash value of its join attribute is computed and compared against the hash table built in the build process, and the records satisfying the join condition are found, thereby determining the join relationship between the left table and the right table.
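As a non-limiting illustration, the build process and the probe process can be sketched on a single node as follows. The table contents and the function names `build` and `probe` are hypothetical, and Python's dictionary hashing stands in for the hash operation on the join attribute.

```python
from collections import defaultdict


def build(right_rows, key):
    """Build process: hash the join attribute of every tuple in the right table."""
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)  # dict hashing stands in for the hash operation
    return table


def probe(hash_table, left_rows, key):
    """Probe process: scan the left table and look up each join attribute."""
    for row in left_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}


# Hypothetical example data: the smaller table is the right table.
orders = [{"member_no": 1, "order": "A"}, {"member_no": 3, "order": "B"}]
members = [
    {"member_no": 1, "name": "Ann"},
    {"member_no": 2, "name": "Bob"},
    {"member_no": 3, "name": "Cy"},
]
joined = list(probe(build(orders, "member_no"), members, "member_no"))
```

In this sketch every row of the left table is probed, including the row for member 2 that has no match; this is the work that the bloom filter described later is intended to avoid.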

Taking the hash join between data table 1 (right table) and data table 2 (left table) as an example: before probing, data table 2 must be transmitted, that is, data tables 2-1 through 2-n are sent over the network so that data table 2 is in the same process as the hash table corresponding to data table 1 and the probe process can be carried out.

For a distributed database, the left table and the right table may be stored across a plurality of physical nodes. With the hash join approach described above, the left table must be transmitted over the network, so network overhead grows with the scale of the distributed database and hash join efficiency is low.

Even when the left table and the right table are stored in the same node, the left table still needs to be transmitted to the hash join operator by local transmission in order to perform the hash join. Because the left table has a large data volume, the transmission overhead is large and hash join efficiency is low.

To improve hash join efficiency, that is, to reduce the amount of data transmitted during the hash join, the application provides a hash join method based on a combined dynamic bloom filter. The main process is as follows. Based on the right table of the hash join (the first data table below), which is usually the table with the smaller data volume, a dynamic bloom filter is constructed from the hash table generated in the build process. The tuples of the left table of the hash join (the second data table below) are then filtered based on the dynamic bloom filter, reducing the data volume of the left table. The filtered left table is sent over the network to the physical node or process where the right table is located, and the hash join is performed based on the hash table corresponding to the right table and the filtered left table. This reduces both the transmission overhead of the left table and the amount of data probed during the hash join, thereby improving the hash join efficiency of data tables in the distributed database.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 2 is a schematic flowchart of a hash join method according to an embodiment of the present application, where the method is applied to a computing node of a database.

In one embodiment, the database may be a distributed database, or may be a centralized database.

As shown in fig. 2, the hash join method includes the following steps:

Step S201, performing a hash operation on the first data table based on the first hash function to obtain a hash table.

The first data table is the input to the build process of the hash join, i.e., the right table, for which a corresponding hash table must be built. The first data table may be stored in one or more nodes of a database. One or more computing nodes perform the hash operation on their corresponding first data table to obtain the corresponding hash table.

The compute node may be a physical node or a logical node.

Specifically, each computing node in which a first data table is stored scans its stored first data table and then performs a hash operation on the scanned tuples based on the first hash function to obtain the corresponding hash table.

The hash table may be stored in a memory of the corresponding computing node. The first data table may be stored in a memory and/or a disk of the corresponding computing node.

The first hash function may be a hash function constructed in any manner, which is not limited in this application.

Step S202, generating a bloom filter based on the hash table.

Specifically, after the computing node generates the corresponding hash table, the computing node may further construct a corresponding bloom filter based on the hash table, so as to perform filtering on the second data table.

When there are a plurality of computing nodes, for the convenience of distinguishing, the bloom filter corresponding to each computing node is regarded as a sub-bloom filter. The computing node may further broadcast the constructed sub-bloom filters to other computing nodes to integrate the sub-bloom filters to obtain a complete bloom filter, so as to perform filtering on each second data table based on the complete bloom filter.

Specifically, the bloom filter may be set according to each first hash value in the hash table corresponding to the computing node, so as to obtain the bloom filter or sub-bloom filter corresponding to that computing node.

The bloom filter (or sub-bloom filter) initially consists of a bit array whose bits are all 0. The bit array may be set based on each hash value in the hash table corresponding to the first data table (denoted a first hash value): the bit to which the first hash value maps is set, i.e., its value is set to 1. By traversing every first hash value in the hash table corresponding to the computing node, the sub-bloom filter corresponding to the computing node is obtained.

Constructing the bloom filter by reusing the hash values in the hash table output by the build process of the hash join reduces the overhead of constructing the bloom filter and improves construction efficiency.

Because the hash table corresponding to the computing node is computed using only the first hash function, a bloom filter built solely from it has only one hash function (k = 1), and its false-positive rate is high. To reduce the false-positive rate, at least one second hash function may additionally be obtained, and a second hash value of each tuple in the first data table is calculated based on the second hash function, so that the bloom filter is constructed based on both the first hash value and the second hash value. The more second hash functions there are, the more second hash values must be calculated, which increases the overhead of constructing the bloom filter. The number of second hash functions is therefore determined by weighing the construction overhead against the false-positive rate.
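As a non-limiting illustration, the construction of a sub-bloom filter from the stored first hash values plus a configurable number of second hash functions can be sketched as follows. The bit-array size `NUM_BITS`, the use of CRC32 as a stand-in for both the first and the second hash functions, and all function names are assumptions made only for this sketch.

```python
import zlib

NUM_BITS = 64  # hypothetical bit-array size


def first_hash(key: int) -> int:
    # Stand-in for the first hash function used in the build process.
    return zlib.crc32(str(key).encode())


def second_hash(key: int, seed: int) -> int:
    # Stand-in for a second hash function; the seed selects one of several.
    return zlib.crc32(f"{seed}:{key}".encode())


def build_hash_table(keys):
    # Keep each first hash value next to its key so it can be reused later.
    return [(first_hash(k), k) for k in keys]


def build_sub_bloom(hash_table, num_second_hashes: int):
    bits = [0] * NUM_BITS
    for h1, key in hash_table:
        bits[h1 % NUM_BITS] = 1  # reuse the stored first hash value
        for seed in range(num_second_hashes):
            bits[second_hash(key, seed) % NUM_BITS] = 1
    return bits


def might_contain(bits, key: int, num_second_hashes: int) -> bool:
    positions = [first_hash(key) % NUM_BITS]
    positions += [second_hash(key, s) % NUM_BITS for s in range(num_second_hashes)]
    return all(bits[p] for p in positions)


table = build_hash_table([1, 2, 3])
bits_k1 = build_sub_bloom(table, num_second_hashes=0)  # first hash only
bits_k3 = build_sub_bloom(table, num_second_hashes=2)  # first hash plus two second hashes
```

With `num_second_hashes=0` no additional hashing is needed beyond the build process; each additional second hash function costs one more hash computation per tuple in exchange for a lower false-positive rate.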

Step S203, filtering the second data table according to the bloom filter.

The second data table is the input to the probe process of the hash join, i.e., the left table. The second data table may be stored in one or more nodes of the database.

In one embodiment, the first data table and the second data table may be stored in the same physical node.

In one embodiment, the first data table and the second data table may be stored in different physical nodes.

In one embodiment, each compute node may correspond to a set of first and second data tables.

Each computing node filters its own second data table according to the dynamically constructed bloom filter, deleting the tuples whose hash values map to bloom filter positions that are not all 1. This reduces the data volume of the second data table and yields the filtered second data table.

In one embodiment, the first data table and the second data table have the same hash distribution, that is, they are stored in the same slice, i.e., in the same process of the computing node. In this case the process may pass the constructed bloom filter directly to the scan operator of the second data table, and the scan operator filters the scanned tuples of the second data table according to the bloom filter.

For example, Fig. 3 is a schematic flowchart of step S203 in the embodiment shown in Fig. 2. In Fig. 3, the bloom filter is an 8-bit array with a single hash function (that is, only the first hash function serves as the hash function of the bloom filter). In the bit array of the bloom filter constructed from the hash table, the 2nd, 4th, and 7th bits are 1 and the remaining bits are 0. If the hash value that the first hash function computes for a tuple x in the second data table maps to the 3rd bit, whose value is 0, this indicates that the tuple does not exist in the first data table, and tuple x is deleted. If the hash value computed for a tuple y in the second data table maps to the 7th bit, whose value is 1, this indicates that the tuple may exist in the first data table, and tuple y is retained. Proceeding in this way, every tuple in the second data table whose mapped bits in the bit array of the bloom filter are not all 1 is deleted.
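As a non-limiting illustration, the Fig. 3 example can be reproduced as follows. The dictionary `mapped_bit` is a hypothetical stand-in for applying the first hash function to tuples x and y; only the resulting bit positions (3 and 7) come from the example above.

```python
NUM_BITS = 8
bloom = [0] * NUM_BITS
for pos in (2, 4, 7):  # 1-indexed positions set from the hash table
    bloom[pos - 1] = 1

# Hypothetical stand-in for the first hash function: tuple -> 1-indexed bit.
mapped_bit = {"x": 3, "y": 7}


def keep(tuple_id: str) -> bool:
    """Retain a tuple only if the bit it maps to is 1."""
    return bloom[mapped_bit[tuple_id] - 1] == 1


filtered = [t for t in ("x", "y") if keep(t)]
```

Here `filtered` contains only tuple y, matching the outcome described for Fig. 3.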

Through the filtering operation, the tuples of the second data table that are irrelevant to the first data table are deleted. The data volume of the second data table is greatly reduced without affecting the hash join, which improves the transmission efficiency of the second data table and, by reducing the data to be probed, accelerates the probe process of the hash join.

Step S204, performing a hash join on the hash table and the filtered second data table to obtain and output a hash join table.

The probe process of the hash join is performed on the filtered second data table and the hash table obtained from the hash operation on the first data table, so as to obtain a hash join table describing the join relationship between the first data table and the second data table.

Specifically, each row (record) of the filtered second data table is traversed, and the hash table is searched for records that satisfy the join condition with it. If such records exist, the record and the matching first-data-table records from the hash table are stored in the hash join table. Once the filtered second data table has been fully traversed, the hash join table corresponding to the first data table and the second data table is obtained.

In one embodiment, if the database is stored in multiple computing nodes, that is, there are multiple groups of first and second data tables, the hash join tables generated by the computing nodes may be merged to obtain the final output hash join table.

Further, the hash join table may be sent to the user terminal, so that the user can learn the join relationship between each group of first and second data tables and carry out subsequent data analysis, such as data statistics and policy customization, based on that relationship.

For example, the first data table may be the order table of store A for a set time period (e.g., a day or a week), and the second data table may be a customer table. The hash join table then indicates the customers who placed orders at store A in the set time period, that is, the hash join determines the customer, and the customer's attributes, corresponding to each order of store A in that period. Further, based on the hash join table, the customer label corresponding to each product in store A may be identified, or the attributes of the customers who purchased each product in store A may be counted.

The hash join method provided by the application addresses application scenarios in which the join relationship between a first data table and a second data table is determined through a hash join. A hash operation is performed on the first data table based on a first hash function to obtain the corresponding hash table; a bloom filter is dynamically constructed based on the hash table; the second data table is filtered based on the constructed bloom filter; and a hash join is performed on the hash table and the filtered second data table, thereby determining the join relationship between the first data table and the second data table and obtaining a hash join table on which subsequent data processing, such as data statistics, can be based. Filtering the second data table with the dynamic bloom filter reduces the data volume of the second data table, reduces its transmission and probing overhead, and improves hash join efficiency.
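As a non-limiting illustration, steps S201 to S204 can be sketched together on a single node as follows. The function name `hash_join`, the bit-array size `NUM_BITS`, the use of CRC32 as the first hash function, and the example tables are assumptions made only for this sketch; a Python dictionary stands in for the hash table.

```python
import zlib

NUM_BITS = 32  # hypothetical bit-array size


def h1(key) -> int:
    # Stand-in for the first hash function, reduced to a bit position.
    return zlib.crc32(str(key).encode()) % NUM_BITS


def hash_join(first_table, second_table, key):
    # Step S201: build the hash table on the first (right) data table.
    hash_table = {}
    for row in first_table:
        hash_table.setdefault(row[key], []).append(row)
    # Step S202: generate the bloom filter from the keys in the hash table.
    bloom = [0] * NUM_BITS
    for k in hash_table:
        bloom[h1(k)] = 1
    # Step S203: filter the second (left) data table with the bloom filter.
    filtered = [row for row in second_table if bloom[h1(row[key])]]
    # Step S204: probe the hash table with the filtered second data table.
    joined = [
        {**match, **row}
        for row in filtered
        for match in hash_table.get(row[key], [])
    ]
    return filtered, joined


orders = [{"cust": 1, "item": "pen"}, {"cust": 4, "item": "ink"}]
customers = [{"cust": i, "name": f"c{i}"} for i in range(1, 9)]
filtered, joined = hash_join(orders, customers, "cust")
```

The filter may let through a few customers without orders (false positives), but it never drops a customer that has a matching order, so the join result is unchanged.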

Fig. 4 is a flowchart of one implementation of step S202 in the embodiment shown in Fig. 2. This embodiment addresses a distributed database comprising a plurality of computing nodes, each of which stores a group consisting of a first data table and a second data table, and the sub-bloom filters are transmitted in a full transmission mode. As shown in Fig. 4, step S202 may specifically include the following steps:

Step S401, generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node.

The hash table corresponding to the computing node is the hash table that the computing node obtains by performing a hash operation, based on the first hash function, on its scanned first data table.

Each computing node generates its sub-bloom filter based on its corresponding hash table and broadcasts the generated sub-bloom filter to the other computing nodes.

Step S402, acquiring the sub-bloom filters corresponding to the other computing nodes.

Each computing node broadcasts its generated sub-bloom filter to the other computing nodes, so that every computing node can obtain a complete bloom filter.

Step S403, generating the bloom filter according to the sub bloom filter corresponding to each of the computing nodes.

Specifically, each computing node combines the bit arrays of all the sub-bloom filters by a bitwise OR operation or by addition to obtain a global, complete bloom filter, and the second data table of each computing node is filtered based on that bloom filter.

For example, Fig. 5 is a schematic diagram of the full transmission mode of the sub-bloom filter in the embodiment shown in Fig. 4. Fig. 5 takes two computing nodes, Seg1 and Seg2, as an example, with an 11-bit array for each sub-bloom filter. Seg1 and Seg2 each construct a corresponding sub-bloom filter, namely Bloom1 (01010010010) and Bloom2 (10001010010). Each computing node broadcasts the sub-bloom filter it constructed to the other computing nodes, and each computing node combines the sub-bloom filters, i.e., Bloom1 + Bloom2, to obtain the global, complete bloom filter (11011010010).

With the full transmission mode, the sub-bloom filter generated by each computing node is broadcast to the other computing nodes, so that every computing node obtains the global, complete bloom filter and filters its own second data table based on it. This improves filtering accuracy and prevents records that have a join relationship with records in the first data table from being deleted by mistake.
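As a non-limiting illustration, the merge in the Fig. 5 example can be reproduced with a bitwise OR. Representing each sub-bloom filter as a Python integer parsed from its bit string, and the function name `merge_sub_blooms`, are assumptions made only for this sketch.

```python
from functools import reduce
from operator import or_


def merge_sub_blooms(bit_strings):
    """OR together equal-length sub-bloom filters given as bit strings."""
    width = len(bit_strings[0])
    merged = reduce(or_, (int(s, 2) for s in bit_strings))
    return format(merged, f"0{width}b")  # zero-pad back to the original width


bloom1 = "01010010010"  # sub-bloom filter of Seg1
bloom2 = "10001010010"  # sub-bloom filter of Seg2
global_bloom = merge_sub_blooms([bloom1, bloom2])
```

The result is the bit string 11011010010 given for Fig. 5; because the OR is commutative and associative, every node obtains the same global filter regardless of the order in which the broadcasts arrive.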

Fig. 6 is a flowchart of another implementation of step S202 in the embodiment shown in Fig. 2. This embodiment addresses a distributed database comprising a plurality of computing nodes, each of which stores a group consisting of a first data table and a second data table, and the transmission mode of the sub-bloom filter differs from that of Fig. 4. As shown in Fig. 6, step S202 may specifically include the following steps:

Step S601, generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node.

Step S602, obtaining setting information of the sub bloom filter corresponding to each other computing node.

The setting information describes the bits of the corresponding sub-bloom filter that are set, that is, each bit of the sub-bloom filter whose value is 1. For example, the setting information (2, 4, 8, 11) indicates that the 2nd, 4th, 8th and 11th bits of the bit array of the corresponding sub-bloom filter are set to 1, and the remaining bits are 0.
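As a minimal sketch, the setting information can be derived from a bit array by listing the positions whose value is 1. The function name `setting_info`, the string representation and the 11-bit sample array are hypothetical; the sample was chosen only so that it matches the example (2, 4, 8, 11) above.

```python
def setting_info(bits, first_index=1):
    """Return the positions of the bits that are set, counted from first_index."""
    return tuple(i + first_index for i, bit in enumerate(bits) if bit == "1")


# Hypothetical 11-bit sub-bloom filter whose 2nd, 4th, 8th and 11th bits are set.
print(setting_info("01010001001"))  # (2, 4, 8, 11)
```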

Each computing node broadcasts the setting information of the sub-bloom filter it has generated to the other computing nodes, so that each computing node obtains the complete bloom filter.

To determine the setting information faster, the setting information of each sub-bloom filter may be computed using acceleration based on the AVX instruction set.

Step S603, generating the bloom filter according to the setting information and the sub bloom filter corresponding to the computing node.

Specifically, for each computing node, based on the setting information sent by other computing nodes, the sub-bloom filter generated or constructed by the node is set, that is, the value at the position corresponding to each setting information is set to 1, so that a global and complete bloom filter can be obtained, and the second data table of each computing node is filtered based on the bloom filter.

Further, the setting information of each other computing node may be merged to delete repeated bits in the setting information to obtain merged setting information, and the sub-bloom filter corresponding to the computing node is set or updated based on the merged setting information to obtain a global and complete bloom filter.

For example, taking the sub-bloom filter generated by the current compute node as "01101001", if the set information sent by the other two compute nodes is (2, 3) and (2, 4, 7), respectively, the combined set information is (2, 3, 4, 7), and after the sub-bloom filter is updated based on the combined set information, the obtained bloom filter is "01111011", that is, the values at the 4 th bit and the 7 th bit of the sub-bloom filter are set to 1.

In one embodiment, the first bit of the bit array may be numbered either as bit 0 or as bit 1.

For example, fig. 7 is a schematic diagram of the setting transmission manner of the sub-bloom filters in the embodiment shown in fig. 6 of the present application. Fig. 7 takes three computing nodes, Seg1 to Seg3, as an example, with an 8-bit bit array for each sub-bloom filter. Each computing node broadcasts the setting information of the sub-bloom filter it has constructed to the other computing nodes, and each computing node then obtains the global, complete bloom filter based on the setting information sent by the other computing nodes and its own sub-bloom filter. In fig. 7, the setting information of Seg1, Seg2 and Seg3 is (1, 4), (4, 6) and (2, 4), respectively. Fig. 7 only shows the process by which Seg1 generates the complete bloom filter: Seg1 sets bits in its own sub-bloom filter based on the setting information (4, 6) sent by Seg2 and the setting information (2, 4) sent by Seg3, obtaining the complete bloom filter "01101010", where the first bit of the bit array is the 0th bit. The other computing nodes, Seg2 and Seg3, obtain the complete bloom filter in a similar manner.
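The two worked examples above (the filter "01101001" with bit positions counted from 1, and the Seg1 case of fig. 7 with positions counted from 0) can be reproduced as follows. This is only a sketch: the function name `apply_setting_info`, the `first_index` parameter and the string representation are assumptions made for illustration.

```python
def apply_setting_info(own_bits, received, first_index=1):
    """Merge received setting information and set those bits in own_bits."""
    merged = sorted(set().union(*received))  # drop repeated positions
    bits = list(own_bits)
    for pos in merged:
        bits[pos - first_index] = "1"
    return tuple(merged), "".join(bits)


# Example with positions counted from 1.
merged, full = apply_setting_info("01101001", [(2, 3), (2, 4, 7)])
print(merged, full)  # (2, 3, 4, 7) 01111011

# Fig. 7, Seg1: own setting information (1, 4) corresponds to "01001000"
# when the first bit is the 0th bit.
_, fig7 = apply_setting_info("01001000", [(4, 6), (2, 4)], first_index=0)
print(fig7)  # 01101010
```

Merging before applying means each position is written once, even when several computing nodes report the same bit.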

With the setting transmission manner, each computing node broadcasts only the setting information of its sub-bloom filter to the other computing nodes, so that each computing node obtains the global, complete bloom filter.

Fig. 8 is a flowchart of another hash connection method provided in an embodiment of the present application. This embodiment is directed to a scenario in which multiple groups of first data tables and second data tables need to be hash-connected, with each group of a first data table and a second data table stored on one corresponding computing node. This embodiment further refines step S202 and step S203 on the basis of the embodiment shown in fig. 2, and the method is applied to each computing node in a distributed database that stores one group of the first data table and the second data table.

As shown in fig. 8, the hash join method may include the steps of:

Step S801, based on the first hash function, performing a hash operation on the scanned first data table to obtain a hash table.

Step S802, generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node.

To improve the efficiency of generating the sub-bloom filter, the sub-bloom filter corresponding to the computing node may be generated based on only one hash function, namely the first hash function; that is, based only on the first hash values in the hash table corresponding to the computing node.

In order to reduce the misjudgment rate of the sub-bloom filter and filter more data in the second data table, the sub-bloom filter may be constructed based on a plurality of hash functions, one of which is the first hash function.

Optionally, generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node includes:

obtaining at least one second hash function; calculating a second hash value of each tuple in the first data table according to at least one second hash function; and constructing a sub-bloom filter corresponding to the computing node according to the first hash value and the second hash value.

The first hash value is the hash value in the hash table output in the build stage. The second hash function may be any hash function different from the first hash function.

In one embodiment, to improve efficiency, the at least one second hash function may be generated based on the first hash function.

Specifically, at least one second hash function may be obtained by adjusting an output domain, a mapping rule, and the like of the first hash function.

A new second hash function may also be generated based on at least one of an already generated second hash function and the first hash function.

Specifically, the plurality of second hash functions may be constructed in a linear manner by passing different offset values to the first hash function.

Based on the second hash function, a second hash value of each tuple in the first data table is calculated. Then, based on the second hash value and the previously calculated first hash value (the hash value in the hash table), the sub-bloom filter corresponding to the computing node is constructed in a manner similar to step S202, except that the hash values used change from "the first hash value" to "the first hash value and the second hash value".

In this manner, N + 1 hash operations need to be performed on each tuple in the first data table, where N is the number of second hash functions, yielding one first hash value and N second hash values for each tuple. The bits of the sub-bloom filter are then set based on the one first hash value and the N second hash values of each tuple, so as to obtain the sub-bloom filter corresponding to the computing node.
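As a minimal sketch of this construction, the following code sets N + 1 bits per tuple. The function names, the 64-bit array size, the sample join keys and the use of BLAKE2b with the offset passed as a salt are all assumptions made for illustration; the application does not name a concrete hash function or state how the offset is applied.

```python
import hashlib


def hash_with_offset(key, offset, num_bits):
    """Hypothetical hash family: one base hash, varied by an integer offset."""
    salt = offset.to_bytes(8, "little")
    digest = hashlib.blake2b(str(key).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % num_bits


def build_sub_bloom(keys, num_bits, offsets):
    """Set one bit per (key, offset) pair, i.e. N + 1 bits per tuple."""
    bits = [0] * num_bits
    for key in keys:
        for offset in offsets:
            bits[hash_with_offset(key, offset, num_bits)] = 1
    return bits


def may_contain(bits, key, offsets):
    """True if every bit for the key is set (false positives are possible)."""
    return all(bits[hash_with_offset(key, o, len(bits))] for o in offsets)


# Offset 0 stands in for the first hash function; offsets 1 and 2 give N = 2.
OFFSETS = (0, 1, 2)
join_keys = [10, 20, 30, 40]
sub_bloom = build_sub_bloom(join_keys, num_bits=64, offsets=OFFSETS)
print(all(may_contain(sub_bloom, k, OFFSETS) for k in join_keys))  # True
```

Every key inserted on the build side always passes the membership test, so a filter built this way never removes a tuple of the second data table that has a match.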

The more the number of the second hash functions is, the larger the overhead for constructing the sub-bloom filter is; the smaller the number of the second hash functions is, the higher the false positive rate of the sub bloom filter is, and in order to balance the false positive rate and the overhead, the number of the second hash functions needs to be set reasonably. The application also provides a method for determining the number of the second hash function, which specifically comprises the following steps:

determining a hash cost according to the first data table; determining a connection cost according to the second data table; and determining the number of hash functions of the sub-bloom filters corresponding to the computing node according to the hash cost and the connection cost, so as to obtain a corresponding number of second hash functions according to the number of the hash functions.

The hash cost is used for describing the overhead required by hash operation on each line or each tuple in the first data table. The connection cost is used for describing the overhead of the second data table in the hash connection if the second data table is not filtered.

Specifically, the hash cost may be determined based on the number of rows and the structure of the first data table. Based on the number of rows and the structure of the second data table, a connection cost is determined.

In one embodiment, the number of the second hash functions is in an inverse correlation relationship with the hash cost, and is in a positive correlation relationship with the connection cost.

Specifically, for each computing node corresponding to a group of the first data table and the second data table, the number of hash functions of the sub-bloom filter needs to be determined before the sub-bloom filter of the computing node is generated. This specifically includes: calculating the hash cost based on the structure and the number of rows of the first data table corresponding to the computing node, and calculating the connection cost based on the structure and the number of rows of the second data table corresponding to the computing node; and then determining the number of hash functions of the sub-bloom filter corresponding to the computing node based on the connection cost, the misjudgment rate and the hash cost.

In one embodiment, the number N of second hash functions of the sub-bloom filter may be calculated based on the following expression:

C(i) = (p(i) - p(i+1)) × C_join - i × C_hash

The value of C(i) is calculated for i = 1, 2, ... in steps of 1, and the number N of second hash functions is the value of i at which C(i) takes its minimum; that is, when i = N, C(N) is the minimum of the calculated values of C(i).

Here, p(i) is the misjudgment rate of the sub-bloom filter when the number of second hash functions is i, and i is a natural number; C_join is the connection cost described above, and C_hash is the hash cost described above.

In one embodiment, an upper limit may be set for N, such as 2, 3, 5, 7, 9, or another value.

In step S803, the data amounts required for transmitting the sub bloom filters in the first transmission method and the second transmission method are calculated, respectively.

The first transmission mode is the full transmission mode, and the second transmission mode is the setting transmission mode.

The data amount required by the first transmission mode and the second transmission mode can be determined based on the data distribution condition of the sub bloom filter.

Step S804, according to the data amount required for transmission, determining the target transmission mode of the sub bloom filter from the first transmission mode and the second transmission mode.

Specifically, the transmission mode with the minimum data size in the first transmission mode and the second transmission mode may be determined, and the transmission mode may be a target transmission mode, so as to reduce the overhead of the sub-bloom filter broadcast.
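The comparison in steps S803 and S804 can be sketched as follows. The cost model is an assumption made for illustration: the full transmission manner is taken to cost one bit per position of the bit array, and the setting transmission manner is taken to cost a fixed number of bytes (`index_bytes`, hypothetically 4) per set bit. The application only states that the data amounts depend on the data distribution of the sub-bloom filter.

```python
def choose_transmission(bits, index_bytes=4):
    """Pick the transmission manner that needs fewer bytes for this filter."""
    full_cost = (len(bits) + 7) // 8             # whole bit array, packed
    setting_cost = bits.count("1") * index_bytes  # one index per set bit
    # Assumption: on a tie, prefer the full transmission manner.
    mode = "full" if full_cost <= setting_cost else "setting"
    return mode, full_cost, setting_cost


sparse = "1" * 5 + "0" * 1019  # few bits set
dense = "10" * 512             # half of the bits set
print(choose_transmission(sparse))  # ('setting', 128, 20)
print(choose_transmission(dense))   # ('full', 128, 2048)
```

Under this model a sparsely populated sub-bloom filter favours the setting transmission manner, while a densely populated one favours the full transmission manner.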

Step S805, if the target transmission mode is the first transmission mode, broadcasting the sub-bloom filter corresponding to the computing node to other computing nodes.

Step S806, acquiring a sub bloom filter corresponding to each other computing node.

Each computing node involved in the hash connection, that is, each computing node corresponding to one group of the first data table and the second data table, only needs to generate its own sub-bloom filter in the above manner and then broadcast it to the other computing nodes that store a first data table, so that each computing node obtains the sub-bloom filters of all the computing nodes.

Step S807, generating the bloom filter according to the sub bloom filter corresponding to each of the calculation nodes. The process jumps to step S811.

Step S808, if the target transmission mode is the second transmission mode, broadcasting the setting information of the sub bloom filter corresponding to the computing node to other computing nodes.

Step S809, obtaining setting information of the sub bloom filters corresponding to the other computing nodes.

Step S810, generating the bloom filter according to the setting information and the sub bloom filter corresponding to the computing node.

After the complete bloom filter is obtained based on the branch corresponding to the first transmission method, i.e. step S805 to step S807, or based on the branch corresponding to the second transmission method, i.e. step S808 to step S810, the filtering of the second data table is performed based on the bloom filter, i.e. step S811 is performed.

And step S811, filtering the second data table according to the bloom filter.

When the hash distributions of the first data table and the second data table corresponding to the same computing node are different, because the operator corresponding to the hash operation of the first data table and the operator corresponding to the hash connection are in the same process after slicing, the second data table and the bloom filter need to be received and transmitted through the network. In order to reduce the data amount of the second data table during transmission, the second data table may be filtered by a bloom filter before the second data table is transceived.

Specifically, the bloom filter may be sent to the process where the second data table is located to filter the second data table, and then the filtered second data table is sent to the process corresponding to the first data table to perform hash connection with the hash table, so as to obtain the hash connection table.

For the case where the hash distribution of the first data table differs from that of the second data table, the second data table is filtered by the bloom filter before it is sent over the network, which reduces the amount of data transmitted and improves the efficiency of transmitting the second data table.
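The order of operations described above (build the hash table and bloom filter, filter the second data table where it resides, then probe with only the surviving tuples) can be sketched in a single process. Everything concrete here is an assumption for illustration: CRC32 as the first hash function, a 64-bit filter, rows as `(key, value)` tuples, and the sample tables. Network transfer between processes is not modelled.

```python
import zlib
from collections import defaultdict

NUM_BITS = 64  # hypothetical bloom filter size


def first_hash(key):
    """Stand-in for the first hash function."""
    return zlib.crc32(str(key).encode())


def build_side(first_table):
    """Build the hash table and, from the same hash values, the bloom filter."""
    hash_table = defaultdict(list)
    bloom = [False] * NUM_BITS
    for row in first_table:
        h = first_hash(row[0])
        hash_table[h].append(row)
        bloom[h % NUM_BITS] = True
    return hash_table, bloom


def filter_second_table(second_table, bloom):
    """Run where the second table resides; only survivors are transmitted."""
    return [row for row in second_table if bloom[first_hash(row[0]) % NUM_BITS]]


def hash_join(hash_table, filtered):
    """Probe the hash table and keep rows whose join keys are equal."""
    result = []
    for row in filtered:
        for build_row in hash_table.get(first_hash(row[0]), []):
            if build_row[0] == row[0]:  # rule out hash collisions
                result.append(build_row + row[1:])
    return result


first_table = [(1, "a"), (2, "b"), (3, "c")]
second_table = [(2, "x"), (3, "y"), (4, "z"), (2, "w"), (99, "q")]
hash_table, bloom = build_side(first_table)
filtered = filter_second_table(second_table, bloom)
print(sorted(hash_join(hash_table, filtered)))
# [(2, 'b', 'w'), (2, 'b', 'x'), (3, 'c', 'y')]
```

The final equality check on the join key is what keeps the result exact: a bloom filter false positive only lets an extra tuple reach the probe step, where it is discarded.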

Step S812, performing hash connection on the hash table and the filtered second data table to obtain and output a hash connection table corresponding to the computing node.

Specifically, the hash connection tables generated by the computing nodes may be sent to the same computing node, so that the computing nodes integrate the hash connection tables to obtain and output a hash connection result.

In an embodiment, the hash connection result may also be fed back to a target terminal, where the target terminal may be a user terminal, an online analysis terminal, or the like.

This embodiment addresses an application scenario in which multiple groups of first data tables and second data tables stored on multiple computing nodes of a distributed database are hash-connected. Each computing node performs a hash operation on its first data table based on the first hash function to obtain the corresponding hash table, dynamically constructs its sub-bloom filter based on the hash table, and, based on the data distribution of the sub-bloom filter, selects the transmission manner with the smaller data amount to broadcast the sub-bloom filter to the other computing nodes that need it, so that each computing node obtains the complete bloom filter and filtering accuracy is improved. The second data table of each computing node is then filtered based on the complete bloom filter, and the filtered second data table is hash-connected with the hash table, so as to determine the relationship between the first data table and the second data table, that is, to obtain the hash connection table.

An embodiment of the present application provides a hash connection apparatus, where the hash connection apparatus includes: the device comprises a hash operation module, a filter generation module, a filtering module and a hash connection module.

The hash operation module is used for carrying out hash operation on the scanned first data table based on a first hash function to obtain a hash table; a filter generation module for generating a bloom filter based on the hash table; the filtering module is used for filtering the second data table according to the bloom filter; and the Hash connection module is used for carrying out Hash connection on the Hash table and the filtered second data table to obtain and output a Hash connection table.

Optionally, the database is a distributed database, and includes a plurality of computing nodes, each computing node stores a first data table, and the filter generation module includes:

the sub-filter generating unit is used for generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node; the acquisition unit is used for acquiring the sub bloom filters corresponding to the other computing nodes; and the first filter generation unit is used for generating the bloom filters according to the sub-bloom filters corresponding to the calculation nodes.

Alternatively, the filter generation module includes:

the sub-filter generating unit is used for generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node; the setting information acquisition unit is used for acquiring setting information of the sub-bloom filters corresponding to other computing nodes; the second filter generation unit is used for generating the bloom filter according to the setting information and the sub bloom filter corresponding to the computing node; the setting information is used for describing the set bit of the corresponding sub-bloom filter.

Optionally, the apparatus further comprises:

the filter broadcasting module is used for respectively calculating the data volume required by the transmission of the sub-bloom filters in the first transmission mode and the second transmission mode after the sub-bloom filters corresponding to the calculation nodes are generated according to the hash table corresponding to the calculation nodes; determining a target transmission mode of the sub bloom filter from a first transmission mode and a second transmission mode according to the data volume required by transmission; broadcasting the sub-bloom filter corresponding to the computing node to other computing nodes based on the target transmission mode; broadcasting the sub-bloom filter corresponding to the computing node based on the first transmission mode, wherein the broadcasting comprises the following steps: broadcasting the sub-bloom filter corresponding to the computing node to other computing nodes; broadcasting the sub-bloom filter corresponding to the computing node based on a second transmission mode, comprising: broadcasting the setting information of the sub-bloom filter corresponding to the computing node to other computing nodes; the setting information is used for describing the position where the corresponding sub-bloom filter is set.

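Optionally, the sub-filter generation unit is specifically configured to:

obtain at least one second hash function; calculate a second hash value of each tuple in the first data table according to the at least one second hash function; and construct the sub-bloom filter corresponding to the computing node according to the first hash value and the second hash value, where the first hash value is a hash value in the hash table.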

Optionally, the apparatus further comprises:

the hash function number determining module is used for determining a hash cost according to the first data table; determining a connection cost according to the second data table; and determining the number of hash functions of the sub-bloom filters corresponding to the computing node according to the hash cost and the connection cost, so as to obtain a corresponding number of second hash functions according to the number of the hash functions.

Optionally, the filtering module is specifically configured to:

when the hash distribution of the second data table is different from that of the first data table, broadcasting the bloom filter to a process corresponding to the second data table; and filtering the tuple in the scanned second data table according to the bloom filter through the process corresponding to the second data table to obtain a filtered second data table, and sending the filtered second data table to the process corresponding to the first data table, so that the process corresponding to the first data table performs hash connection on the hash table and the filtered second data table.

The hash connection apparatus provided in the embodiment of the present application may be used to execute the technical solutions provided in any of the embodiments corresponding to fig. 2 to fig. 8; the implementation principles and technical effects are similar and are not described herein again.

Fig. 9 is a schematic structural diagram of a computing node provided in an embodiment of the present application, and as shown in fig. 9, the computing node provided in the embodiment includes:

at least one processor 910; and a memory 920 communicatively coupled to the at least one processor; wherein the memory 920 stores computer-executable instructions; the at least one processor 910 executes the computer-executable instructions stored by the memory to cause the computing node to perform a method as provided by any of the preceding embodiments.

Alternatively, the memory 920 may be separate or integrated with the processor 910.

For the implementation principle and the technical effect of the computing node provided by this embodiment, reference may be made to the foregoing embodiments, which are not described herein again.

The embodiment of the application also provides a distributed database which comprises a plurality of computing nodes. The computing nodes provided in the embodiment shown in fig. 9 are included in the plurality of computing nodes.

The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method provided by any one of the foregoing embodiments may be implemented.

The embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method provided in any of the foregoing embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.

It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (NVM), such as at least one disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.

The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device or host device.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods provided in the embodiments of the present application.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A hash join method, comprising:

based on the first hash function, performing hash operation on the first data table to obtain a hash table;

generating a bloom filter based on the hash table;

filtering the second data table according to the bloom filter;

and performing hash connection on the hash table and the filtered second data table to obtain and output a hash connection table.

2. The method of claim 1, wherein generating a bloom filter based on the hash table comprises:

generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node;

acquiring a sub bloom filter corresponding to each other computing node;

and generating the bloom filter according to the sub bloom filter corresponding to each computing node.

3. The method of claim 1, wherein generating a bloom filter based on the hash table comprises:

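generating a sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node;

acquiring setting information of the sub-bloom filters corresponding to other computing nodes;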

generating the bloom filter according to the setting information and the sub bloom filter corresponding to the computing node;

the setting information is used for describing the set bit of the corresponding sub-bloom filter.

4. The method according to claim 2 or 3, wherein after generating the sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node, the method further comprises:

respectively calculating the data amount required by transmitting the sub-bloom filter in the first transmission mode and the second transmission mode;

determining a target transmission mode of the sub bloom filter from a first transmission mode and a second transmission mode according to the data volume required by transmission;

broadcasting the sub-bloom filter corresponding to the computing node to other computing nodes based on the target transmission mode;

broadcasting the sub-bloom filter corresponding to the computing node based on the first transmission mode, wherein the broadcasting comprises the following steps:

broadcasting the sub-bloom filter corresponding to the computing node to other computing nodes;

broadcasting the sub-bloom filter corresponding to the computing node based on a second transmission mode, comprising:

broadcasting the setting information of the sub-bloom filter corresponding to the computing node to other computing nodes; the setting information is used to describe the positions at which the corresponding sub-bloom filter is set.

5. The method according to claim 2 or 3, wherein generating the sub-bloom filter corresponding to the computing node according to the hash table corresponding to the computing node comprises:

obtaining at least one second hash function;

calculating a second hash value of each tuple in the first data table according to at least one second hash function;

constructing a sub-bloom filter corresponding to the computing node according to the first hash value and the second hash value;

wherein the first hash value is a hash value in the hash table.

6. The method of claim 5, further comprising:

determining a hash cost according to the first data table;

determining connection cost according to the second data table;

and determining the number of hash functions of the sub-bloom filters corresponding to the computing node according to the hash cost and the connection cost, so as to obtain a corresponding number of second hash functions according to the number of the hash functions.

7. The method of any of claims 1-3, wherein filtering the second data table according to the bloom filter comprises:

when the hash distribution of the second data table is different from that of the first data table, broadcasting the bloom filter to a process corresponding to the second data table;

and filtering the tuple in the scanned second data table according to the bloom filter by the process corresponding to the second data table to obtain a filtered second data table, and sending the filtered second data table to the process corresponding to the first data table, so that the process corresponding to the first data table performs hash connection on the hash table and the filtered second data table.

8. A computing node, comprising:

a processor, and a memory communicatively coupled to the processor;

the memory stores computer-executable instructions;

the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1-7.

9. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, perform the method of any one of claims 1-7.

10. A computer program product, characterized in that it comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1-7.