CN112597148A - Data table connection method and device

Info

Publication number: CN112597148A
Application number: CN202011335852.8A
Authority: CN
Inventors: 李栋
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2021-04-02

Abstract

The invention discloses a data table connection method comprising the following steps: determining a first data table and a second data table to be connected, wherein the first data table comprises a plurality of first records and the second data table comprises a plurality of second records; determining a first key of each first record and a second key of each second record; pulling the first records and second records whose first key and second key match from their original partitions into one or more new partitions; and joining the first records and the second records in the one or more new partitions.

Description

Data table connection method and device

Technical Field

The invention relates to a big data processing technology, in particular to a method and a device for connecting data tables.

Background

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. During a Spark computation, the Shuffle process pulls data from one partition to another, which consumes network resources, memory, and disk I/O (input/output).

When a computation involves joining two data tables, the records of both tables must first be distributed across a plurality of partitions according to the key of each record, and the join operation is then performed within each partition on the records from the two tables. Distributing the records across the partitions involves a Shuffle operation that migrates data from one partition to another.

If the keys of the two data tables differ greatly (for example, 50% of the keys in the records of table A do not exist in table B), the join result for the records with such keys is empty. Nevertheless, under the usual join execution logic, the Shuffle operation is performed on all records of both data tables, so the Shuffle operations on records whose keys have no counterpart can be regarded as invalid. When there are many invalid Shuffle operations, overall computation performance is greatly reduced.
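To make the cost concrete, the following Python sketch estimates the share of records whose Shuffle would be wasted in an inner join. The function name and the sample key lists are hypothetical and are not part of the disclosure.

```python
def invalid_shuffle_fraction(keys_a, keys_b):
    """Fraction of all records whose key has no counterpart in the other table."""
    set_a, set_b = set(keys_a), set(keys_b)
    unmatched_a = sum(1 for k in keys_a if k not in set_b)
    unmatched_b = sum(1 for k in keys_b if k not in set_a)
    total = len(keys_a) + len(keys_b)
    return (unmatched_a + unmatched_b) / total if total else 0.0


# Hypothetical data: half of table A's keys are absent from table B.
table_a_keys = [1, 2, 3, 4]
table_b_keys = [3, 4]
fraction = invalid_shuffle_fraction(table_a_keys, table_b_keys)
print(f"{fraction:.2%} of the records would be shuffled for nothing")
```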

Disclosure of Invention

The present disclosure provides a method for connecting data tables, so as to solve at least the above technical problems in the prior art.

One aspect of the present disclosure provides a method for connecting data tables, including:

determining a first data table and a second data table to be connected; the first data table comprises a plurality of first records, and the second data table comprises a plurality of second records;

determining a first key of each first record and a second key of each second record;

pulling the first record and the second record to which the matched first key and second key belong from the original partition into one or more new partitions;

joining the first record and the second record in the one or more new partitions.

Wherein the determining the first key of each of the first records and the second key of each of the second records comprises:

generating a first key according to the specific field in the first record;

and generating a second key according to the specific field in the second record.

Wherein pulling the first record and the second record to which the matched first key and second key belong from the original partition into one or more new partitions comprises:

establishing one or more new partitions, and setting corresponding specific conditions for the one or more new partitions;

determining a first record of which a first key meets the specific condition, determining a second record of which a second key meets the specific condition, wherein the first key and the second key which meet the same specific condition are matched with each other;

the first record and the second record satisfying the same specific condition are pulled to the corresponding one or more new partitions.

Wherein the first record and the second record meeting the same specific condition are pulled to the corresponding one or more new partitions based on a hash algorithm.

Wherein the method further comprises:

if the original partition to which the matched first record belongs comprises other first records, pulling the other first records into one or more new partitions corresponding to the original partition;

if the original partition to which the matched second record belongs comprises other second records, pulling the other second records into one or more new partitions corresponding to the original partition;

and the one or more new partitions to which the matched first record and the second record belong, the one or more new partitions to which the other first records belong, and the one or more new partitions to which the other second records belong are different partitions.

Another aspect of the present disclosure provides a data table connection device, including:

the data storage module is used for determining a first data table and a second data table to be connected; the first data table comprises a plurality of first records, and the second data table comprises a plurality of second records;

the calculation module is used for determining a first key of each first record and a second key of each second record;

the pulling module is used for pulling the first record and the second record of the matched first key and the second key from the original partition into one or more new partitions;

and the connecting module is used for connecting the first record and the second record of the one or more new partitions.

Wherein the calculation module is used for generating a first key according to the specific field in the first record, and is further used for generating a second key according to the specific field in the second record.

Wherein the device further includes:

the resource partitioning module is used for establishing one or more new partitions and setting corresponding specific conditions for the established one or more new partitions;

the computing module is further configured to determine a first record of which a first key meets the specific condition, determine a second record of which a second key meets the specific condition, and determine that the first key and the second key meeting the same specific condition are matched with each other;

the pulling module is further configured to pull the first record and the second record that satisfy the same specific condition to the corresponding one or more new partitions.

The pulling module is used for pulling the first record and the second record meeting the same specific condition to one or more corresponding new partitions based on a hash algorithm.

The pulling module is further configured to pull the other first records into one or more new partitions corresponding to the original partition when the original partition to which the matched first record belongs includes the other first records;

the pulling module is further configured to pull the other second records into one or more new partitions corresponding to the original partition when the original partition to which the matched second record belongs includes the other second records.

In the scheme of the disclosure, a new partition is established, and the first records and second records meeting the specific condition of the new partition are pulled into it. For an original partition in which none of the records meets the condition of the new partition, the pull operation is not executed and the records stay in the original partition. This reduces Shuffle operations and saves the network resources, memory, and disk I/O (input/output) consumed in the Shuffle process; at the same time, the join is computed only for the records pulled into the new partition, which improves computation performance.

Drawings

FIG. 1 illustrates a flow diagram of a method for joining data tables, according to an embodiment;

FIG. 2 is a flow chart illustrating a method for linking data tables according to another embodiment;

FIG. 3 illustrates a diagram of a connection device structure for a data table according to one embodiment;

FIG. 4 is a diagram showing a configuration of a connection device of a data table according to another embodiment.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present disclosure provides a method for connecting data tables, as shown in fig. 1, including:

step 101, determining a first data table and a second data table to be connected; the first data table comprises a plurality of first records, and the second data table comprises a plurality of second records;

step 102, determining a first key of each first record and a second key of each second record;

step 103, pulling the first record and the second record of the matched first key and second key from the original partition into one or more new partitions;

step 104, concatenating the first record and the second record of the one or more new partitions.

In the embodiment of the present disclosure, for convenience of description, the two data tables to be joined are denoted as the first data table and the second data table; each record in the first data table is referred to as a first record, each record in the second data table is referred to as a second record, the key of a first record is referred to as a first key, and the key of a second record is referred to as a second key.

In the above example of the present disclosure, the first records and second records whose keys match are pulled into one or more new partitions, while records whose keys cannot be matched do not undergo the pull operation (Shuffle operation), so invalid Shuffle operations are reduced.
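Steps 101 to 104 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: partitions are modeled as lists of records, matching is taken to mean exact key equality, and the names `connect_tables`, `key_of` and `num_new_partitions` are hypothetical. It is not Spark code and not the implementation of the disclosure.

```python
def connect_tables(first_parts, second_parts, key_of, num_new_partitions=2):
    """Join two partitioned tables, moving only records whose keys match.

    Step 101: first_parts and second_parts are the two tables to be
    connected, each given as a list of partitions (lists of records).
    """
    # Step 102: determine the key of every record in both tables.
    first_keys = {key_of(r) for part in first_parts for r in part}
    second_keys = {key_of(r) for part in second_parts for r in part}
    matched = first_keys & second_keys  # assumption: exact key equality

    # Step 103: pull matched records into new partitions chosen by hash.
    new_parts = [{"first": [], "second": []} for _ in range(num_new_partitions)]
    for side, parts in (("first", first_parts), ("second", second_parts)):
        for part in parts:
            for record in part:
                key = key_of(record)
                if key in matched:
                    new_parts[hash(key) % num_new_partitions][side].append(record)

    # Step 104: join first and second records inside each new partition.
    joined = []
    for part in new_parts:
        for f in part["first"]:
            for s in part["second"]:
                if key_of(f) == key_of(s):
                    joined.append((f, s))
    return joined


# Hypothetical records of the form (key, payload).
first = [[(1, "a"), (2, "b")], [(3, "c")]]
second = [[(2, "x")], [(3, "y"), (4, "z")]]
print(sorted(connect_tables(first, second, key_of=lambda r: r[0])))
```

Records with keys 1 and 4 never enter a new partition in this sketch, which is the saving the scheme aims for.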

The above-described scheme is explained below by way of specific examples.

Each table is a Resilient Distributed Dataset (RDD) in Spark and is mapped onto a plurality of partitions; the partitions of different tables are independent of each other.

Assume that the first data table contains 100 first records (numbered 1-100), distributed as follows:

Partition1-1: first records 1-30;

Partition1-2: first records 31-70;

Partition1-3: first records 71-100.

The second data table contains 50 second records (numbered 1-50), distributed as follows:

Partition2-1: second records 1-30;

Partition2-2: second records 31-50.

Before the pull operation, the matched first records and second records are determined. Assume that the first records and second records whose keys match are first records 20-60 and second records 31-40. The pull operation is then performed as follows.

In this embodiment, the purpose of the pull operation is to pull first records 20-60 and second records 31-40 out of the first data table and the second data table. If first records 20-60 share an original partition with other first records, the other first records in that original partition also need to be pulled; if an original partition contains only other first records, the first records in that partition do not need to be pulled. Similarly, if second records 31-40 share an original partition with other second records, the other second records in that original partition also need to be pulled; if an original partition contains only other second records, the second records in that partition do not need to be pulled. It should be noted that each new partition corresponds to an original partition.

Assume that, after the pull operation, the records of the first data table and the second data table are distributed across partitions as follows:

Partition3-1: first records 1-19 (new Partition3-1 corresponds to original Partition1-1);

Partition3-2: first records 61-70 (new Partition3-2 corresponds to original Partition1-2);

Partition3-3: first records 71-100 (Partition3-3 is not a new partition but the original Partition1-3; its records need not be pulled and are only re-identified here);

Partition4-1: second records 1-30 (Partition4-1 is not a new partition but the original Partition2-1; its records need not be pulled and are only re-identified here);

Partition4-2: second records 41-50 (new Partition4-2 corresponds to original Partition2-2);

Partition5: first records 20-60 and second records 31-40 (new Partition5 corresponds to original Partition1-1, Partition1-2 and Partition2-2).
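The layout above can be reproduced with a short Python sketch. It assumes, as in the example, that record numbers identify records and that the sets of matched records are already known; the function name `split_partitions` and the dictionary layout are hypothetical.

```python
def split_partitions(table, matched):
    """Classify the records of one table by what the pull operation does to them.

    Returns (matched records to pull, other records to pull grouped by
    original partition, partitions left untouched).
    """
    pulled_matched, pulled_other, untouched = [], {}, {}
    for name, records in table.items():
        hit = [r for r in records if r in matched]
        rest = [r for r in records if r not in matched]
        if hit:
            pulled_matched.extend(hit)
            if rest:
                # Shares an original partition with matched records: pull too.
                pulled_other[name] = rest
        else:
            # No matched record here: the partition stays where it is.
            untouched[name] = list(records)
    return pulled_matched, pulled_other, untouched


first_table = {"1-1": range(1, 31), "1-2": range(31, 71), "1-3": range(71, 101)}
second_table = {"2-1": range(1, 31), "2-2": range(31, 51)}

m1, o1, u1 = split_partitions(first_table, set(range(20, 61)))
m2, o2, u2 = split_partitions(second_table, set(range(31, 41)))
moved = len(m1) + len(m2) + sum(map(len, o1.values())) + sum(map(len, o2.values()))
print(moved, "records moved;", sorted(u1) + sorted(u2), "left in place")
```

In this example 90 of the 150 records are moved, while Partition1-3 and Partition2-1 (60 records) are left in place.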

It should be noted that the number of new partitions above is only an example; the number is determined by a preconfigured parameter. Accordingly, the pulled first records 1-19 and 61-70 of the first data table may be distributed in one new partition or in several new partitions, which the present invention does not limit; similarly, the pulled second records 41-50 of the second data table may be distributed in one new partition or in several new partitions. The matched first records 20-60 and second records 31-40 may likewise be distributed in one new partition or in several new partitions. However, the new partitions holding the matched first records 20-60 and second records 31-40 are different from the new partitions holding the other first records and the other second records. Details are given in the examples below.

Then, when the join operation is performed: because Partition3-1, 3-2 and 3-3 contain no second records, either an empty join result is output directly or the content of the partition (that is, each of its records) is output as the join result, depending on the join operation; because Partition4-1 and 4-2 contain no first records, they are handled in the same way; for the records in Partition5, the join is computed according to the keys of the first records and second records, and the corresponding result is output.
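The per-partition join described above might look like the following Python sketch. It assumes two join types, an inner join (a single-sided partition yields an empty result) and a full outer join (a single-sided partition is output as is, padded with `None`); the function name `join_partition` and the `how` parameter are hypothetical.

```python
def join_partition(firsts, seconds, key_of, how="inner"):
    """Join the first and second records held by one partition."""
    if how not in ("inner", "full"):
        raise ValueError("how must be 'inner' or 'full'")

    # Shortcut for partitions that hold records of only one table.
    if not firsts or not seconds:
        if how == "inner":
            return []
        return [(f, None) for f in firsts] + [(None, s) for s in seconds]

    # Mixed partition: index the second records by key, then probe.
    by_key = {}
    for s in seconds:
        by_key.setdefault(key_of(s), []).append(s)

    out, used = [], set()
    for f in firsts:
        partners = by_key.get(key_of(f), [])
        if partners:
            used.add(key_of(f))
            out.extend((f, s) for s in partners)
        elif how == "full":
            out.append((f, None))
    if how == "full":
        out.extend((None, s) for s in seconds if key_of(s) not in used)
    return out
```

The shortcut branch is the part that corresponds to Partition3-1 through Partition4-2; only a partition such as Partition5 reaches the key comparison.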

Therefore, on the basis of ensuring the correctness of the join result, a large number of pulling operations can be reduced, and the calculation performance is improved.

The implementation of fig. 1 is explained in detail below:

First, the keys are generated as follows:

generating a first key according to the specific field in the first record;

generating a second key according to the specific field in the second record.

The specific field may be one field or a plurality of fields.

The embodiments of the disclosure do not limit how the keys of the first records and the second records are generated, as long as the first key and the second key are generated in the same way.
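As a minimal sketch of key generation, assume that each record is a dictionary and that the key is the tuple of the values of the chosen fields. The function name `make_key` and the field names are hypothetical; the only point taken from the text is that the same rule is applied to both tables so that the keys are comparable.

```python
def make_key(record, fields):
    """Build a key from one or more fields of a record."""
    return tuple(record[name] for name in fields)


# Hypothetical records from the first and second data tables.
first_record = {"user_id": 7, "region": "cn", "amount": 12.5}
second_record = {"user_id": 7, "region": "cn", "name": "example"}

key_fields = ["user_id", "region"]  # one field or several fields
first_key = make_key(first_record, key_fields)
second_key = make_key(second_record, key_fields)
print(first_key == second_key)
```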

After the key of each record is generated, the pull operation may be performed. Step 103, pulling the first record and the second record to which the matched first key and second key belong from the original partition into one or more new partitions, includes the following, as shown in fig. 2:

step 201, establishing one or more new partitions, and setting corresponding specific conditions for the established one or more new partitions;

step 202, determining a first record of which a first key meets the specific condition, determining a second record of which a second key meets the specific condition, wherein the first key and the second key which meet the same specific condition are matched with each other;

step 203, the first record and the second record meeting the same specific condition are pulled to the corresponding one or more new partitions.

Here, a hash algorithm may be employed to pull the matching first and second records into the corresponding new partition or partitions.

In one example, the specific condition relates to the key. For a numeric key, the specific condition may be configured as an interval of values: for example, the specific condition may be 0-1024, in which case the first records and second records whose keys fall within 0-1024 satisfy the specific condition. For a character-type key, a hash operation may be performed on each key to obtain a numeric key, and the specific condition is then configured according to an interval of the values or according to a characteristic of the values, for example that the last digit is odd or that the last digit is even. Of course, the hash operation may also be omitted for a character-type key, and the specific condition corresponding to the new partition may be configured according to characteristics of the characters.
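The kinds of specific condition mentioned above can be expressed as predicates over keys, as in this Python sketch. The use of CRC32 as the hash for character-type keys is an assumption made for illustration (the text does not name a hash function), and the names `numeric_key`, `in_range` and `last_digit_is_odd` are hypothetical.

```python
import zlib


def numeric_key(key):
    """Return a numeric form of the key; character keys are hashed (CRC32 assumed)."""
    if isinstance(key, int):
        return key
    return zlib.crc32(str(key).encode("utf-8"))


def in_range(low, high):
    """Condition: the numeric key lies in the closed interval [low, high]."""
    return lambda key: low <= numeric_key(key) <= high


def last_digit_is_odd(key):
    """Condition: the last digit of the numeric key is odd."""
    return numeric_key(key) % 2 == 1


condition = in_range(0, 1024)
print(condition(512), condition(2000), last_digit_is_odd(37))
```

A first record and a second record are treated as matched when their keys satisfy the same predicate.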

Taking the above example, assume that 2 new partitions A and B are created and correspond to one specific condition, and that first records 20-60 and second records 31-40 satisfy this specific condition; first records 20-60 and second records 31-40 are then pulled from their original partitions into the 2 new partitions A and B according to a hash algorithm.

It should be noted that the new partition established in step 201 is used to support pulling of the matched first record and second record, and the number is not limited, and may be one or multiple.

In addition, there may be multiple specific conditions, such as the value ranges 0-1024 and 1025-2047; in that case 0-1024 may correspond to one or more partitions, 1025-2047 may correspond to one or more partitions, and the partitions corresponding to 0-1024 and to 1025-2047 may be the same or different.
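One way to picture several conditions, each backed by its own group of new partitions, is the following Python sketch. The table `CONDITIONS`, the partition names, and the use of a simple modulo as the hash are all assumptions made for illustration.

```python
# Each entry maps a key interval to the new partitions that serve it.
# Intervals and partition names are hypothetical.
CONDITIONS = [
    ((0, 1024), ["A", "B"]),
    ((1025, 2047), ["C"]),
]


def target_partition(key):
    """Return the new partition for a numeric key, or None if no condition is met."""
    for (low, high), partitions in CONDITIONS:
        if low <= key <= high:
            # Simple modulo hash spreads keys over the partitions of the group.
            return partitions[key % len(partitions)]
    return None  # the record is not routed to a partition for matched records


for key in (10, 11, 1500, 5000):
    print(key, target_partition(key))
```

Because a first record and a second record with the same key always land in the same new partition, the join can then be computed partition by partition.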

It should be noted that, in the process of pulling the matching first record and second record:

the one or more new partitions to which the matched first record and second record belong, the one or more new partitions to which other first records belong, and the one or more new partitions to which other second records belong are different partitions.

And if the matched first record or second record does not exist in the original partition, the first record or second record in the original partition does not execute the pulling operation.

In addition, for the first data table:

since first records 20-30 (satisfying the specific condition) and first records 1-19 (not satisfying the specific condition) belong to the same original Partition1-1, the pull operation also needs to be performed on first records 1-19;

since first records 31-60 (satisfying the specific condition) and first records 61-70 (not satisfying the specific condition) belong to the same original Partition1-2, the pull operation also needs to be performed on first records 61-70;

thus, one or more new partitions may be established; assuming that 1 new partition C is established, first records 1-19 and 61-70 are pulled into new partition C according to the hash algorithm.

For the second data table:

since second records 31-40 (satisfying the specific condition) and second records 41-50 (not satisfying the specific condition) belong to the same original Partition2-2, the pull operation also needs to be performed on second records 41-50;

thus, one or more new partitions may be established; assuming that 2 new partitions D and E are established, second records 41-50 are pulled into new partitions D and E according to the hash algorithm.

The original partitions Partition1-3 and Partition2-1 (re-identified above as Partition3-3 and Partition4-1) contain no record that satisfies the above specific condition, so the pull operation is not performed on them.

At this point, the Shuffle operation process ends.

After the Shuffle operation process is finished, the join operation is performed on the first records and second records in the one or more new partitions created in step 201, and the result is output. For the unmatched first records or second records in the other new partitions (those distinct from the new partitions created in step 201, such as new partitions C, D and E above), and for the first records or second records remaining in the original partitions, an empty join result or the records of the partition can be output directly, because each of these partitions contains only first records or only second records.

In the above example of the present disclosure, a new partition is established, a first record and a second record that meet a specific condition of the new partition are pulled into the new partition, and a pulling operation is not performed on a record that does not meet a condition of the new partition in an original partition, so that the record stays in the original partition, thereby reducing Shuffle operations, saving network resource consumption, memory consumption and consumption of a disk IO (Input Output) in a Shuffle process, and meanwhile, only performing join calculation on a record that meets a condition of the new partition, thereby improving calculation performance.

As shown in fig. 3, an example of the present disclosure provides a data table connection apparatus 30, including:

a data storage module 31, configured to determine a first data table and a second data table to be connected; the first data table comprises a plurality of first records, and the second data table comprises a plurality of second records;

a calculation module 32, configured to determine a first key of each of the first records and a second key of each of the second records;

the pulling module 33 is configured to pull the first record and the second record belonging to the matched first key and second key from the original partition into one or more new partitions;

a linking module 34, configured to link the first record and the second record of the one or more new partitions.

The calculation module 32 is configured to generate a first key according to a specific field in the first record, and is further configured to generate a second key according to the specific field in the second record.

As shown in fig. 4, the apparatus 30 further includes:

the resource partitioning module 35 is configured to establish one or more new partitions, and set corresponding specific conditions for the established one or more new partitions;

the calculation module 32 is further configured to determine a first record of which a first key satisfies the specific condition, determine a second record of which a second key satisfies the specific condition, and determine that the first key and the second key satisfying the same specific condition are matched with each other;

the pulling module 33 is further configured to pull the first record and the second record that satisfy the same specific condition into the corresponding one or more new partitions.

The pulling module 33 is further configured to pull the first record and the second record that satisfy the same specific condition to the corresponding one or more new partitions based on a hash algorithm.

The pulling module 33 is further configured to, when the original partition to which the matched second record belongs includes other second records, pull the other second records to one or more new partitions corresponding to the original partition.

Illustratively, the present disclosure also provides an electronic device comprising:

a processor;

a memory for storing the processor-executable instructions;

the processor is used for reading the executable instruction from the memory and executing the instruction to realize the data table connection method.

Illustratively, the present invention also provides a computer-readable storage medium storing a computer program for executing the above-described data table linking method.

In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the methods according to the various embodiments of the present application described in the "exemplary methods" section of this specification, above.

The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.

Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.

The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.

The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims

1. A method for connecting data tables comprises the following steps:

2. The method for linking data tables according to claim 1, wherein the determining the first key of each first record and the second key of each second record comprises:

generating a first key according to the specific field in the first record;

3. The method for linking data tables according to claim 1 or 2, wherein the step of pulling the first record and the second record belonging to the matched first key and second key from the original partition into one or more new partitions comprises the steps of:

4. The method for linking data tables according to claim 3,

5. The method of claim 4, further comprising:

6. A data table connection apparatus comprising:

7. The data table connection device of claim 6,

8. The data table connection device according to claim 6 or 7, further comprising:

9. The data table connection device of claim 8,

and the pulling module is used for pulling the first record and the second record meeting the same specific condition to one or more corresponding new partitions based on a hash algorithm.

10. The data table connection device of claim 9,

the pull module is further configured to pull the other first records into one or more new partitions corresponding to the original partition when the original partition to which the matched first record belongs includes the other first records;