CN113821541A

CN113821541A - Data skew processing method, apparatus, storage medium, and program product

Info

Publication number: CN113821541A
Application number: CN202111139049.1A
Authority: CN
Inventors: 魏秀利
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2021-09-27
Filing date: 2021-09-27
Publication date: 2021-12-21
Also published as: WO2023045295A1

Abstract

Embodiments of the invention provide a data skew processing method, device, storage medium, and program product. The method comprises: querying the data tables to be connected through a query node in a logical execution plan to obtain the data-skewed Key; modifying the logical execution plan according to the data-skewed Key and a data skew policy, so that a first sub-table to be connected, which corresponds to the data-skewed Key and is split from the data tables to be connected, undergoes a map-side join (Map Join) when Spark executes the physical execution plan; and generating the physical execution plan from the modified logical execution plan. By modifying the logical execution plan, the method avoids at the root the problems caused by data skew, reduces scenario limitations, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.

Description

Data skew processing method, apparatus, storage medium, and program product

Technical Field

Embodiments of the present invention relate to the field of computer technologies, and in particular, to a data skew processing method, device, storage medium, and program product.

Background

In a distributed cluster system, different nodes are each responsible for a range of data storage or data computation. Data is often not dispersed evenly enough, so that a large amount of data is concentrated on one or a few service nodes; this is called data skew. Taking the distributed computing engine Spark as an example, when Spark performs a Shuffle, records with the same Key on every node must be pulled to a single task on one node for processing, and the progress of the whole Spark job is determined by its longest-running task. Consequently, once some Keys are skewed, the overall computing efficiency of Spark drops.

In the prior art, an Adaptive Query Execution (AQE) technology may be introduced from an engine kernel level, where the AQE automatically optimizes Query Execution using runtime statistical information for the data skew problem, dynamically discovers the amount of skewed data, and divides a skewed partition into smaller sub-partitions for processing.

However, in the course of implementing the present invention, the inventor found that the prior art has at least the following problems: the optimization of data skew by AQE depends on the accuracy of statistical information, and it supports only some scenarios, for example a Stage that contains only one Join, which is a limitation.

Disclosure of Invention

Embodiments of the present invention provide a data skew processing method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of data skew processing.

In a first aspect, an embodiment of the present invention provides a data skew processing method, including:

querying the data tables to be connected through a query node in a logical execution plan to obtain the data-skewed Key;

modifying the logical execution plan according to the data-skewed Key and a data skew policy, so that a first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;

and generating the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed through Spark.

In a second aspect, an embodiment of the present invention provides a data skew processing apparatus, including:

a query module, configured to query the data tables to be connected through a query node in a logical execution plan to obtain the data-skewed Key;

a modification module, configured to modify the logical execution plan according to the data-skewed Key and a data skew policy, so that a first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;

and a generating module, configured to generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed through Spark.

In a third aspect, an embodiment of the present invention provides a data skew processing device, including: at least one processor and a memory;

the memory stores computer-executable instructions;

the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the method as set forth in the first aspect above and in various possible designs of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method according to the first aspect and various possible designs of the first aspect are implemented.

In a fifth aspect, an embodiment of the present invention provides a computer program product, which includes a computer program that, when executed by a processor, implements the method as set forth in the first aspect and various possible designs of the first aspect.

The data skew processing method, device, storage medium, and program product provided by the embodiments query the data tables to be connected through a query node in a logical execution plan to obtain the data-skewed Key; modify the logical execution plan according to the data-skewed Key and a data skew policy, so that a first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan; and generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed through Spark. By modifying the logical execution plan, the method converts the processing of the skewed Key from Reduce Join to Map Join, which avoids at the root the problems caused by data skew, reduces scenario limitations, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic flow diagram of a join operation performed on two data tables according to an embodiment of the present invention;

FIG. 2 is a schematic flow diagram of a join operation performed after scattering the skewed Key, as provided in the prior art;

FIG. 3 is a schematic flowchart of a data skew processing method according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of a data skew processing method according to another embodiment of the present invention;

FIG. 5 is a schematic flow diagram of a join operation performed on two data tables according to another embodiment of the present invention;

FIG. 6 is a directed acyclic graph of a join operation performed on two data tables according to the prior art;

FIG. 7 is a directed acyclic graph of a join operation performed on two data tables according to another embodiment of the present invention;

FIG. 8 is a directed acyclic graph of a join operation performed on two data tables according to yet another embodiment of the present invention;

FIG. 9 is a schematic flowchart of a data skew processing method according to yet another embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a data skew processing apparatus according to an embodiment of the present invention;

FIG. 11 is a schematic hardware structure diagram of a data skew processing device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a distributed cluster system, different nodes are each responsible for a range of data storage or data computation. Data is usually not dispersed evenly enough, so a large amount of data is concentrated on one or a few service nodes, which is called data skew.

When the distributed computing engine Spark performs a Shuffle, records with the same Key on every node must be pulled to a single task on one node for processing, for example to aggregate or Join by Key. If the amount of data for one Key is particularly large, data skew occurs. For example, most Keys correspond to 10 records, but an individual Key corresponds to 1,000,000 records. Most tasks are then assigned only 10 records and finish within seconds, while an individual task is assigned 1,000,000 records and runs for two hours. The progress of the whole Spark job is therefore determined by its longest-running task.
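The straggler effect in this example can be sketched in a few lines of Python. This is only an illustration, not part of the embodiment: the function name `stage_duration` is hypothetical, and task time is assumed to grow in proportion to the number of records assigned to the task.

```python
def stage_duration(records_per_task):
    """A stage ends only when its slowest task ends.

    Assumption: task time is modelled as the number of records in the task.
    """
    return max(records_per_task)

balanced = [10] * 100               # every task handles 10 records
skewed = [10] * 99 + [1_000_000]    # one task receives the skewed Key

# Ratio between the skewed stage and the balanced stage.
slowdown = stage_duration(skewed) / stage_duration(balanced)
print(f"skewed stage takes {slowdown:,.0f} times as long")
```

Under this simplified model, the 99 fast tasks do not shorten the stage at all; only the one task holding the skewed Key matters.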

FIG. 1 is a schematic flow diagram of a join operation performed on two data tables according to an embodiment of the present invention. As shown in FIG. 1, table 1-1 holds the students' mathematics competition scores: its first column contains student IDs and its second column contains the mathematics competition scores for each ID. Table 2-1 holds the students' English competition scores: its first column contains student IDs and its second column contains the English competition scores for each ID. The two tables show that table 1-1 contains a large number of scores for the student whose ID is 001. When the two tables are joined, each ID acts as a Key; the data volume for Key001 is clearly large, so the task that processes Key001 takes more time than the tasks for other Keys. Key001 can therefore be said to be skewed. This example is given only to aid understanding of data skew; in practice, the data-volume condition that identifies a skewed Key may be set as needed.

In order to solve the above problem of data skew, two methods are generally adopted in the prior art.

One approach handles the problem at the application layer: the skewed Key is scattered with a technique such as Rand, that is, a random suffix is appended to the skewed Key so that the originally skewed data is broken up and dispersed. As shown in FIG. 2, Key001 in tables 1-1 and 2-1 is scattered with the Rand technique to obtain Keys 001-1, 001-2, and 001-3 in tables 1-2 and 2-2. The scattered tables 1-2 and 2-2 are then joined on the scattered Keys. However, this approach has two drawbacks. First, it breaks the original business logic and often complicates a simple problem. Second, once a Fetch Failure triggers data recomputation, Key001 in FIG. 2, for example, must be scattered again, and a record that received 001-1 in the previous scattering may receive 001-5 in the next, so the same data is assigned to different partitions and the result ends up containing duplicate data.
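The application-layer scattering described above can be sketched as follows. This is a minimal illustration under stated assumptions: `salt_rows`, the bucket count, and the seeds are hypothetical, and the two seeds merely stand in for the different random states of a first attempt and a recomputation.

```python
import random

def salt_rows(rows, num_buckets, seed=None):
    """Append a random suffix to each Key so one hot Key spreads over buckets."""
    rng = random.Random(seed)
    return [(f"{key}-{rng.randint(1, num_buckets)}", value) for key, value in rows]

# Four score records that all share the skewed Key 001.
rows = [("001", score) for score in (90, 85, 77, 60)]

# Different seeds model the first run and a rerun after a fetch failure.
first_attempt = salt_rows(rows, num_buckets=5, seed=1)
retry = salt_rows(rows, num_buckets=5, seed=2)

for (k1, v), (k2, _) in zip(first_attempt, retry):
    print(v, k1, k2)
```

Because the suffix comes from a random generator, nothing guarantees that a record receives the same suffix on recomputation, which is the source of the duplication risk noted above.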

The other approach handles the problem at the engine kernel layer: the Spark kernel introduces Adaptive Query Execution (AQE), which uses runtime statistics to optimize query execution automatically. For the data skew problem above, AQE dynamically detects the amount of skewed data and divides a skewed partition into smaller sub-partitions for processing. However, this approach has three drawbacks. First, AQE depends heavily on runtime statistics; if the statistics are inaccurate, data skew may be misjudged or missed. Second, AQE supports only scenarios with a single Join in the same Stage; multi-table Join scenarios are not supported, and the AQE skew optimization cannot be triggered when a Shuffle follows the Join. Third, AQE is positioned as skew handling at Partition granularity, so if the skewed data is all produced by the same Mapper, the problem cannot be solved.

Both optimizing at the application layer and optimizing through AQE in the Spark kernel therefore have drawbacks. In studying these technical problems, the inventor found that AQE is an optimization of the physical execution plan (Physical Plan), whereas the overall execution plan of Spark SQL mainly comprises a logical execution plan (Logical Plan) and a Physical Plan, and the Physical Plan is obtained by converting the Logical Plan. That is, if data skew is optimized in the Logical Plan, the problem can be solved at the root, and the drawbacks of working at the application layer or on the Physical Plan are avoided. On this basis, an embodiment of the present invention provides a data skew processing method that, by modifying the logical execution plan, splits the data effectively and converts the processing of the skewed Key from Reduce Join to Map Join to improve data processing capability. This avoids at the root the problems caused by data skew, reduces scenario limitations, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 3 is a flowchart illustrating a data skew processing method according to an embodiment of the present invention. As shown in fig. 3, the method includes:

301. Query the data tables to be connected through a query node in the logical execution plan to obtain the data-skewed Key.

In this embodiment, the data tables to be connected include at least two data tables, for example a first data table and a second data table, each of which may be queried through a query node in the logical execution plan. There may also be three or more data tables, such as a first, a second, and a third data table, which may be queried in sequence. This embodiment does not limit the number of data tables to be connected or the order in which they are queried.

Optionally, in some embodiments, querying the data tables to be connected through a query node in the logical execution plan to obtain the data-skewed Key includes: for each Key of the data tables to be connected, comparing the data volume corresponding to the Key with a first preset threshold; and if the data volume corresponding to the Key is greater than the first preset threshold, determining the Key to be a data-skewed Key.

Specifically, while each of the data tables to be connected is queried through the query node in the logical execution plan, every Key may be examined. Taking FIG. 1 as an example, assume that tables 1-1 and 2-1 are the data tables to be connected. In one implementation, Key001 in table 1-1 is queried first, followed by the other Keys such as Key002 and Key003 in sequence; then Key001 in table 2-1 is queried, again followed by the other Keys such as Key002 and Key003. In another implementation, the record counts of the different Keys are sorted first to form a new sequence, and the Keys are queried in that order, from the largest count to the smallest. This helps find skewed Keys as early as possible, and once the counts have become small enough that no further skewed Key can be found, the query can stop in time, which saves computation. The specific query method may be chosen according to actual needs and is not limited in this embodiment.

While each Key is queried, the data volume corresponding to the Key is compared with the first preset threshold, and when the data volume is greater than the first preset threshold, the Key is determined to be a skewed Key. The data volume here may be the storage size of the data, for example in megabytes or gigabytes, or the number of records. The first preset threshold may be a fixed value determined empirically.
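The threshold comparison, together with the sorted scan that stops early, can be sketched as follows. This is a minimal sketch, assuming the data volume is measured as a record count; the function name `find_skewed_keys` and the sample data are hypothetical.

```python
from collections import Counter

def find_skewed_keys(keys, first_threshold):
    """Return the Keys whose record count is greater than the first threshold."""
    counts = Counter(keys)
    skewed = []
    # most_common() yields Keys from the largest count to the smallest,
    # so the scan can stop at the first Key that is not above the threshold.
    for key, count in counts.most_common():
        if count <= first_threshold:
            break
        skewed.append(key)
    return skewed

# Key column of a small table in which Key 001 dominates.
key_column = ["001"] * 6 + ["002"] + ["003"] * 2
print(find_skewed_keys(key_column, first_threshold=3))
```

A size-based threshold in bytes would follow the same pattern with per-Key byte totals in place of counts.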

302. Modify the logical execution plan according to the data-skewed Key and the data skew policy, so that the first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.

In this embodiment, the logical execution plan is modified based on the data-skewed Key and the data skew policy, so that the first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, can be joined on the map side when the Spark computing engine subsequently executes the physical execution plan.

Both the map-side join (Map Join) and the reduce-side join (Reduce Join) perform a Join operation on the data tables to be connected, that is, they merge data from different data sources. Reduce Join tags the data in the Map stage and completes the merging of the data in the Reduce stage. Map Join completes the merging of the data directly in the Map stage and has no Reduce stage.

Specifically, table 1-1 and table 2-1 shown in FIG. 1 are taken as an example to illustrate the Reduce Join process. In the Map stage of Reduce Join, the input data is uniformly wrapped into a Bean. The Bean contains all the common and non-common attributes of table 1-1 and table 2-1, which is equivalent to a full outer join, plus a newly added attribute, the file name, that records whether the data comes from table 1-1 or table 2-1, so that the data can be processed in the Reduce stage. The Key output by the Map is the student ID, and the Value is the Bean. In the Shuffle stage, the Beans are sorted by ID, and all data with the same ID is aggregated under the same Key and sent to the same Reduce task. In the Reduce stage, for all Beans under the same ID, the source, table 1-1 or table 2-1, is distinguished first. If Map Join is performed on table 1-1 and table 2-1 instead, there is no Reduce process; all work is completed in the Map stage, which greatly reduces the cost of network transmission and input/output. The specific implementation is as follows: the data of table 1-1 or table 2-1, for example table 2-1, is cached in advance on every Map task node; then, when the data of table 1-1 arrives, it is joined directly with the cached data of table 2-1 and output.
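The difference between the two join styles can be sketched in single-process Python. This is only an illustration under stated assumptions: `reduce_join` and `map_join` are hypothetical names, rows are `(student_id, score)` tuples, inner-join semantics are assumed, and a dictionary stands in for the shuffle and for the per-node cache.

```python
from collections import defaultdict

def reduce_join(left, right):
    """Reduce-side join: tag rows by source, group by Key, merge per Key."""
    groups = defaultdict(lambda: {"left": [], "right": []})
    for key, value in left:                 # Map stage: tag with source "left"
        groups[key]["left"].append(value)
    for key, value in right:                # Map stage: tag with source "right"
        groups[key]["right"].append(value)
    joined = []
    for key in sorted(groups):              # Shuffle stage: sort and group by Key
        for lv in groups[key]["left"]:      # Reduce stage: merge rows per Key
            for rv in groups[key]["right"]:
                joined.append((key, lv, rv))
    return joined

def map_join(streamed, small):
    """Map-side join: cache the small table, then stream the other table past it."""
    cached = defaultdict(list)
    for key, value in small:                # cached in advance on each map task
        cached[key].append(value)
    return [(key, sv, cv) for key, sv in streamed for cv in cached.get(key, [])]

math_scores = [("001", 90), ("001", 85), ("002", 70)]
english_scores = [("001", 88), ("002", 65), ("003", 50)]

print(reduce_join(math_scores, english_scores))
print(map_join(math_scores, english_scores))
```

Both functions return the same joined rows; the map-side version simply never groups the streamed table by Key, which is the step that concentrates a skewed Key on one task.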

In this embodiment, taking tables 1-1 and 2-1 in FIG. 1 as an example, if the data-skewed Key is 001, the first sub-table to be connected includes a first sub-table, corresponding to 001, from table 1-1 and a second sub-table, corresponding to 001, from table 2-1. That the first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes Map Join when the distributed computing engine Spark executes the physical execution plan may mean that Map Join is performed on the first sub-table and the second sub-table when Spark executes the physical execution plan. Specifically, the first sub-table or the second sub-table, for example the second sub-table, is cached in advance on every Map task node; then, when the data of the first sub-table arrives, it is joined directly with the cached data of the second sub-table and output.

303. Generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed through Spark.

In this embodiment, Spark executes the API calls in code in the following main steps: first, DataFrame/Dataset/SQL code is written; second, if the code contains no errors, Spark converts it into a logical execution plan; third, Spark applies a series of optimizations to the generated logical execution plan and then converts the optimized logical execution plan into a physical execution plan; finally, Spark executes the physical execution plan, that is, it performs a series of operations on Resilient Distributed Datasets (RDDs) on the cluster.

In this embodiment, after the logical execution plan is modified according to the data-skewed Key and the data skew policy, the physical execution plan can be generated from the modified logical execution plan. The logical execution plan that is modified may be either the optimized logical execution plan or the logical execution plan before optimization, which is not limited in this embodiment.

In the data skew processing method provided by this embodiment, the data tables to be connected are queried through a query node in the logical execution plan to obtain the data-skewed Key; the logical execution plan is modified according to the data-skewed Key and the data skew policy, so that the first sub-table to be connected, which corresponds to the data-skewed Key and is obtained by splitting the data tables to be connected, undergoes Map Join when the distributed computing engine Spark executes the physical execution plan; and the physical execution plan is generated according to the modified logical execution plan, so that it is executed through Spark. By modifying the logical execution plan, the method converts the processing of the skewed Key from Reduce Join to Map Join, which avoids at the root the problems caused by data skew, reduces scenario limitations, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.

FIG. 4 is a schematic flowchart of a data skew processing method according to another embodiment of the present invention. As shown in FIG. 4, on the basis of the above embodiment, for example the embodiment shown in FIG. 3, this embodiment describes in detail how the logical execution plan is modified. The method includes:

401. Query the data tables to be connected through a query node in the logical execution plan to obtain the data-skewed Key.

In this embodiment, step 401 is similar to step 301 in the above embodiment, and is not described herein again.

402. Add to the logical execution plan a processing step that splits the data of the data tables to be connected according to the data-skewed Key, to obtain a first sub-table to be connected, corresponding to the data-skewed Key, and a second sub-table to be connected, corresponding to the Keys without data skew, so that the first sub-table to be connected undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.

403. Add to the logical execution plan a processing step that merges the joined first sub-table to be connected and the joined second sub-table to be connected to obtain a final data table, thereby obtaining the modified logical execution plan.

In some embodiments, the data tables to be connected include a first data table and a second data table, and the data skew Key is from the first data table and/or the second data table. That is, any one of the data tables to be connected may be determined to have a Key of data skew. As shown in FIG. 1, only one of the two data tables has a data-skewed Key, i.e., Key001 in Table 1-1. And as shown in FIG. 5, there is a data skew Key in both tables, i.e., the data skew Key in Table 1-1 is 001 and the data skew Key in Table 4-1 is 003. In this embodiment, the source and the number of keys for data skew are not limited.

To illustrate the modification of the logical execution plan in this embodiment more intuitively, the join process for the data tables to be connected, tables 1-1 and 4-1, is described below with reference to FIG. 5.

As shown in FIG. 5, the data-skewed Key in table 1-1 is 001, and the data-skewed Key in table 4-1 is 003. Therefore, splitting table 1-1 yields table 1-1-1, which contains the data for the data-skewed Keys 001 and 003, and table 1-1-2, which contains the non-skewed Key002; splitting table 4-1 yields table 4-1-1, which contains the data for the data-skewed Keys 001 and 003, and table 4-1-2, which contains the non-skewed Key002. That is, after tables 1-1 and 4-1 are split according to the data-skewed Keys, the first sub-table to be connected, composed of tables 1-1-1 and 4-1-1, and the second sub-table to be connected, composed of tables 1-1-2 and 4-1-2, are obtained. Tables 1-1 and 4-1 in FIG. 5 are for illustration only and show just three Keys, 001 to 003, to explain the join process. In an actual data table, the number of Keys may reach tens of thousands or tens of millions; that is, there may be very many non-skewed Keys, and the proportion of data-skewed Keys may be very small. In that case, tables 1-1-1 and 4-1-1 in the first sub-table to be connected can each be treated as a small table and joined by Map Join when Spark executes the physical execution plan, while tables 1-1-2 and 4-1-2 in the second sub-table to be connected can each be treated as a large table and joined by Reduce Join when Spark executes the physical execution plan. For the implementation of Map Join and Reduce Join, refer to the description of step 302, which is not repeated here.

After Map Join is performed on tables 1-1-1 and 4-1-1 of the first sub-table to be connected, table 5-1 is obtained; after Reduce Join is performed on tables 1-1-2 and 4-1-2 of the second sub-table to be connected, table 5-2 is obtained; and after tables 5-1 and 5-2 are merged, the final data table 5 is obtained. In some embodiments, merging the joined first sub-table to be connected and the joined second sub-table to be connected includes: merging them through a Union operator to obtain the final data table.
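The split, join, and Union sequence of steps 402 and 403 can be sketched as follows. This is a single-process illustration, not the modified Spark plan itself: all function names and the sample rows are hypothetical, inner-join semantics are assumed, and one in-memory hash join stands in for both the Map Join branch and the Reduce Join branch, since only the shape of the plan is being shown.

```python
from collections import defaultdict

def hash_join(streamed, cached_rows):
    """Inner join of (key, value) rows; a stand-in for either join strategy."""
    cached = defaultdict(list)
    for key, value in cached_rows:
        cached[key].append(value)
    return [(k, v, c) for k, v in streamed for c in cached.get(k, [])]

def split_by_keys(table, skewed_keys):
    """Split a table into rows with skewed Keys and rows with all other Keys."""
    first = [row for row in table if row[0] in skewed_keys]
    second = [row for row in table if row[0] not in skewed_keys]
    return first, second

def skew_aware_join(left, right, skewed_keys):
    left_skewed, left_rest = split_by_keys(left, skewed_keys)
    right_skewed, right_rest = split_by_keys(right, skewed_keys)
    table_5_1 = hash_join(left_skewed, right_skewed)   # Map Join branch
    table_5_2 = hash_join(left_rest, right_rest)       # Reduce Join branch
    return table_5_1 + table_5_2                       # Union of both branches

# Hypothetical contents for tables 1-1 and 4-1: Key 001 is heavy on the left,
# Key 003 is heavy on the right.
table_1_1 = [("001", 90), ("001", 85), ("001", 77), ("002", 70), ("003", 60)]
table_4_1 = [("001", 88), ("002", 65), ("003", 50), ("003", 55), ("003", 58)]

table_5 = skew_aware_join(table_1_1, table_4_1, skewed_keys={"001", "003"})
print(len(table_5))
```

Because every Key falls into exactly one of the two branches on both sides, the Union contains the same rows as a direct join of the unsplit tables.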

In some embodiments, the second sub-table to be connected undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan. Optionally, the Map Join is a broadcast hash join, BroadcastHashJoin, and the Reduce Join is a sort-merge join, SortMergeJoin.

In practice, the first sub-table to be connected can undergo Map Join in various ways. The following uses FIG. 7 and FIG. 8 to illustrate how the first sub-table to be connected undergoes Map Join when Spark executes the physical execution plan after the logical execution plan has been modified as in this embodiment, and uses FIG. 6 for comparison to illustrate the advantage gained by applying the data skew processing method of this embodiment.

To ensure that every first sub-table to be connected corresponding to a data-skewed Key can undergo Map Join, in some embodiments there are multiple data-skewed Keys and multiple first sub-tables to be connected, and the data-skewed Keys correspond one-to-one to the first sub-tables to be connected. In that case, the Map Join of the first sub-table to be connected when Spark executes the physical execution plan includes: for each data-skewed Key, the first sub-table to be connected corresponding to that Key undergoes Map Join when Spark executes the physical execution plan. As shown in FIG. 8, take tables 1-1-1 and 4-1-1 in FIG. 5, which include two data-skewed Keys, as an example. Each data-skewed Key corresponds to one first sub-table to be connected; that is, splitting table 1-1 yields two first sub-tables to be connected, one containing only the data of Key001 and the other containing only the data of Key003. In the subsequent Map Join process shown in FIG. 8, Stage1 and Stage2 handle one first sub-table to be connected and Stage3 and Stage4 handle the other; specifically, the Map Join of the data-skewed Keys is implemented by BroadcastHashJoin. In the directed acyclic graph shown in FIG. 8, Stage5 and Stage6 serve the Reduce Join of the second sub-table to be connected; after the processing of Stage5 and Stage6, the Reduce Join of the non-skewed Keys is implemented by SortMergeJoin. In Stage7, the two first sub-tables to be connected, on which BroadcastHashJoin has been executed, and the second sub-table to be connected, on which SortMergeJoin has been executed, are merged by Union to obtain the final data table.

To reduce the computation of the merge operation and the number of compute nodes that process the first sub-table to be connected, in some embodiments there are multiple data-skewed Keys and at least one first sub-table to be connected. Modifying the logical execution plan according to the data-skewed Key and the data skew policy then further includes: grouping the data-skewed Keys to obtain a plurality of groups, where the total data volume of each group is smaller than a second preset threshold and each group corresponds to one first sub-table to be connected. The Map Join of the first sub-table to be connected when Spark executes the physical execution plan includes: for each group, the first sub-table to be connected corresponding to the group undergoes Map Join when Spark executes the physical execution plan. Take tables 1-1-1 and 4-1-1 in FIG. 5, which include two data-skewed Keys, as an example. The data-skewed Keys 001 and 003 can be grouped based on the second preset threshold: assuming that the sum of the data volumes corresponding to 001 and 003 is smaller than the second preset threshold, 001 and 003 can be placed in one group, as shown in FIG. 5, and the data tables to be connected then have exactly that one group. As shown in FIG. 7, one first sub-table to be connected is obtained for that group, and its Map Join is implemented by performing BroadcastHashJoin after the processing of Stage1 and Stage2. In the directed acyclic graph shown in FIG. 7, Stage3 and Stage4 serve the Reduce Join of the second sub-table to be connected; after the processing of Stage3 and Stage4, the Reduce Join of the non-skewed Keys is implemented by SortMergeJoin. In Stage5, the first sub-table to be connected, on which BroadcastHashJoin has been executed, and the second sub-table to be connected, on which SortMergeJoin has been executed, are merged by Union to obtain the final data table. In this embodiment, the second preset threshold may be set empirically, which is not limited in this embodiment.

It is understood that if the to-be-connected data table contains many data-skewed Keys, for example 100, these Keys may be grouped based on the second preset threshold in various ways. In one implementation, the data-skewed Keys are first sorted by Key value. The data volume of the first Key in the sorted order is compared with the second preset threshold; if it is smaller, the total data volume of the first and second Keys is compared with the threshold; if that total is still smaller, the total data volume of the first, second, and third Keys is compared, and so on. When adding the N-th Key causes the total to exceed the second preset threshold, the first N-1 Keys are placed in one group, and the same procedure restarts from the N-th Key until all Keys in the sorted order have been traversed. In another implementation, the data-skewed Keys are sorted by the data volume corresponding to each Key and are then grouped in that order based on the second preset threshold. The grouping manner is not limited in this embodiment and may be selected according to actual needs.
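The two grouping manners described above can be sketched as a single greedy pass. This is only an illustrative sketch under stated assumptions: the function name `group_skewed_keys`, the `by_size` switch, and the sample data volumes are hypothetical; a group is closed as soon as adding the next Key would reach the threshold, so that each group total stays strictly below it; and a Key whose own volume already reaches the threshold is simply given its own group, a case the text does not address.

```python
from typing import Dict, List


def group_skewed_keys(
    sizes: Dict[str, int], threshold: int, by_size: bool = False
) -> List[List[str]]:
    """Greedily pack skewed keys into groups whose total volume stays below threshold."""
    # First manner: order by key value. Second manner: order by data volume.
    order = sorted(sizes, key=(lambda k: sizes[k]) if by_size else None)
    groups: List[List[str]] = []
    current: List[str] = []
    total = 0
    for key in order:
        # Close the current group once adding this key would reach the threshold.
        if current and total + sizes[key] >= threshold:
            groups.append(current)
            current, total = [], 0
        # Assumption: a key that alone reaches the threshold still forms its own group.
        current.append(key)
        total += sizes[key]
    if current:
        groups.append(current)
    return groups


# Hypothetical data volumes per skewed key.
sizes = {"001": 40, "002": 30, "003": 50, "004": 20}
print(group_skewed_keys(sizes, threshold=100))                # [['001', '002'], ['003', '004']]
print(group_skewed_keys(sizes, threshold=100, by_size=True))  # [['004', '002', '001'], ['003']]
```

Each returned group would correspond to one first to-be-connected sub-table, so fewer groups mean fewer Map Joins and fewer inputs to the final Union.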

Fig. 6 is a directed acyclic graph of a join operation performed on two data tables in the prior art. Fig. 7 is a directed acyclic graph of a join operation performed on two data tables according to another embodiment of the present invention. Fig. 8 is a directed acyclic graph of a join operation performed on two data tables according to another embodiment of the present invention. As shown in fig. 6, Stage1 and Stage2 each include the following steps in sequence: Range, Project, Exchange, custom shuffle reader (CustomShuffleReader), and Sort; Stage3 includes sort merge join (SortMergeJoin). As shown in fig. 7, Stage1 includes the following steps: Range, Filter, Project, Exchange, CustomShuffleReader, and broadcast exchange (BroadcastExchange); Stage2 includes the following steps: Range, Filter, Project, Exchange, and CustomShuffleReader; Stage3 and Stage4 each include Range, Filter, Project, Exchange, CustomShuffleReader, and Sort; Stage5 includes the following steps: broadcast hash join (BroadcastHashJoin), sort merge join (SortMergeJoin), merge (Union), and adaptive Spark execution plan (AdaptiveSparkPlan).

Figs. 6 to 8 are further described below using table 1-1 and table 4-1 shown in fig. 5 as an example. As shown in fig. 6, when two data tables are joined in the conventional way, table 1-1 is processed by Stage1 and table 4-1 is processed by Stage2, and SortMergeJoin is then executed in Stage3 to implement Reduce Join. As shown in fig. 7, after table 1-1-2 is processed by Stage3 and table 4-1-1 is processed by Stage4, SortMergeJoin in Stage5 is executed to implement Reduce Join and obtain table 5-2; after one of table 1-1-1 and table 4-1-2 is processed by Stage1 and the other is processed by Stage2, BroadcastHashJoin in Stage5 is executed to implement Map Join and obtain table 5-1; finally, Union is performed on table 5-1 and table 5-2, and the adaptive Spark execution plan is executed. Compared with fig. 7, fig. 8 shows the case in which there are too many data-skewed Keys and the skewed data is split into multiple sets of sub-tables: table 1-1-1 is further split into multiple next-level sub-tables, and the corresponding table 4-1-2 is likewise split into multiple next-level sub-tables. For each set of next-level sub-tables (one next-level sub-table of table 1-1-1 and the corresponding next-level sub-table of table 4-1-2), BroadcastHashJoin is executed after the processing of Stage1 and Stage2, so that multiple Map Joins are realized. Finally, the tables obtained by the multiple BroadcastHashJoin operations and the table obtained by SortMergeJoin are merged by the processing of Stage7, and the adaptive Spark execution plan is then executed.
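The flow of fig. 7 (split, Map Join for the skewed part, Reduce Join for the rest, then Union) can be imitated on in-memory lists to show that the merged result equals an ordinary join. The sketch below is plain Python and not Spark code: `broadcast_hash_join` and `sort_merge_join` are simplified stand-ins for BroadcastHashJoin and SortMergeJoin, all function names and sample rows are hypothetical, and the choice of which side is hashed is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str]          # (key, value)
Joined = Tuple[str, str, str]  # (key, left value, right value)


def split(table: List[Row], skewed: Set[str]) -> Tuple[List[Row], List[Row]]:
    """Split a table into rows with skewed keys and rows with non-skewed keys."""
    return ([r for r in table if r[0] in skewed],
            [r for r in table if r[0] not in skewed])


def broadcast_hash_join(big: List[Row], small: List[Row]) -> List[Joined]:
    """Map-side join: hash the small side once, then stream the big side past it."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(k, v, w) for k, v in big for w in lookup.get(k, [])]


def sort_merge_join(left: List[Row], right: List[Row]) -> List[Joined]:
    """Reduce-side join: sort both sides by key and merge matching key runs."""
    left, right = sorted(left), sorted(right)
    out: List[Joined] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key, j_end = left[i][0], j
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            while i < len(left) and left[i][0] == key:
                out.extend((key, left[i][1], w) for _, w in right[j:j_end])
                i += 1
            j = j_end
    return out


def skew_aware_join(left: List[Row], right: List[Row], skewed: Set[str]) -> List[Joined]:
    left_hot, left_cold = split(left, skewed)
    right_hot, right_cold = split(right, skewed)
    # Assumption: the right-hand rows of the skewed keys are small enough to broadcast.
    hot = broadcast_hash_join(left_hot, right_hot)
    cold = sort_merge_join(left_cold, right_cold)
    return hot + cold  # Union of the two partial results


left = [("001", "a1"), ("001", "a2"), ("001", "a3"), ("002", "b1"), ("003", "c1")]
right = [("001", "x"), ("002", "y"), ("004", "z")]
print(sorted(skew_aware_join(left, right, {"001"})))
```

Because every Key falls into exactly one of the two sub-tables, the union of the two partial joins contains the same rows as a single join over the unsplit tables.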

Before the embodiment of the present invention is adopted (that is, before optimization, using the processing mode shown in fig. 6), the Spark task running time is 5.5 minutes. After the embodiment of the present invention is adopted (that is, after optimization, using the processing mode shown in fig. 7 or fig. 8), the Spark task running time is 2.1 minutes, an overall performance improvement of about 60%.

404. Generate the physical execution plan according to the modified logic execution plan, so that the physical execution plan is executed by Spark.

Step 404 in this embodiment is similar to step 303 in the above embodiment, and is not described here again.

According to the data skew processing method provided by the embodiment of the invention, the logic execution plan is modified so that the data-skewed Keys are split from the non-skewed Keys. As a result, when Spark executes the physical execution plan, the first to-be-connected sub-table corresponding to the data-skewed Keys can undergo Map Join, which greatly reduces the running time and prevents the long processing time of data-skewed Keys from lowering the efficiency of the whole Spark task. The method is not limited to particular scenarios and addresses the data skew problem at its root.

Fig. 9 is a flowchart illustrating a data skew processing method according to yet another embodiment of the present invention. As shown in fig. 9, on the basis of the above-described embodiment, for example, on the basis of the embodiment shown in fig. 3, the generation process of the logical execution plan and the generation process of the physical execution plan from the modified logical execution plan are described in detail in this embodiment. The method comprises the following steps:

901. Parse the Structured Query Language (SQL) text into a syntax tree to generate an unresolved logic execution plan.

902. Analyze the unresolved logic execution plan to obtain a logic execution plan.

903. Add the query node to the logic execution plan.

In this embodiment, the logic execution plan is essentially a series of abstract transformations. It does not involve executors or the driver; it simply converts the user's set of expressions into an optimized version. Specifically, the user's code is first converted into an unresolved logic execution plan (Unresolved Logical Plan). It is called unresolved because it is not necessarily valid: a table name or column name that it references may or may not exist. Spark then uses the Catalog, a metadata repository of all data tables and DataFrames, to resolve and check the referenced table and column names in the Analyzer. If the unresolved logic execution plan passes this check, a resolved logic execution plan (Resolved Logical Plan) is obtained. A query node (Query Node) is then added to the logic execution plan so that the to-be-connected data table can be queried through the query node.
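The resolution step described above can be pictured with a toy model. The sketch below is not Spark's Analyzer or Catalog; the classes `UnresolvedPlan`, `ResolvedPlan`, and `AnalysisError`, the function `analyze`, and the dictionary-based catalog are all hypothetical names introduced only to illustrate checking referenced table and column names against a metadata repository.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


class AnalysisError(Exception):
    """Raised when a referenced table or column is not found in the catalog."""


@dataclass(frozen=True)
class UnresolvedPlan:
    table: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class ResolvedPlan:
    table: str
    columns: Tuple[str, ...]


def analyze(plan: UnresolvedPlan, catalog: Dict[str, Set[str]]) -> ResolvedPlan:
    """Check every referenced name against the catalog before accepting the plan."""
    if plan.table not in catalog:
        raise AnalysisError(f"table not found: {plan.table}")
    missing = [c for c in plan.columns if c not in catalog[plan.table]]
    if missing:
        raise AnalysisError(f"columns not found in {plan.table}: {missing}")
    return ResolvedPlan(plan.table, plan.columns)


# Hypothetical catalog: table name -> set of column names.
catalog = {"orders": {"key", "amount"}}
print(analyze(UnresolvedPlan("orders", ("key", "amount")), catalog))
```

Only a plan that passes this check would be a suitable place to add the query node, since the query node must refer to tables that actually exist.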

904. Query the to-be-connected data table through the query node in the logic execution plan to obtain the data-skewed Key.

905. Modify the logic execution plan according to the data-skewed Key and the data skew policy, so that the first to-be-connected sub-table that is split from the to-be-connected data table and corresponds to the data-skewed Key undergoes map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.

Step 904 and step 905 in this embodiment are similar to step 301 and step 302 in the above embodiment, and are not described again here.

906. Update the modified logic execution plan to obtain an updated logic execution plan.

907. Optimize the updated logic execution plan to obtain an optimized logic execution plan.

908. Convert the optimized logic execution plan into a physical execution plan.

Specifically, after the logic execution plan is modified, it needs to be updated so that the updated logic execution plan is used in the subsequent steps. The updated logic execution plan is passed to the optimizer (Catalyst Optimizer), which applies a series of optimizations to generate the optimized logic execution plan. Spark then converts the optimized logic execution plan into a physical execution plan, checking for applicable optimization strategies in the process. The physical execution plan specifies how the logical plan is executed on the cluster: different candidate physical execution operations are generated and compared through a cost model. After a physical plan is selected, Spark runs all of the code on RDDs, its underlying programming interface. Spark performs further optimization at runtime, generating native Java bytecode that can optimize tasks or stages during execution, and finally returns the result to the user.

According to the data skew processing method provided by the embodiment of the invention, the query node is added before the logic execution plan is optimized, and the data-skewed Keys found by the query are split from the non-skewed Keys, so that when Spark executes the physical execution plan, the first to-be-connected sub-table corresponding to the split-out data-skewed Keys can undergo Map Join, which greatly reduces the running time. The method avoids at the root the problems caused by data skew, reduces scenario limitations, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.

Fig. 10 is a schematic structural diagram of a data skew processing apparatus according to an embodiment of the present invention. As shown in fig. 10, the data skew processing apparatus 100 includes: a query module 1001, a modification module 1002, and a generation module 1003.

The query module 1001 is configured to query the to-be-connected data table through a query node in the logic execution plan, and obtain a Key of data skew.

The modification module 1002 is configured to modify the logic execution plan according to the data-skewed Key and the data skew policy, so that the first to-be-connected sub-table that is split from the to-be-connected data table and corresponds to the data-skewed Key undergoes map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.

A generating module 1003, configured to generate the physical execution plan according to the modified logic execution plan, so as to execute the physical execution plan through Spark.

The data skew processing apparatus provided by the embodiment of the invention,

In one possible design, the query module is specifically configured to:

for each Key of the to-be-connected data table, compare the data volume corresponding to the Key with a first preset threshold.

If the data volume corresponding to the Key is greater than the first preset threshold, determine the Key as a data-skewed Key.
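The threshold comparison performed by the query module can be sketched as follows. This is a minimal illustration under stated assumptions: the data volume corresponding to a Key is approximated by its row count, and the function name `find_skewed_keys` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def find_skewed_keys(keys: Iterable[str], first_threshold: int) -> Set[str]:
    """Return the keys whose row count is greater than the first preset threshold."""
    # Assumption: row count per key stands in for the data volume of that key.
    counts = Counter(keys)
    return {key for key, n in counts.items() if n > first_threshold}


# Hypothetical join-key column of a to-be-connected data table.
key_column = ["001"] * 5 + ["002"] * 2 + ["003"] * 4
print(sorted(find_skewed_keys(key_column, first_threshold=3)))  # ['001', '003']
```

A Key whose count equals the threshold is not treated as skewed here, matching the "greater than" comparison in the text.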

In one possible design, the modification module is specifically configured to:

adding the following processing steps to the logic execution plan:

splitting the data of the to-be-connected data table according to the data-skewed Key to obtain a first to-be-connected sub-table and a second to-be-connected sub-table corresponding to the non-skewed Keys.

Merging the first to-be-connected sub-table after joining and the second to-be-connected sub-table after joining to obtain a final data table.

In one possible design, the modification module is specifically configured to:

merging, through a Union operator, the first to-be-connected sub-table after joining and the second to-be-connected sub-table after joining to obtain a final data table.

In one possible design, the data tables to be connected include a first data table and a second data table, and the data skew Key is from the first data table and/or the second data table.

In one possible design, the number of the data skew keys is multiple, the number of the first sub-tables to be connected is multiple, and the data skew keys and the first sub-tables to be connected correspond to each other one by one; the modification module is specifically configured to: for each data-skewed Key, the first to-be-connected sub-table corresponding to the data-skewed Key performs Map Join when Spark executes a physical execution plan.

In one possible design, there are multiple data-skewed Keys and at least one first to-be-connected sub-table; the modification module is further configured to:

group the data-skewed Keys to obtain a plurality of groups, where the total data volume of each group is smaller than a second preset threshold and each group corresponds to one first to-be-connected sub-table.

For each group, the first to-be-connected sub-table corresponding to the group undergoes Map Join when Spark executes the physical execution plan.

In one possible design, the apparatus further includes:

and the syntax analysis module is used for analyzing the Structured Query Language (SQL) text into a syntax tree and generating an unresolved logic execution plan.

And the analysis module is used for analyzing the unresolved logic execution plan to obtain a logic execution plan.

A creation module to add the query node to the logic execution plan.

In one possible design, the generating module is specifically configured to:

and updating the modified logic execution plan to obtain an updated logic execution plan.

And optimizing the updated logic execution plan to obtain an optimized logic execution plan.

And converting the optimized logic execution plan into a physical execution plan.

In one possible design, the second to-be-connected sub-table undergoes reduce-side join (Reduce Join) when Spark executes the physical execution plan.

In one possible design, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort merge join (SortMergeJoin).

The data skew processing apparatus provided in the embodiment of the present invention may be used to implement the above method embodiments; the implementation principle and technical effect are similar and are not described here again.

Fig. 11 is a schematic diagram of a hardware structure of a data skew processing device according to an embodiment of the present invention, where the device may be a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Device 110 may include one or more of the following components: processing component 1101, memory 1102, power component 1103, input/output (I/O) interface 1104, and communications component 1106.

The processing component 1101 generally controls the overall operation of the device 110, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1101 may include one or more processors 1105 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1101 can include one or more modules that facilitate interaction between the processing component 1101 and other components. For example, the processing component 1101 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 1101.

The memory 1102 is configured to store various types of data to support operation at the device 110. Examples of such data include instructions for any application or method operating on device 110, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1102 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 1103 provides power to the various components of the device 110. The power components 1103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 110.

The I/O interface 1104 provides an interface between the processing component 1101 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

Communications component 1106 is configured to facilitate communications between device 110 and other devices in a wired or wireless manner. The device 110 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1106 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1106 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 110 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

The present application also provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data skew processing method performed by the data skew processing device is implemented.

The computer-readable storage medium may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.

An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the device.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

An embodiment of the present invention further provides a computer program product including a computer program; when the computer program is executed by a processor, the data skew processing method performed by the data skew processing device is implemented.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data skew processing method, comprising:

2. The method according to claim 1, wherein the obtaining a Key of data skew by querying a to-be-connected data table through a query node in a logic execution plan comprises:

for each Key of the data table to be connected, comparing the data volume corresponding to the Key with a first preset threshold;

3. The method of claim 1 or 2, wherein the modifying the logic execution plan according to the data-skewed Key and the data-skew policy comprises:

adding the following processing steps to the logic execution plan:

splitting the data of the to-be-connected data table according to the data-skewed Key to obtain a first to-be-connected sub-table and a second to-be-connected sub-table corresponding to the non-skewed Keys;

4. The method according to claim 3, wherein said merging the connected first sub-table to be connected and the connected second sub-table to be connected comprises:

5. The method according to claim 3, wherein the data tables to be connected comprise a first data table and a second data table, and the data-skewed Key is derived from the first data table and/or the second data table.

6. The method according to claim 5, wherein there are a plurality of the data-skewed keys, a plurality of the first sub-tables to be connected, and a one-to-one correspondence between the data-skewed keys and the first sub-tables to be connected;

the first to-be-connected sub-table corresponding to the data-skewed Key obtained by splitting the to-be-connected data table performs Map Join when Spark executes a physical execution plan, and the Map Join method includes:

for each data-skewed Key, the first to-be-connected sub-table corresponding to the data-skewed Key performs Map Join when Spark executes a physical execution plan.

7. The method according to claim 5, wherein the data-skewed Key is plural, the first to-be-connected sub-table is at least one;

the modifying the logic execution plan according to the data skew Key and the data skew policy further includes:

grouping the data-skewed Keys to obtain a plurality of groups, wherein the total data volume of each group is smaller than a second preset threshold;

each group corresponds to one first to-be-connected sub-table;

8. The method of claim 3, wherein before querying the to-be-connected data table by the query node in the logic execution plan, the method further comprises:

analyzing the Structured Query Language (SQL) text into a syntax tree and generating an unresolved logic execution plan;

analyzing the unresolved logic execution plan to obtain a logic execution plan;

adding the query node to the logic execution plan.

9. The method of claim 3, wherein generating the physical execution plan based on the modified logical execution plan comprises:

updating the modified logic execution plan to obtain an updated logic execution plan;

optimizing the updated logic execution plan to obtain an optimized logic execution plan;

10. The method according to claim 3, wherein the second to-be-connected sub-table undergoes reduce-side join (Reduce Join) when Spark executes the physical execution plan.

11. The method according to claim 10, wherein the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort merge join (SortMergeJoin).

12. A data skew processing apparatus, comprising:

13. A data skew processing apparatus, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the data skew processing method of any one of claims 1 to 11.

14. A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the data skew processing method according to any one of claims 1 to 11.

15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data skew processing method of any one of claims 1 to 11.