WO2023045295A1 - Data skew processing method, device, storage medium, and program product - Google Patents

Data skew processing method, device, storage medium, and program product

Info

Publication number
WO2023045295A1
WO2023045295A1 · PCT/CN2022/084642 · CN2022084642W
Authority
WO
WIPO (PCT)
Prior art keywords
data
execution plan
key
skew
sub
Prior art date
Application number
PCT/CN2022/084642
Other languages
French (fr)
Chinese (zh)
Inventor
魏秀利 (Wei Xiuli)
Original Assignee
北京沃东天骏信息技术有限公司 (Beijing Wodong Tianjun Information Technology Co., Ltd.)
北京京东世纪贸易有限公司 (Beijing Jingdong Century Trading Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2023045295A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the embodiments of the present invention relate to the field of computer technology, and in particular, to a data skew processing method, device, storage medium, and program product.
  • For a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. Often the data is not sufficiently dispersed, so a large amount of data becomes concentrated on one or a few service nodes; this is called data skew.
  • Take the distributed computing engine Spark as an example. When the Spark computing engine performs a shuffle, it needs to pull the data for the same key (Key) on each node to a single task on some node for processing. The running progress of the entire Spark job is determined by the task with the longest running time, so data skew on some Keys reduces the overall computing efficiency of Spark.
  • Adaptive Query Execution (AQE) technology can be introduced at the engine-core level.
  • For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution: it dynamically detects the amount of skewed data and divides skewed partitions into smaller sub-partitions for processing.
  • However, the prior art has at least the following problems: optimizing data skew through AQE technology depends on the accuracy of the statistical information, and it supports only some scenarios, for example only a single join within the same stage, which is a significant limitation.
  • Embodiments of the present invention provide a data skew processing method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of data skew processing.
  • an embodiment of the present invention provides a data skew processing method, including:
  • the data table to be connected is queried through the query node in the logical execution plan to obtain the Key of data skew, including:
  • the Key is determined as a Key with skewed data.
  • modifying the logical execution plan according to the data-skewed Key and the data skew strategy includes:
  • merging the joined first sub-table to be joined and the joined second sub-table to be joined includes:
  • the to-be-connected data tables include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table.
  • performing Map Join, when Spark executes the physical execution plan, on the first sub-table to be joined that corresponds to the data-skewed Key and is obtained by splitting the data table to be joined includes:
  • Map Join is performed on the first sub-table to be joined corresponding to the data-skewed Key when Spark executes the physical execution plan.
  • modifying the logical execution plan according to the data-skewed Key and the data skew policy further includes:
  • Each group corresponds to one of the first to-be-connected sub-tables
  • performing Map Join, when Spark executes the physical execution plan, on the first sub-table to be joined that corresponds to the data-skewed Key and is obtained by splitting the data table to be joined includes:
  • Map Join is performed on the first subtable to be joined corresponding to the group when Spark executes the physical execution plan.
  • the query node is added to the logical execution plan.
  • the generating the physical execution plan according to the modified logical execution plan includes:
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • an embodiment of the present invention provides a data skew processing device, including:
  • the query module is used to query the data table to be connected through the query node in the logical execution plan to obtain the key Key of the data skew;
  • a modifying module configured to modify the logical execution plan according to the data-skewed Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and split from the data table to be joined, undergoes Map Join when the distributed computing engine Spark executes the physical execution plan;
  • the generating module is configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  • the query module is specifically used for:
  • the Key is determined as a Key with skewed data.
  • the modification module is specifically used for:
  • the modification module is specifically used for:
  • the to-be-connected data tables include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table.
  • the modification module is specifically used for: for each data-skewed Key, performing Map Join on the first sub-table to be joined corresponding to that Key when Spark executes the physical execution plan.
  • the modification module is also used for:
  • Map Join is performed on the first subtable to be joined corresponding to the group when Spark executes the physical execution plan.
  • the device further includes:
  • a syntax analysis module, used to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
  • a parsing module configured to parse the unresolved logical execution plan to obtain the logical execution plan;
  • a creation module is used for adding the query node in the logical execution plan.
  • the generating module is specifically used for:
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • an embodiment of the present invention provides a data skew processing device, including: at least one processor and a memory;
  • the memory stores computer-executable instructions
  • the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the first aspect above and its various possible designs.
  • an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored; when a processor executes the computer-executable instructions, the method described in the first aspect above and its various possible designs is implemented.
  • an embodiment of the present invention provides a computer program product, including a computer program.
  • when the computer program is executed by a processor, the method described in the first aspect above and its various possible designs is implemented.
  • an embodiment of the present invention provides a chip for running instructions; the chip includes a memory and a processor, codes and data are stored in the memory, the memory is coupled to the processor, and the processor runs the codes in the memory to enable the chip to execute the method described in the first aspect above and its various possible designs.
  • an embodiment of the present invention provides a computer program, which is used to execute the method described in the first aspect and various possible designs of the first aspect when the computer program is executed by a processor.
  • The data skew processing method, device, storage medium, and program product provided in this embodiment query the data table to be joined through the query node in the logical execution plan to obtain the data-skewed Key; modify the logical execution plan according to that Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and generate the physical execution plan from the modified logical execution plan, so that it can be executed through Spark.
  • The data skew processing method provided in this embodiment changes the processing of skewed Keys from Reduce Join to Map Join by modifying the logical execution plan, which avoids many problems caused by data skew at the root, reduces scenario restrictions, reduces the dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • Fig. 1 is a schematic flow chart of linking two data tables provided by an embodiment of the present invention
  • Fig. 2 is a schematic flow diagram of a join operation in the prior art that splits the skewed Key;
  • FIG. 3 is a schematic flow chart of a data skew processing method provided by an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention.
  • Fig. 5 is a schematic flow chart of linking two data tables provided by another embodiment of the present invention.
  • Fig. 6 is a directed acyclic graph of a join operation on two data tables in the prior art;
  • Fig. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention;
  • Fig. 8 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention;
  • FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a hardware structure of a data skew processing device provided by an embodiment of the present invention.
  • For a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. Often the data is not sufficiently dispersed, resulting in a large amount of data being concentrated on one or a few service nodes, which is called data skew.
  • When the distributed computing engine Spark performs a shuffle, it needs to pull the data for the same Key on each node to a task on some node for processing, for example to perform aggregation or a join according to the Key. If the amount of data corresponding to a certain Key is particularly large, data skew occurs. For example, if most Keys correspond to 10 records but an individual Key corresponds to 1 million records, then most tasks are allocated only 10 records and finish in a few seconds, while an individual task may be allocated 1 million records and run for an hour or two. The running progress of the entire Spark job is therefore determined by the task with the longest running time.
  • Fig. 1 is a schematic flow diagram of a join operation on two data tables provided by an embodiment of the present invention. As shown in Fig. 1, Table 1-1 shows the students' mathematics competition results: the first column contains the student ID numbers, and the second column contains the mathematics competition results corresponding to the different IDs.
  • Table 1-2 shows the students' English competition results.
  • The first column of data includes the student ID number, and the second column is the English competition results corresponding to different IDs. From the two tables, it can be seen that the student with ID 001 in Table 1-1 has a large number of grade records. When the two tables are joined, each ID is equivalent to a Key; obviously the amount of data corresponding to Key001 is relatively large, and when the task corresponding to Key001 is processed, it takes more time than the other Keys. Therefore, it can be said that Key001 has data skew. Of course, this is just an example for a more vivid understanding of data skew; in actual applications, the conditions on the data volume of skewed Keys can be set as needed.
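The skew check described above can be sketched in plain Python (an illustrative model only, not part of the patented method or of Spark itself; the table contents and the threshold are made up for the example): count the rows per Key and flag any Key whose row count exceeds a preset threshold.

```python
from collections import Counter

# Hypothetical stand-in for Table 1-1: (student ID, maths result) rows,
# with ID 001 deliberately over-represented.
table_1_1 = [("001", 90), ("001", 85), ("001", 88), ("001", 92),
             ("002", 75), ("003", 80)]

def skewed_keys(rows, threshold):
    # Count rows per Key and keep the Keys above the first preset threshold.
    counts = Counter(key for key, _ in rows)
    return {key for key, n in counts.items() if n > threshold}

skew = skewed_keys(table_1_1, 2)   # ID 001 has 4 rows, the others 1 each
```

In a real deployment the "amount of data" could equally be bytes rather than a record count, as the description notes below.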
  • The Spark kernel introduces Adaptive Query Execution (AQE) technology. For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution, dynamically finds the skewed data, and divides the skewed partitions into smaller sub-partitions for processing.
  • On the one hand, AQE technology relies heavily on runtime statistical information; if the statistics are inaccurate, data skew will be misjudged or missed. On the other hand, AQE supports only the single-join scenario, not the multi-table join scenario, and the AQE data skew optimization logic is not triggered when a shuffle follows the join.
  • In addition, AQE-based data skew processing governs skew at the partition granularity; if the skewed data is produced by the same Mapper, the skew cannot be resolved.
  • Moreover, AQE technology is an optimization at the level of the physical execution plan (Physical Plan).
  • The overall execution plan of Spark SQL mainly includes the logical execution plan (Logical Plan) and the physical execution plan (Physical Plan).
  • The latter is obtained by transforming the former, which means that if we start from the logical execution plan and optimize for data skew, we can solve the data skew problem at the root and avoid defects at the application layer or the physical-execution-plan level. Based on this, the embodiment of the present invention provides a data skew processing method.
  • FIG. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention. As shown in Figure 3, the method includes:
  • The data tables to be joined may include at least two data tables, for example a first data table and a second data table, which can be queried separately through the query node in the logical execution plan; there may also be three or more data tables, such as a first, second, and third data table, which can be queried in sequence. In this embodiment, the number of data tables to be joined and the query order are not limited.
  • Querying the data table to be joined through the query node in the logical execution plan to obtain the data-skewed Key includes: for each Key of the data table to be joined, comparing the data amount corresponding to the Key with a first preset threshold; if the data amount corresponding to the Key is greater than the first preset threshold, determining the Key as a data-skewed Key.
  • The query can be performed Key by Key. As shown in Figure 1, assuming Table 1-1 and Table 1-2 are the data tables to be joined, one feasible way is to first query Key001 in Table 1-1, then query the other Keys such as Key002 and Key003 in turn, then query Key001 in Table 2-1, and then query its other Keys such as Key002 and Key003 in turn. This approach helps find the skewed Keys as early as possible, and if a Key's data volume is too small for it to become a skewed Key, the query can be stopped in time to save computation.
  • the specific query mode to be used may be determined according to actual needs, which is not limited in this embodiment.
  • the amount of data corresponding to the Key may be compared with a first preset threshold.
  • The amount of data here may be the size of the data, such as how many megabytes or gigabytes, or the number of records.
  • the first preset threshold may be a fixed value determined empirically.
  • a map-side join (Map Join) is performed.
  • The logical execution plan is modified based on the data-skewed Key and the data skew strategy, so that when the Spark computing engine subsequently executes the physical execution plan, it can perform a map-side join on the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined.
  • A map-side Map Join or a reduce-side Reduce Join can be used to join the data tables to be joined, that is, to merge data from different data sources.
  • Reduce Join tags the data in the Map stage and merges the data in the Reduce stage.
  • Map Join merges the data directly in the Map stage, with no Reduce stage.
  • Take Table 1-1 and Table 2-1 in Figure 1 as an example to illustrate the Reduce Join process.
  • the input data will be uniformly encapsulated into a Bean.
  • This Bean contains all common and non-common attributes of Table 1-1 and Table 2-1, which is equivalent to a full outer join, and adds a new attribute, the file name, to distinguish whether the data comes from Table 1-1 or Table 2-1, which is convenient for data processing in the Reduce phase. The Key output by the Map is the student ID, and the Value is the Bean.
  • In the Reduce phase, the Beans are sorted by ID; all data with the same ID are aggregated under the same Key and sent to the same Reduce task, where the rows are merged according to their source, whether Table 1-1 or Table 2-1.
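The Reduce Join flow just described can be modelled in plain Python (an illustrative sketch, not Spark code; a dict plays the role of the Bean, with a "source" tag standing in for the file-name attribute):

```python
from collections import defaultdict

def reduce_join(table_a, table_b):
    # "Map stage": tag every record with its source table, keyed by the ID.
    tagged = defaultdict(list)
    for key, value in table_a:
        tagged[key].append(("table_a", value))
    for key, value in table_b:
        tagged[key].append(("table_b", value))
    # "Reduce stage": per Key, merge values coming from the two sources.
    joined = []
    for key, beans in tagged.items():
        a_vals = [v for src, v in beans if src == "table_a"]
        b_vals = [v for src, v in beans if src == "table_b"]
        for a in a_vals:
            for b in b_vals:
                joined.append((key, a, b))
    return joined
```

The grouping of all records for one Key into `tagged[key]` models the shuffle that sends every record with the same ID to the same Reduce task.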
  • If Map Join is performed on Table 1-1 and Table 2-1 instead, there is no Reduce process; all work is completed in the Map phase, which greatly reduces the cost of network transmission and input/output.
  • For example, the first sub-table to be joined includes a first sub-data table corresponding to Key001 from Table 1-1 and a second sub-data table corresponding to Key001 from Table 2-1.
  • Performing a map-side join (Map Join) on the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, when the distributed computing engine Spark executes the physical execution plan can be implemented in Spark by map-joining the first sub-data table and the second sub-data table: specifically, the second sub-data table is pre-cached on each Map task node, and when the data of the first sub-data table arrives, the pre-stored data of the second sub-data table is joined directly with the data of the first sub-data table and output.
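The map-side variant can be sketched the same way (again a plain-Python illustration, not Spark's BroadcastHashJoin implementation): the small table is materialized as an in-memory lookup, standing in for the table pre-cached on each Map task node, so rows of the large table are joined as they stream through, with no Reduce stage.

```python
def map_join(large_table, small_table):
    # "Broadcast" the small table: build a per-Key lookup held in memory.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Stream the large table and merge directly in the Map stage.
    return [(key, a, b)
            for key, a in large_table
            for b in lookup.get(key, [])]
```

Because nothing is shuffled by Key, the skewed Key's records never pile up on a single reducer.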
  • Spark's execution of the API in the code mainly includes the following steps: first, write the DataFrame/Dataset/SQL code; second, if the written code has no errors, Spark converts the code into a logical execution plan; third, Spark performs a series of optimizations on the generated logical execution plan and converts the optimized logical execution plan into a physical execution plan; finally, Spark executes the physical execution plan, that is, performs a series of operations on Resilient Distributed Datasets (RDDs).
  • a physical execution plan may be generated based on the modified logical execution plan.
  • the logical execution plan to be modified may be an optimized logical execution plan or an unoptimized logical execution plan, which is not limited in this embodiment.
  • In this embodiment, the data table to be joined is queried through the query node in the logical execution plan to obtain the data-skewed Key; the logical execution plan is modified according to the data-skewed Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and the physical execution plan is generated from the modified logical execution plan, so that it can be executed through Spark.
  • The data skew processing method provided in this embodiment changes the processing of skewed Keys from Reduce Join to Map Join by modifying the logical execution plan, which avoids many problems caused by data skew at the root, reduces scenario restrictions, reduces the dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • FIG. 4 is a schematic flow chart of a data skew processing method provided by another embodiment of the present invention. As shown in FIG. 4, on the basis of the above embodiments, for example the embodiment shown in FIG. 3, this embodiment describes in detail how to modify the logical execution plan; the method includes:
  • step 401 is similar to step 301 in the foregoing embodiment, and will not be repeated here.
  • a processing step for the sub-tables to be joined, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
  • the data tables to be connected include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table. That is to say, any data table in the data table to be connected may be determined to have a key with data skew.
  • only one of the two data tables has a key with skewed data, that is, Key001 in Table 1-1.
  • There are data-skewed Keys in both data tables, that is, the data-skewed Key in Table 1-1 is 001 and the data-skewed Key in Table 4-1 is 003.
  • The join process of the data tables to be joined, Table 1-1 and Table 4-1, is illustrated below in conjunction with FIG. 5.
  • the key of the data skew in Table 1-1 is 001
  • The data-skewed Key in Table 4-1 is 003. Therefore, after Table 1-1 is split, Table 1-1-1 containing the data corresponding to the skewed Keys 001 and 003, and Table 1-1-2 containing the non-skewed Key 002, are obtained; after Table 4-1 is split, Table 4-1-1 containing the data corresponding to the skewed Keys 001 and 003, and Table 4-1-2 containing the non-skewed Key 002, are obtained. That is, after Table 1-1 and Table 4-1 are split according to the data-skewed Keys, the first sub-table to be joined, consisting of Table 1-1-1 and Table 4-1-1, and the second sub-table to be joined, consisting of Table 1-1-2 and Table 4-1-2, are obtained.
  • Table 1-1 and Table 4-1 in Figure 5 are just examples, and only the three Keys 001 to 003 are shown in order to describe the data table join process. In an actual data table, the number of Keys can reach tens of thousands or even tens of millions.
  • Table 1-1-1 and Table 4-1-1 in the first sub-table to be joined are both small tables on which Map Join can be performed when Spark executes the physical execution plan, while Table 1-1-2 and Table 4-1-2 in the second sub-table to be joined are large tables on which Reduce Join can be performed when Spark executes the physical execution plan.
  • the implementation process of Map Join and Reduce Join can refer to the description of step 302, and will not be repeated here.
  • Merging the joined first sub-table to be joined and the joined second sub-table to be joined includes: using a Union operator to merge the joined first sub-table to be joined with the joined second sub-table to be joined to obtain the final data table.
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
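The split / join / union pipeline described above can be sketched end to end in plain Python (an illustrative stand-in only: `_hash_join` models BroadcastHashJoin, `_sort_merge_join` models SortMergeJoin, and the function and table names are hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def _hash_join(left, right):
    # Models BroadcastHashJoin: build an in-memory lookup from one side.
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    return [(k, a, b) for k, a in left for b in lookup.get(k, [])]

def _sort_merge_join(left, right):
    # Models SortMergeJoin: sort both sides, group by Key, emit matches.
    lg = {k: [v for _, v in g] for k, g in groupby(sorted(left), key=itemgetter(0))}
    rg = {k: [v for _, v in g] for k, g in groupby(sorted(right), key=itemgetter(0))}
    return [(k, a, b) for k in sorted(lg.keys() & rg.keys())
            for a in lg[k] for b in rg[k]]

def skew_aware_join(table_a, table_b, skew_keys):
    # Split each table into a skewed first sub-table and a non-skewed second sub-table.
    a_skew = [r for r in table_a if r[0] in skew_keys]
    a_rest = [r for r in table_a if r[0] not in skew_keys]
    b_skew = [r for r in table_b if r[0] in skew_keys]
    b_rest = [r for r in table_b if r[0] not in skew_keys]
    # Map Join the skewed sub-tables, Reduce Join the rest, then Union the results.
    return _hash_join(a_skew, b_skew) + _sort_merge_join(a_rest, b_rest)
```

The final concatenation plays the role of the Union operator that merges the two joined sub-tables into the final data table.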
  • In some embodiments, there are multiple data-skewed Keys and multiple first sub-tables to be joined, and the data-skewed Keys correspond one-to-one with the first sub-tables to be joined.
  • Performing Map Join, when Spark executes the physical execution plan, on the first sub-tables to be joined that correspond to the data-skewed Keys and are obtained by splitting the data table to be joined then includes: for each data-skewed Key, performing Map Join on the first sub-table to be joined corresponding to that Key.
  • In this case, each data-skewed Key corresponds to one first sub-table to be joined; that is, when Table 1-1 is split, two first sub-tables to be joined are obtained: one contains only the data of Key001, while the other contains only the data of Key003.
  • Stage1 and Stage2 target one first sub-table to be joined, and Stage3 and Stage4 target the other first sub-table to be joined; specifically, both realize the Map Join of the data-skewed Keys through BroadcastHashJoin.
  • Stage5 and Stage6 are for the second sub-table to be joined.
  • The Reduce Join of the non-skewed Keys is realized through SortMergeJoin.
  • In Stage7, the two first sub-tables to be joined processed by BroadcastHashJoin and the second sub-table to be joined processed by SortMergeJoin are merged by the Union operator to obtain the final data table.
  • In some embodiments, modifying the logical execution plan according to the data-skewed Keys and the data skew strategy further comprises: grouping the data-skewed Keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold and each group corresponds to one first sub-table to be joined; performing Map Join when Spark executes the physical execution plan on the first sub-tables to be joined obtained by splitting the data table to be joined then includes: for each group, performing Map Join on the first sub-table to be joined corresponding to that group.
  • For example, the skewed Keys 001 and 003 can be grouped based on the second preset threshold: assuming the sum of the data volumes corresponding to 001 and 003 is less than the second preset threshold, 001 and 003 can be placed in one group as shown in FIG. 5, and the data table to be joined then has only this one group. As shown in Figure 7, for this group one first sub-table to be joined is obtained; after the processing of Stage1 and Stage2, BroadcastHashJoin can be performed on it to realize the Map Join of the first sub-table to be joined.
  • Stage3 and Stage4 are for the second sub-table to be joined.
  • The Reduce Join of the non-skewed Keys is realized through SortMergeJoin.
  • In Stage5, the first sub-table to be joined processed by BroadcastHashJoin and the second sub-table to be joined processed by SortMergeJoin are merged by the Union operator to obtain the final data table.
  • The second preset threshold may be set according to experience, which is not limited in this embodiment.
  • For example, the 100 data-skewed Keys can be grouped based on the second preset threshold in various ways.
  • In one way, the data-skewed Keys can be sorted by Key number, and the data volume of the first Key in the ordering is compared against the second preset threshold. If it is less than the threshold, the total data volume of the first and second Keys is compared against the threshold; if that is still less, the total of the first, second, and third Keys is compared, and so on, until at the N-th Key the total exceeds the second preset threshold. The Keys before the N-th are then placed into one group, and the above judgment continues from the N-th Key.
  • In another way, the data-skewed Keys may be sorted by the data volume corresponding to each Key, and the sorted Keys are then grouped based on the second preset threshold. This is not limited in this embodiment and can be selected according to actual needs.
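The first grouping strategy above can be sketched as a greedy pass in plain Python (illustrative only; the Key numbers, data volumes, and threshold below are hypothetical): walk the sorted skewed Keys, accumulating their data volumes, and start a new group whenever adding the next Key would exceed the second preset threshold.

```python
def group_keys(key_sizes, threshold):
    # key_sizes: {key: data volume}; threshold: the second preset threshold.
    groups, current, total = [], [], 0
    for key, size in sorted(key_sizes.items()):
        if current and total + size > threshold:
            groups.append(current)       # close the group before it overflows
            current, total = [], 0
        current.append(key)
        total += size
    if current:
        groups.append(current)
    return groups
```

With a threshold of 100 and volumes {001: 40, 003: 50}, the two Keys land in one group, matching the single-group case of FIG. 5; with a threshold of 60 each Key gets its own group.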
  • FIG. 6 is a directed acyclic graph of a join operation on two data tables in the prior art.
  • Fig. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention.
  • Fig. 8 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention.
  • Stage1 and Stage2 each include the following steps in sequence: range (Range), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and sort (Sort); Stage3 includes the sort-merge join (SortMergeJoin).
  • Stage1 includes the following steps: range (Range), filter (Filter), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and broadcast exchange (BroadcastExchange);
  • Stage2 includes the following steps: range (Range), filter (Filter), projection (Project), exchange (Exchange), and custom shuffle read (CustomShuffleReader);
  • Stage3 and Stage4 each include range (Range), filter (Filter), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and sort (Sort);
  • Stage5 includes the following steps: broadcast hash join (BroadcastHashJoin), sort-merge join (SortMergeJoin), union (Union), and adaptive Spark execution plan (AdaptiveSparkPlan).
  • Table 1-1 and Table 4-1 shown in Figure 5 are joined: usually, Table 1-1 is processed by Stage1 and Table 4-1 by Stage2, after which SortMergeJoin is performed in Stage3 to realize a Reduce Join.
  • Table 1-1-1 is split into multiple next-level sub-tables, and correspondingly Table 4-1-2 is also split into multiple sub-tables. Then, for each group of sub-tables (a next-level sub-table of Table 1-1-1 and the corresponding next-level sub-table of Table 4-1-2), after processing by Stage1 and Stage2, BroadcastHashJoin is performed to realize multiple Map Joins. Finally, through the processing of Stage7, the multiple tables obtained by BroadcastHashJoin and the table obtained by SortMergeJoin are merged, and then the adaptive Spark execution plan is executed.
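The overall split strategy in this figure description (skewed keys joined via broadcast-style Map Joins, the remaining keys via a sort-based Reduce Join, with the results unioned at the end) can be illustrated with a small, self-contained Python simulation. Tables are modeled as lists of (key, value) rows; this is a sketch of the idea, not Spark's BroadcastHashJoin or SortMergeJoin code.

```python
def skew_aware_join(left, right, skewed_keys):
    """Split both tables by skewed keys, join the skewed slice with a
    broadcast-style hash map (stand-in for Map Join/BroadcastHashJoin),
    join the rest after sorting (stand-in for SortMergeJoin), then
    union the two results, mirroring the final merge stage."""
    skewed = set(skewed_keys)
    left_skew = [r for r in left if r[0] in skewed]
    left_rest = [r for r in left if r[0] not in skewed]
    right_skew = [r for r in right if r[0] in skewed]
    right_rest = [r for r in right if r[0] not in skewed]

    # "Map Join": hash the (small) skewed slice of the right table.
    broadcast = {}
    for k, v in right_skew:
        broadcast.setdefault(k, []).append(v)
    map_joined = [(k, lv, rv) for k, lv in left_skew
                  for rv in broadcast.get(k, [])]

    # "Reduce Join": sort both non-skewed slices, then merge by key.
    index = {}
    for k, v in sorted(right_rest):
        index.setdefault(k, []).append(v)
    reduce_joined = [(k, lv, rv) for k, lv in sorted(left_rest)
                     for rv in index.get(k, [])]

    # Union of the two joined halves gives the final table.
    return map_joined + reduce_joined
```

In the actual plan the two halves run as separate stages, so the heavy skewed keys no longer gate the shuffle-based join.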
  • In the prior-art processing mode, the Spark task running time is 5.5 minutes; in the embodiment of the present invention, that is, in the optimized processing mode shown in Figure 7 or Figure 8, the Spark task running time is 2.1 minutes, and the overall performance is improved by 60%.
  • Step 404 in this embodiment is similar to step 303 in the above embodiment, and will not be repeated here.
  • the data skew processing method provided by the embodiment of the present invention separates the skewed keys from the non-skewed keys by modifying the logical execution plan, so that when Spark executes the physical execution plan, a Map Join is performed on the first sub-table to be joined corresponding to each skewed key, which greatly reduces the running time.
  • This avoids the situation where the long running time of a skewed key drags down the running efficiency of the entire Spark task. Moreover, this method has no scenario restrictions, which solves the data skew problem at its root.
  • FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention. As shown in FIG. 9, on the basis of the above embodiments, for example, the embodiment shown in FIG. 3, the generation process of the logical execution plan in this embodiment and the process of generating the physical execution plan from the modified logical execution plan are described in detail.
  • the method includes:
  • The logical execution plan is mainly a series of abstract conversions; no executors or drivers are involved, it simply translates the user's set of expressions into the most optimized version. Specifically, the user's code is first converted into an unresolved logical execution plan (Unresolved Logical Plan). It is called unresolved because it is not necessarily correct:
  • the table names or column names referenced by the plan may or may not exist. Spark then uses the catalog (Catalog), a metadata repository containing all data tables and DataFrames, in the analyzer (Analyzer) to resolve and verify the referenced table and column names.
  • After this, the resolved logical execution plan (Resolved Logical Plan) is obtained. A query node (Query Node) is added to the logical execution plan so that the data tables to be joined can be queried through it.
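The analysis step sketched above (checking an unresolved plan's table and column references against the catalog) can be illustrated as follows. This is a toy model with assumed names, not Spark's actual Analyzer or Catalog API:

```python
def resolve_plan(unresolved_refs, catalog):
    """unresolved_refs: list of (table, column) references from the
    unresolved logical plan; catalog: dict mapping table name to its
    set of column names. Returns the references once verified, or
    raises if a name does not exist (the plan stays unresolved)."""
    resolved = []
    for table, column in unresolved_refs:
        if table not in catalog:
            raise ValueError("unknown table: " + table)
        if column not in catalog[table]:
            raise ValueError("unknown column: %s.%s" % (table, column))
        resolved.append((table, column))
    return resolved
```

Only after every reference is verified against the catalog does the plan count as resolved, which is why the query node is added to the resolved plan.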
  • The logical execution plan is modified according to the skewed keys and the data skew policy, so that the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join).
  • Step 904 and step 905 in this embodiment are similar to step 301 and step 302 in the above embodiment, and will not be repeated here.
  • After the logical execution plan is modified, it needs to be updated so that subsequent steps can apply the updated logical execution plan. The modified logical execution plan is passed to the Catalyst Optimizer, which generates an optimized logical execution plan through a series of optimizations. Spark then translates this logical execution plan into a physical execution plan, checking for feasible optimization strategies along the way.
  • The physical execution plan determines how to execute the logical plan on the cluster by generating different physical execution alternatives and comparing them through a cost model.
  • Once Spark chooses a physical plan, it runs all code on Spark's underlying programming interface, the RDD. At runtime Spark performs further optimizations, generates native Java bytecode that can optimize tasks or stages during execution, and finally returns the results to the user.
  • When Spark executes the physical execution plan, a Map Join can be performed on the first sub-table to be joined corresponding to the split skewed keys, which can greatly save running time. This avoids many problems caused by data skew at the root, reduces scenario restrictions, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention.
  • the data skew processing device 100 includes: a query module 1001 , a modification module 1002 and a generation module 1003 .
  • the query module 1001 is configured to query the data table to be joined through the query node in the logical execution plan to obtain the skewed keys (Key).
  • a modification module 1002, configured to modify the logical execution plan according to the skewed keys and the data skew policy, so that the first sub-table to be joined corresponding to the skewed keys performs a map-side join (Map Join).
  • the generating module 1003 is configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  • the query module is specifically configured to: for each key of the data table to be joined, compare the amount of data corresponding to the key with a first preset threshold; if the amount of data corresponding to the key is greater than the first preset threshold, determine the key as a skewed key.
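A minimal sketch of this per-key threshold check (hypothetical function name; the real query module would work from table statistics rather than raw rows):

```python
from collections import Counter

def find_skewed_keys(rows, first_threshold):
    """rows: iterable of (key, value) pairs from the table to be joined.
    A key whose row count exceeds the first preset threshold is
    determined to be a skewed key."""
    counts = Counter(k for k, _ in rows)
    return {k for k, n in counts.items() if n > first_threshold}
```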
  • the modification module is specifically configured to:
  • the modification module is specifically configured to:
  • the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
  • the modification module is specifically configured to: for each skewed key, cause the first sub-table to be joined corresponding to that skewed key to perform a Map Join when Spark executes the physical execution plan.
  • the modification module is further configured to:
  • perform a Map Join on the first sub-table to be joined corresponding to each group when Spark executes the physical execution plan.
  • the device further includes:
  • a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan.
  • a parsing module, configured to parse the unresolved logical execution plan to obtain the logical execution plan.
  • a creation module, configured to add the query node to the logical execution plan.
  • the generation module is specifically configured to:
  • the Map Join is a broadcast hash join (BroadcastHashJoin)
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • the data skew processing device provided by the embodiment of the present invention can be used to execute the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 11 is a schematic diagram of the hardware structure of a data skew processing device provided by an embodiment of the present invention.
  • the device may be a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • Device 110 may include one or more of the following components: processing component 1101 , memory 1102 , power supply component 1103 , input/output (I/O) interface 1104 , and communication component 1106 .
  • Processing component 1101 generally controls the overall operations of device 110, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 1101 may include one or more processors 1105 to execute instructions to complete all or part of the steps of the above method. Additionally, the processing component 1101 may include one or more modules to facilitate interaction between the processing component 1101 and other components. For example, the processing component 1101 may include a multimedia module to facilitate interaction between a multimedia component and the processing component 1101.
  • Memory 1102 is configured to store various types of data to support operations at device 110 . Examples of such data include instructions for any application or method operating on device 110, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 1102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 1103 provides power to various components of the device 110 .
  • Power components 1103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 110 .
  • the I/O interface 1104 provides an interface between the processing component 1101 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
  • Communication component 1106 is configured to facilitate wired or wireless communications between device 110 and other devices.
  • the device 110 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 1106 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 1106 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • device 110 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.
  • the present application also provides a computer-readable storage medium, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data skew processing method performed by the above data skew processing device is realized.
  • the above-mentioned computer-readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • A readable storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.
  • An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium.
  • the readable storage medium can also be a component of the processor.
  • the processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC).
  • the processor and the readable storage medium can also exist in the device as discrete components.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the above-mentioned method embodiments; the aforementioned storage medium includes: ROM, RAM, magnetic disk, optical disk, and other media that can store program code.
  • An embodiment of the present invention also provides a computer program product, including a computer program.
  • When the computer program is executed by a processor, the data skew processing method performed by the above data skew processing device is implemented.
  • the embodiment of the present invention also provides a chip for running instructions.
  • the chip includes a memory and a processor. Codes and data are stored in the memory.
  • the memory is coupled to the processor.
  • the processor runs the code in the memory, which enables the chip to execute the data skew processing method performed by the above data skew processing device.
  • An embodiment of the present invention further provides a computer program, which is used to execute the data skew processing method performed by the above data skew processing device when the computer program is executed by a processor.

Abstract

A data skew processing method, a device, a storage medium, and a program product, relating to the field of computer technology. The method comprises: querying, by means of a query node in a logical execution plan, a data table to be joined to obtain the skewed keys; modifying the logical execution plan according to the skewed keys and a data skew policy, so that a first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a Map Join when Spark executes a physical execution plan; and generating the physical execution plan according to the modified logical execution plan. By modifying the logical execution plan, the method prevents at source many problems caused by data skew, thereby reducing usage-scenario limitations, avoiding dependence on statistical information, and improving the comprehensiveness and accuracy of data skew processing.

Description

Data Skew Processing Method, Device, Storage Medium, and Program Product
This application claims priority to Chinese patent application No. 202111139049.1, entitled "Data skew processing method, device, storage medium and program product", filed with the China Patent Office on September 27, 2021, the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular, to a data skew processing method, device, storage medium, and program product.
Background
In a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. The situation where data is insufficiently dispersed, causing a large amount of data to be concentrated on one or a few service nodes, is called data skew. Taking the distributed computing engine Spark as an example, when Spark performs a Shuffle, it needs to pull the records with the same key (Key) from all nodes to a single task on one node for processing. The running progress of the entire Spark job is determined by the longest-running task, so once some keys become skewed, Spark's overall computing efficiency is reduced.
In the prior art, adaptive query execution (Adaptive Query Execution, AQE) technology can be introduced at the engine kernel level. For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution, dynamically discovering the amount of skewed data and splitting skewed partitions into smaller sub-partitions for processing.
However, in the process of implementing the present invention, the inventors found that the prior art has at least the following problems: optimizing data skew through AQE depends on the accuracy of statistical information and only supports some scenarios, for example, only the scenario where a Stage contains a single Join, and is therefore limited.
Summary of the Invention
Embodiments of the present invention provide a data skew processing method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of data skew processing.
In a first aspect, an embodiment of the present invention provides a data skew processing method, including:
querying a data table to be joined through a query node in a logical execution plan to obtain the skewed keys (Key);
modifying the logical execution plan according to the skewed keys and a data skew policy, so that a first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
generating the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
In a possible design, querying the data table to be joined through the query node in the logical execution plan to obtain the skewed keys includes:
for each key of the data table to be joined, comparing the amount of data corresponding to the key with a first preset threshold;
if the amount of data corresponding to the key is greater than the first preset threshold, determining the key as a skewed key.
In a possible design, modifying the logical execution plan according to the skewed keys and the data skew policy includes adding the following processing steps to the logical execution plan:
splitting the data of the data table to be joined according to the skewed keys to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to the non-skewed keys;
merging the joined first sub-table and the joined second sub-table to obtain the final data table.
In a possible design, merging the joined first sub-table and the joined second sub-table includes:
merging the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
In a possible design, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed keys and multiple first sub-tables to be joined, with a one-to-one correspondence between the skewed keys and the first sub-tables to be joined;
performing a Map Join on the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, when Spark executes the physical execution plan includes:
for each skewed key, performing a Map Join on the first sub-table to be joined corresponding to that skewed key when Spark executes the physical execution plan.
In a possible design, there are multiple skewed keys, and there is at least one first sub-table to be joined;
modifying the logical execution plan according to the skewed keys and the data skew policy further includes:
grouping the skewed keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold;
each group corresponds to one first sub-table to be joined;
performing a Map Join on the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, when Spark executes the physical execution plan includes:
for each group, performing a Map Join on the first sub-table to be joined corresponding to the group when Spark executes the physical execution plan.
In a possible design, before querying the data table to be joined through the query node in the logical execution plan, the method further includes:
parsing the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan; parsing the unresolved logical execution plan to obtain the logical execution plan;
adding the query node to the logical execution plan.
In a possible design, generating the physical execution plan according to the modified logical execution plan includes:
updating the modified logical execution plan to obtain an updated logical execution plan;
optimizing the updated logical execution plan to obtain an optimized logical execution plan;
converting the optimized logical execution plan into the physical execution plan.
In a possible design, the second sub-table to be joined performs a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In a second aspect, an embodiment of the present invention provides a data skew processing device, including:
a query module, configured to query the data table to be joined through a query node in a logical execution plan to obtain the skewed keys (Key);
a modification module, configured to modify the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
a generation module, configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
In a possible design, the query module is specifically configured to:
for each key of the data table to be joined, compare the amount of data corresponding to the key with the first preset threshold;
if the amount of data corresponding to the key is greater than the first preset threshold, determine the key as a skewed key.
In a possible design, the modification module is specifically configured to add the following processing steps to the logical execution plan:
splitting the data of the data table to be joined according to the skewed keys to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to the non-skewed keys;
merging the joined first sub-table and the joined second sub-table to obtain the final data table.
In a possible design, the modification module is specifically configured to:
merge the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
In a possible design, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed keys and multiple first sub-tables to be joined, with a one-to-one correspondence between the skewed keys and the first sub-tables to be joined; the modification module is specifically configured to: for each skewed key, perform a Map Join on the first sub-table to be joined corresponding to that skewed key when Spark executes the physical execution plan.
In a possible design, there are multiple skewed keys, and there is at least one first sub-table to be joined; the modification module is further configured to:
group the skewed keys to obtain multiple groups, where the total data volume of each group is less than the second preset threshold, and each group corresponds to one first sub-table to be joined;
for each group, perform a Map Join on the first sub-table to be joined corresponding to the group when Spark executes the physical execution plan.
In a possible design, the device further includes:
a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
a parsing module, configured to parse the unresolved logical execution plan to obtain the logical execution plan;
a creation module, configured to add the query node to the logical execution plan.
In a possible design, the generation module is specifically configured to:
update the modified logical execution plan to obtain an updated logical execution plan;
optimize the updated logical execution plan to obtain an optimized logical execution plan;
convert the optimized logical execution plan into the physical execution plan.
In a possible design, the second sub-table to be joined performs a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In a third aspect, an embodiment of the present invention provides a data skew processing device, including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the first aspect and its various possible designs.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the first aspect and its various possible designs.
In a fifth aspect, an embodiment of the present invention provides a computer program product, including a computer program which, when executed by a processor, implements the method described in the first aspect and its various possible designs.
第六方面,本发明实施例提供了一种运行指令的芯片,所述芯片包括存储器、处理器,所述存储器中存储代码和数据,所述存储器与所述处理器耦合,所述处理器运行所述存储器中的代码使得所述芯片用于执行上述第一方面以及第一方面各种可能的设计所述的方法。In a sixth aspect, an embodiment of the present invention provides a chip for running instructions, the chip includes a memory and a processor, codes and data are stored in the memory, the memory is coupled to the processor, and the processor runs The codes in the memory enable the chip to execute the method described in the above first aspect and various possible designs of the first aspect.
In a seventh aspect, an embodiment of the present invention provides a computer program which, when executed by a processor, performs the method described in the first aspect above and its various possible designs.
In the data skew processing method, device, storage medium, and program product provided by this embodiment, the method queries the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys, and modifies the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-tables to be joined, which are split out of the data tables to be joined and correspond to the skewed keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; the physical execution plan is then generated from the modified logical execution plan and executed by Spark. By modifying the logical execution plan, the data skew processing method provided by this embodiment upgrades the handling of skewed keys from a Reduce Join to a Map Join, which avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a join operation on two data tables, provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a join operation performed by shattering a skewed key, provided by the prior art;
Fig. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention;
Fig. 5 is a schematic flowchart of a join operation on two data tables, provided by another embodiment of the present invention;
Fig. 6 is a directed acyclic graph of a join operation on two data tables, provided by the prior art;
Fig. 7 is a directed acyclic graph of a join operation on two data tables, provided by another embodiment of the present invention;
Fig. 8 is a directed acyclic graph of a join operation on two data tables, provided by yet another embodiment of the present invention;
Fig. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of the hardware structure of a data skew processing device provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. When a large volume of data is insufficiently dispersed, large amounts of data become concentrated on one or a few service nodes; this is called data skew.
When the distributed computing engine Spark performs a shuffle, it needs to pull the records with the same key on each node to a single task on some node for processing, for example for key-based aggregation or join operations. If the amount of data corresponding to a certain key is particularly large, data skew occurs. For example, if most keys correspond to 10 records but an individual key corresponds to 1 million records, then most tasks are assigned only 10 records and finish in a few seconds, while the individual task assigned the 1 million records may need to run for an hour or two. The overall progress of a Spark job is therefore determined by its longest-running task.
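The effect described above can be sketched with a toy calculation (pure Python, not Spark code; the per-record cost is a hypothetical constant chosen only for illustration):

```python
# Toy model: each task's running time is proportional to the number of
# records it processes; the job finishes only when the slowest task does.

def job_time(records_per_task, cost_per_record=0.001):
    """Job completion time = running time of the slowest task."""
    return max(n * cost_per_record for n in records_per_task)

# 9 tasks with 10 records each, plus 1 task holding a skewed key's
# 1,000,000 records: the single skewed task dominates the whole job.
print(job_time([10] * 9 + [1_000_000], cost_per_record=1))
```

With a unit cost, the nine small tasks would each finish in 10 time units, yet the job takes 1,000,000 — exactly the imbalance the paragraph describes.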
Fig. 1 is a schematic flowchart of a join operation on two data tables, provided by an embodiment of the present invention. As shown in Fig. 1, Table 1-1 holds students' math competition scores: the first column contains student ID numbers and the second column contains the math competition score for each ID. Table 2-1 holds students' English competition scores: the first column contains student ID numbers and the second column contains the English competition score for each ID. It can be seen from the two tables that the student with ID 001 has many score records in Table 1-1. When the two tables are joined, each ID acts as a key, and the amount of data corresponding to Key 001 is obviously large, so processing the task corresponding to Key 001 takes more time than the other keys. Key 001 can therefore be said to suffer data skew. Of course, this is only an illustrative example for understanding data skew more concretely; in practical applications, the data-volume criterion for judging a skewed key can be set as needed.
In the prior art, two approaches are generally used to deal with the above data skew problem.
One approach handles it at the application layer: the skewed key is broken up using techniques such as Rand, i.e., a random suffix is appended to the skewed key so that the originally skewed data is shattered. As shown in Fig. 2, shattering Key 001 in Table 1-1 and Table 2-1 with the Rand technique yields the keys 001-1, 001-2, and 001-3 in Table 1-2 and Table 2-2, and the shattered Tables 1-2 and 2-2 are then joined on the shattered keys. However, this approach, on the one hand, breaks the original business logic and often complicates a simple problem; on the other hand, once a fetch failure (Fetch Failure) occurs and the data must be recomputed — taking Key 001 in Fig. 2 as an example — the shattering must be redone, and a record mapped to 001-1 in one round may become 001-5 in the next, so that the same record is assigned to different data partitions, ultimately causing data duplication.
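The application-layer salting just described can be sketched in pure Python (illustrative only — a real Spark job would express this with a `rand()`-based column; the fan-out of 3 mirrors the suffixes 001-1 to 001-3 of Fig. 2, and all row contents are hypothetical). The side with the skewed key gets a random suffix, and the other side must replicate each skewed-key row once per possible suffix so the join still matches:

```python
import random

SALT_BUCKETS = 3  # illustrative fan-out for the skewed key

def salt_large_side(rows, skewed_keys):
    """Append a random suffix to skewed keys (e.g. '001' -> '001-2')."""
    out = []
    for key, value in rows:
        if key in skewed_keys:
            key = f"{key}-{random.randrange(1, SALT_BUCKETS + 1)}"
        out.append((key, value))
    return out

def salt_small_side(rows, skewed_keys):
    """Replicate each skewed-key row once per suffix so joins still match."""
    out = []
    for key, value in rows:
        if key in skewed_keys:
            out.extend((f"{key}-{i}", value) for i in range(1, SALT_BUCKETS + 1))
        else:
            out.append((key, value))
    return out
```

The drawback the paragraph notes is visible here: `salt_large_side` draws a fresh random suffix on every call, so recomputing after a fetch failure can assign the same record a different suffix and hence a different partition.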
The other approach handles it at the engine kernel layer: the Spark kernel introduces Adaptive Query Execution (AQE), which uses runtime statistics to automatically optimize query execution for the above data skew problem, dynamically detecting the amount of skewed data and splitting skewed partitions into smaller sub-partitions for processing. However, with this approach, on the one hand, AQE strongly depends on runtime statistics, and inaccurate statistics lead to skew being misjudged or missed; on the other hand, AQE only supports the scenario of a single join within one stage — multi-table join scenarios are not supported, and scenarios with a shuffle after the join do not trigger AQE's skew-optimization logic either; furthermore, AQE-based skew handling is by design partition-granularity skew governance and cannot help when the skewed data is produced by the same Mapper.
It can thus be seen that both the application-layer optimization and the AQE-based optimization in the Spark kernel have their drawbacks. Regarding the above technical problems, the inventor found through research that AQE is an optimization of the physical execution plan (Physical Plan). The overall Spark SQL execution pipeline mainly includes a logical execution plan (Logical Plan) and a physical execution plan (Physical Plan), the latter being derived from the former. In other words, if data skew is optimized starting from the logical execution plan, the data skew problem can be solved at its root, avoiding the drawbacks of the application layer and the physical-execution-plan layer. Based on this, an embodiment of the present invention provides a data skew processing method that modifies the logical execution plan to partition the data effectively and upgrades skewed keys from a Reduce Join to a Map Join to improve data processing capability, thereby avoiding at the root the many problems caused by data skew, reducing scenario restrictions, removing the dependence on statistics, and improving the comprehensiveness and accuracy of data skew handling.
The technical solution of the present invention is described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention. As shown in Fig. 3, the method includes:
301. Query the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys.
In this embodiment, the data tables to be joined include at least two data tables. For example, they may include a first data table and a second data table, which are queried separately through the query node in the logical execution plan; they may also include three or more data tables, such as a first, second, and third data table, which can be queried in turn during the query process. This embodiment does not limit the number of data tables to be joined or the query order.
Optionally, in some embodiments, querying the data tables to be joined through the query node in the logical execution plan to obtain the skewed keys includes: for each key of the data tables to be joined, comparing the amount of data corresponding to the key with a first preset threshold; and if the amount of data corresponding to the key is greater than the first preset threshold, determining the key to be a skewed key.
Specifically, when querying each data table of the data tables to be joined through the query node in the logical execution plan, the query can be performed per key. As shown in Fig. 1, assume Table 1-1 and Table 2-1 are the data tables to be joined. In one implementation, Key 001 in Table 1-1 can be queried first, followed in turn by the remaining keys such as Key 002 and Key 003, and then Key 001 in Table 2-1, followed in turn by the remaining keys such as Key 002 and Key 003. In another implementation, the record counts of the different keys can first be sorted into a sequence, and the keys queried in that order from the largest count to the smallest. This helps find skewed keys as early as possible, and once the count becomes too small to qualify as a skewed key, the query can be terminated early to save computation. Which query method to use can be determined according to actual needs and is not limited in this embodiment.
When each key is queried, the amount of data corresponding to the key can be compared with the first preset threshold. If the amount of data corresponding to the key is greater than the first preset threshold, the key is determined to be a skewed key. The amount of data here may be the size of the data — e.g., how many megabytes or gigabytes — or the number of records. The first preset threshold may be a fixed value determined empirically.
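The per-key threshold check can be sketched in pure Python. The patent only specifies comparing each key's data volume against the first preset threshold; counting records with a hash-based counter is one assumed realization (byte size, the other measure mentioned, would work the same way):

```python
from collections import Counter

def find_skewed_keys(rows, first_threshold):
    """Return the keys whose record count exceeds the first preset threshold.

    `rows` is an iterable of (key, value) pairs; `first_threshold` plays
    the role of the first preset threshold in the text.
    """
    counts = Counter(key for key, _ in rows)
    return {key for key, n in counts.items() if n > first_threshold}
```

Mirroring Fig. 1, a table where Key 001 has five records while the other keys have one each would report only 001 as skewed for a threshold of, say, 3.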
302. Modify the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-tables to be joined — split out of the data tables to be joined and corresponding to the skewed keys — undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
In this embodiment, the logical execution plan is modified based on the skewed keys and the data skew policy, so that the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed keys, can undergo a map-side join when the physical execution plan is subsequently executed by the Spark computing engine.
Both the map-side join (Map Join) and the reduce-side join (Reduce Join) are ways to perform the join of the data tables to be joined, i.e., a merge of data from different data sources. In a Reduce Join, the data is tagged in the Map phase and merged in the Reduce phase. In a Map Join, the data is merged directly in the Map phase, with no Reduce phase.
Specifically, taking Table 1-1 and Table 2-1 shown in Fig. 1 as an example of the Reduce Join process: in the Map phase of the Reduce Join, the input data is uniformly wrapped in a Bean that contains all the common and non-common attributes of Table 1-1 and Table 2-1 — equivalent to a full outer join — plus one new attribute, the file name, to distinguish whether a record comes from Table 1-1 or Table 2-1, which facilitates processing in the Reduce phase; the Key output by the Map is the student ID and the Value is the Bean. In the Shuffle phase, the Beans are sorted by ID, and all records with the same ID are aggregated under the same key and sent to the same Reduce task. In the Reduce phase, for all the Beans under the same ID, the source — Table 1-1 or Table 2-1 — must first be distinguished. If, instead, a Map Join is performed on Table 1-1 and Table 2-1, there is no Reduce phase: all the work is done in the Map phase, which greatly reduces the cost of network transmission and I/O. In a concrete implementation, Table 1-1 or Table 2-1 — say Table 2-1 — can be cached in advance on each Map task node; when the data of Table 1-1 arrives, it is joined directly against the pre-cached data of Table 2-1 and the result is output.
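The map-side join just described — the small table pre-cached on each map task and joined without any shuffle or reduce phase — can be sketched in pure Python (an illustrative stand-in for the Spark mechanism, not Spark code; all table contents are hypothetical):

```python
def map_join(stream_rows, cached_small_table):
    """Map-side join: the small table is pre-cached on each map task as a
    key -> [values] lookup, so joining the streamed-in large table needs
    no shuffle and no reduce phase."""
    return [
        (key, left_val, right_val)
        for key, left_val in stream_rows
        for right_val in cached_small_table.get(key, [])
    ]
```

For example, with the English-score table cached as `{"001": [85, 92]}`, streaming the math-score rows through it emits one joined row per matching pair for Key 001 and drops keys with no match, all within the map phase.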
In this embodiment, taking Table 1-1 and Table 2-1 in Fig. 1 as an example, if the skewed key is 001, the first sub-tables to be joined include a first sub-data-table corresponding to 001 from Table 1-1 and a second sub-data-table corresponding to 001 from Table 2-1. That the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed key, undergo a Map Join when the distributed computing engine Spark executes the physical execution plan may mean that, when Spark executes the physical execution plan, the first sub-data-table and the second sub-data-table are map-joined. Specifically, the first or second sub-data-table — say the second — can be cached in advance on each Map task node; when the data of the first sub-data-table arrives, it is joined directly against the pre-cached data of the second sub-data-table and the result is output.
303. Generate the physical execution plan from the modified logical execution plan, so that the physical execution plan is executed by Spark.
In this embodiment, Spark's execution of the APIs in the code mainly includes the following steps: first, DataFrame/Dataset/SQL code is written; second, if the written code has no errors, Spark converts the code into a logical execution plan; third, Spark applies a series of optimizations to the generated logical execution plan and converts the optimized logical execution plan into a physical execution plan; finally, Spark executes the physical execution plan, i.e., performs a series of operations on Resilient Distributed Datasets (RDDs) on the cluster.
In this embodiment, after the logical execution plan has been modified according to the skewed keys and the data skew policy, the physical execution plan can be generated from the modified logical execution plan. Of course, the logical execution plan to be modified may be the logical execution plan after optimization or before optimization, which is not limited in this embodiment.
In the data skew processing method provided by this embodiment, the data tables to be joined are queried through a query node in the logical execution plan to obtain the skewed keys; the logical execution plan is modified according to the skewed keys and a data skew policy, so that the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and the physical execution plan is generated from the modified logical execution plan and executed by Spark. By modifying the logical execution plan, the data skew processing method provided by this embodiment upgrades the handling of skewed keys from a Reduce Join to a Map Join, which avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
Fig. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention. As shown in Fig. 4, on the basis of the above embodiments — for example the embodiment shown in Fig. 3 — this embodiment describes in detail how the logical execution plan is modified. The method includes:
401. Query the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys.
In this embodiment, step 401 is similar to step 301 in the above embodiment and is not repeated here.
402. Add to the logical execution plan a processing step of splitting the data of the data tables to be joined according to the skewed keys, to obtain the first sub-tables to be joined and the second sub-tables to be joined corresponding to the non-skewed keys, so that the first sub-tables to be joined — split out of the data tables to be joined and corresponding to the skewed keys — undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
403. Add to the logical execution plan a processing step of merging the joined first sub-tables and the joined second sub-tables to obtain the final data table, yielding the modified logical execution plan.
In some embodiments, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table. That is, any data table among the data tables to be joined may be determined to contain a skewed key. As shown in Fig. 1, only one of the two data tables contains a skewed key, namely Key 001 in Table 1-1. As shown in Fig. 5, both data tables contain skewed keys: the skewed key in Table 1-1 is 001 and the skewed key in Table 4-1 is 003. This embodiment does not limit the source or the number of skewed keys.
To illustrate the modification of the logical execution plan in this embodiment more concretely, the process of joining the data tables to be joined, Table 1-1 and Table 4-1, is described below as an example with reference to Fig. 5.
As shown in Fig. 5, the skewed key in Table 1-1 is 001 and the skewed key in Table 4-1 is 003. Therefore, splitting Table 1-1 yields Table 1-1-1, which contains the data corresponding to the skewed keys 001 and 003, and Table 1-1-2, which contains the non-skewed key 002; splitting Table 4-1 yields Table 4-1-1, which contains the data corresponding to the skewed keys 001 and 003, and Table 4-1-2, which contains the non-skewed key 002. In other words, after the data tables to be joined, Table 1-1 and Table 4-1, are split according to the skewed keys, first sub-tables to be joined consisting of Table 1-1-1 and Table 4-1-1, and second sub-tables to be joined consisting of Table 1-1-2 and Table 4-1-2, are obtained. Table 1-1 and Table 4-1 in Fig. 5 are only illustrative: to explain the table-join process, only three keys, 001 to 003, are shown. In real data tables, the number of keys can reach the tens of thousands or tens of millions — that is, there will be many non-skewed keys while the skewed keys account for a very small fraction. In this case, Table 1-1-1 and Table 4-1-1 among the first sub-tables to be joined both act as small tables on which a Map Join can be performed when Spark executes the physical execution plan, while Table 1-1-2 and Table 4-1-2 among the second sub-tables to be joined both act as large tables on which a Reduce Join can be performed when Spark executes the physical execution plan. For the implementation of the Map Join and the Reduce Join, reference may be made to the description of step 302, which is not repeated here.
After the Map Join of the first sub-tables to be joined, Table 1-1-1 and Table 4-1-1, Table 5-1 is obtained; after the Reduce Join of the second sub-tables to be joined, Table 1-1-2 and Table 4-1-2, Table 5-2 is obtained; and merging Table 5-1 and Table 5-2 yields the final data table, Table 5. In some embodiments, merging the joined first sub-tables and the joined second sub-tables includes: merging the joined first sub-tables and the joined second sub-tables through a Union operator to obtain the final data table.
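The split-join-union plan of Fig. 5 can be sketched as a minimal pure-Python simulation (illustrative only — in Spark the two branches would run as BroadcastHashJoin and SortMergeJoin stages; here both are simulated with in-memory hash lookups, and the table contents are hypothetical):

```python
def skew_aware_join(left, right, skewed_keys):
    """Sketch of the modified plan: split both inputs on the skewed keys,
    map-join the skewed slice, join the rest with an ordinary hash join
    (standing in for the shuffle-based Reduce Join), then union."""
    left_skew = [(k, v) for k, v in left if k in skewed_keys]
    left_rest = [(k, v) for k, v in left if k not in skewed_keys]
    right_skew = [(k, v) for k, v in right if k in skewed_keys]
    right_rest = [(k, v) for k, v in right if k not in skewed_keys]

    # "Map Join" branch: build a lookup over the (small) skewed slice.
    lookup = {}
    for k, v in right_skew:
        lookup.setdefault(k, []).append(v)
    joined_skew = [(k, lv, rv) for k, lv in left_skew
                   for rv in lookup.get(k, [])]

    # "Reduce Join" branch, simulated here with the same hash-join logic.
    lookup = {}
    for k, v in right_rest:
        lookup.setdefault(k, []).append(v)
    joined_rest = [(k, lv, rv) for k, lv in left_rest
                   for rv in lookup.get(k, [])]

    return joined_skew + joined_rest  # the final Union
```

The key correctness property is that the union of the two branches equals a plain join of the full tables — the split changes only how the work is distributed, not the result.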
In some embodiments, the second sub-tables to be joined undergo a reduce-side join (Reduce Join) when Spark executes the physical execution plan. Optionally, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In practical applications, there are multiple ways for the first sub-tables to be joined to undergo a Map Join. After the logical execution plan has been modified in this embodiment, the ways in which the first sub-tables to be joined undergo a Map Join when Spark executes the physical execution plan are illustrated below with reference to Figs. 7 and 8, and the advantages before and after adopting the data skew processing method provided by this embodiment are compared with reference to Fig. 6.
To ensure that every first to-be-joined sub-table corresponding to a skewed Key can be processed with a Map Join, in some embodiments there are multiple skewed Keys and multiple first to-be-joined sub-tables, with a one-to-one correspondence between the skewed Keys and the first to-be-joined sub-tables. Performing a Map Join on the first to-be-joined sub-tables, split from the to-be-joined data tables and corresponding to the skewed Keys, when Spark executes the physical execution plan then includes: for each skewed Key, performing a Map Join on the first to-be-joined sub-table corresponding to that Key when Spark executes the physical execution plan. As shown in FIG. 8, taking the to-be-joined data tables Table 1-1-1 and Table 4-1-1 in FIG. 5, which include two skewed Keys, as an example, each skewed Key corresponds to one first to-be-joined sub-table. That is, when Table 1-1 is split, two first to-be-joined sub-tables are obtained: one contains only the data for Key001, and the other contains only the data for Key002.
In the subsequent Map Join process, as shown in FIG. 8, Stage1 and Stage2 process one first to-be-joined sub-table, and Stage3 and Stage4 process the other; in both cases the Map Join for the skewed Keys is realized through BroadcastHashJoin. In the directed acyclic graph shown in FIG. 8, Stage5 and Stage6 process the second to-be-joined sub-table; after the processing of Stage5 and Stage6, the Reduce Join for the non-skewed Keys is realized through SortMergeJoin. In Stage7, the two first to-be-joined sub-tables on which BroadcastHashJoin was performed and the second to-be-joined sub-table on which SortMergeJoin was performed are merged with a Union to obtain the final data table.
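The split, per-Key Map Join, Reduce Join, and final Union described above can be sketched in plain Python. This is a hypothetical toy model for illustration only, not Spark's actual implementation; the table contents and Key names are made up.

```python
# Toy model of the split / Map Join / Reduce Join / Union strategy.
# A table is a list of (key, value) rows.

def split_by_skew(rows, skewed_keys):
    """Split a table into one sub-table per skewed Key plus a remainder."""
    skew_tables = {k: [r for r in rows if r[0] == k] for k in skewed_keys}
    rest = [r for r in rows if r[0] not in skewed_keys]
    return skew_tables, rest

def broadcast_hash_join(big, small):
    """Map-side join: build a hash map of the small side, probe with the big side."""
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    return [(k, v, w) for k, v in big for w in lookup.get(k, [])]

def sort_merge_join(left, right):
    """Reduce-side join stand-in for the non-skewed Keys (sorted inputs)."""
    left, right = sorted(left), sorted(right)
    out = []
    for k, v in left:
        out.extend((k, v, w) for kk, w in right if kk == k)
    return out

def join_with_skew_handling(left, right, skewed_keys):
    left_skew, left_rest = split_by_skew(left, skewed_keys)
    right_skew, right_rest = split_by_skew(right, skewed_keys)
    result = []
    for k in skewed_keys:  # one Map Join per skewed Key
        result += broadcast_hash_join(left_skew[k], right_skew[k])
    result += sort_merge_join(left_rest, right_rest)  # Reduce Join for the rest
    return result  # the concatenation plays the role of the Union

left = [("001", "a"), ("001", "b"), ("002", "c"), ("004", "d")]
right = [("001", "x"), ("002", "y"), ("004", "z")]
print(sorted(join_with_skew_handling(left, right, ["001", "002"])))
```

The key point the sketch captures is that the skewed rows never pass through the shuffle-based join path: each skewed Key is joined entirely on the map side before the two result sets are unioned.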
To reduce the computational cost of the merge operation and to reduce the number of compute nodes used to process the first to-be-joined sub-tables, in some embodiments there are multiple skewed Keys and at least one first to-be-joined sub-table. Modifying the logical execution plan according to the skewed Keys and the data skew strategy further includes: grouping the skewed Keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold and each group corresponds to one first to-be-joined sub-table. Performing a Map Join on the first to-be-joined sub-tables, split from the to-be-joined data tables and corresponding to the skewed Keys, when Spark executes the physical execution plan then includes: for each group, performing a Map Join on the first to-be-joined sub-table corresponding to that group when Spark executes the physical execution plan.
Taking the to-be-joined data tables Table 1-1-1 and Table 4-1-1 in FIG. 5, which include two skewed Keys, as an example, the skewed Keys 001 and 003 can be grouped based on the second preset threshold. For example, assuming the sum of the data volumes corresponding to 001 and 003 is less than the second preset threshold, 001 and 003 can be placed in one group as shown in FIG. 5, and the to-be-joined data table has exactly this one group. As shown in FIG. 7, one first to-be-joined sub-table is obtained for this group; after the processing of Stage1 and Stage2, a BroadcastHashJoin is performed on this sub-table to realize its Map Join. In the directed acyclic graph shown in FIG. 7, Stage3 and Stage4 process the second to-be-joined sub-table; after their processing, the Reduce Join for the non-skewed Keys is realized through SortMergeJoin. In Stage5, the first to-be-joined sub-table on which BroadcastHashJoin was performed and the second to-be-joined sub-table on which SortMergeJoin was performed are merged with a Union to obtain the final data table. In this embodiment, the second preset threshold may be set based on experience, which is not limited here.
It can be understood that if the to-be-joined data table contains multiple skewed Keys, for example 100, these 100 skewed Keys can be grouped based on the second preset threshold in various ways. In one implementation, the skewed Keys are sorted by Key number, and the data volume of the first Key in the ordering is compared against the second preset threshold. If it is below the threshold, the combined data volume of the first and second Keys is compared against the threshold; if that is still below the threshold, the combined data volume of the first, second, and third Keys is compared, and so on, until the threshold is exceeded, at which point the Keys before the N-th, i.e., up to the (N-1)-th, are placed into one group, and the same procedure continues from the N-th Key until all Keys in the ordering have been traversed. In another implementation, the skewed Keys may be sorted by the data volume corresponding to each Key and then grouped based on the second preset threshold. This is not limited in this embodiment and can be chosen according to actual needs.
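The first grouping scheme described above (accumulate Keys in order until the running total would reach the threshold, then start a new group) can be sketched as follows. The Key names, data volumes, and threshold are hypothetical illustrations.

```python
def group_skewed_keys(key_sizes, threshold):
    """Greedily group skewed Keys so that each group's total data volume
    stays below the threshold. key_sizes is a list of (key, data_volume)
    pairs, already sorted (e.g. by Key number). A single Key whose volume
    alone reaches the threshold still forms its own group."""
    groups, current, total = [], [], 0
    for key, size in key_sizes:
        if current and total + size >= threshold:
            groups.append(current)  # close the group before the N-th Key
            current, total = [], 0
        current.append(key)
        total += size
    if current:
        groups.append(current)
    return groups

# Two skewed Keys whose combined volume fits under the threshold
# end up in a single group, as in the FIG. 5 example.
print(group_skewed_keys([("001", 40), ("003", 30)], threshold=100))
```

Fewer groups mean fewer broadcast sub-tables and fewer Union inputs, which is exactly the cost reduction the grouping step is aiming at.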
FIG. 6 is a directed acyclic graph of a join operation on two data tables in the prior art. FIG. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention. FIG. 8 is a directed acyclic graph of a join operation on two data tables provided by yet another embodiment of the present invention. As shown in FIG. 6, Stage1 and Stage2 each include, in sequence, the following steps: Range, Project, Exchange, CustomShuffleReader, and Sort; Stage3 includes SortMergeJoin. As shown in FIG. 7, Stage1 includes the following steps: Range, Filter, Project, Exchange, CustomShuffleReader, and BroadcastExchange; Stage2 includes the following steps: Range, Filter, Project, Exchange, and CustomShuffleReader; Stage3 and Stage4 each include Range, Filter, Project, Exchange, CustomShuffleReader, and Sort; Stage5 includes the following steps: BroadcastHashJoin, SortMergeJoin, Union, and AdaptiveSparkPlan.
The following again uses Table 1-1 and Table 4-1 shown in FIG. 5 as an example to illustrate FIG. 6 to FIG. 8. As shown in FIG. 6, when two data tables are joined, Table 1-1 is usually processed through Stage1 and Table 4-1 through Stage2, after which SortMergeJoin is performed in Stage3 to realize a Reduce Join. As shown in FIG. 7, Table 1-1-2 is processed through Stage3 and Table 4-1-1 through Stage4, after which SortMergeJoin in Stage5 realizes the Reduce Join, yielding Table 5-2; one of Table 1-1-1 and Table 4-1-2 is processed through Stage1 and the other through Stage2, after which BroadcastHashJoin in Stage5 realizes the Map Join, yielding Table 5-1; finally, Table 5-1 and Table 5-2 are combined with a Union, and the adaptive Spark execution plan is executed. Compared with FIG. 7, FIG. 8 splits the skewed data into multiple groups of sub-tables: when there are too many skewed Keys, Table 1-1-1 is further split into multiple next-level sub-tables,
and Table 4-1-2 is correspondingly split into multiple next-level sub-tables. Then, for each pair of next-level sub-tables (a next-level sub-table of Table 1-1-1 together with the corresponding next-level sub-table of Table 4-1-2), the processing of Stage1 and Stage2 is performed followed by BroadcastHashJoin, realizing multiple Map Joins. Finally, through the processing of Stage7, the tables obtained from the multiple BroadcastHashJoins and the table obtained from SortMergeJoin are merged, and the adaptive Spark execution plan is executed.
Before adopting the embodiment of the present invention, that is, with the pre-optimization processing shown in FIG. 6, the Spark task running time was 5.5 minutes; after adopting the embodiment of the present invention, that is, with the post-optimization processing shown in FIG. 7 or FIG. 8, the Spark task running time was 2.1 minutes, an overall performance improvement of about 60%.
404. Generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
Step 404 in this embodiment is similar to step 303 in the foregoing embodiment and is not repeated here.
In the data skew processing method provided by this embodiment of the present invention, the logical execution plan is modified so that the skewed Keys and the non-skewed Keys are split apart. As a result, when Spark executes the physical execution plan, the first to-be-joined sub-tables corresponding to the skewed Keys can be processed with a Map Join, which greatly reduces the running time. This prevents skewed Keys with excessively long running times from degrading the running efficiency of the entire Spark task. Moreover, this approach has no scenario restrictions and solves the data skew problem at its root.
FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention. As shown in FIG. 9, on the basis of the foregoing embodiments, for example the embodiment shown in FIG. 3, this embodiment describes in detail the generation process of the logical execution plan and the process of generating the physical execution plan from the modified logical execution plan. The method includes:
901. Parse the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan.
902. Resolve the unresolved logical execution plan to obtain a logical execution plan.
903. Add the query node to the logical execution plan.
In this embodiment, the logical execution plan is essentially a series of abstract transformations. It involves no executor or driver; it merely converts the user's set of expressions into an optimal version. Specifically, the user's code is first converted into an unresolved logical execution plan (Unresolved Logical Plan). It is called unresolved because it is not necessarily valid: the table names or column names it references may or may not exist. Spark then uses the Catalog, a metadata repository containing all tables and DataFrames, in the Analyzer to resolve and validate the referenced table names and column names. If the unresolved logical execution plan passes validation, a resolved logical execution plan (Resolved Logical Plan) is obtained. A query node (Query Node) is added to the logical execution plan so that the to-be-joined data tables can be queried through it.
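The resolve-against-Catalog step can be illustrated with a minimal sketch. The table and column names below are hypothetical, and Spark's actual Analyzer is of course far more involved than this toy version.

```python
# Toy Analyzer: an unresolved plan references names that must be
# checked against a Catalog before the plan counts as resolved.

catalog = {"orders": ["order_id", "user_id", "amount"]}  # table -> columns

def analyze(unresolved_plan):
    """Validate the referenced table and columns against the Catalog."""
    table, columns = unresolved_plan["table"], unresolved_plan["columns"]
    if table not in catalog:
        raise ValueError(f"unresolved relation: {table}")
    missing = [c for c in columns if c not in catalog[table]]
    if missing:
        raise ValueError(f"unresolved columns: {missing}")
    return {**unresolved_plan, "resolved": True}

plan = analyze({"table": "orders", "columns": ["user_id", "amount"]})
print(plan["resolved"])
```

A plan referencing a table or column absent from the Catalog would raise an error here, which is the toy analogue of an unresolved plan failing validation.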
904. Query the to-be-joined data tables through the query node in the logical execution plan to obtain the skewed Keys.
905. Modify the logical execution plan according to the skewed Keys and the data skew strategy, so that the first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
Steps 904 and 905 in this embodiment are similar to steps 301 and 302 in the foregoing embodiment and are not repeated here.
906. Update the modified logical execution plan to obtain an updated logical execution plan.
907. Optimize the updated logical execution plan to obtain an optimized logical execution plan.
908. Convert the optimized logical execution plan into a physical execution plan.
Specifically, after the logical execution plan is modified, it needs to be updated so that subsequent steps apply the updated plan. The modified logical execution plan is passed to the Catalyst Optimizer and, after a series of optimizations, an optimized logical execution plan is generated. Spark converts this logical execution plan into a physical execution plan, checking feasible optimization strategies along the way. The physical planning phase generates different candidate physical operations and compares them with a cost model to determine how to execute the logical plan on the cluster. Once a physical plan is selected, Spark runs all the code on its underlying programming interface, the RDD. Spark performs further optimizations at runtime, generating native Java bytecode that can optimize tasks or stages during execution, and finally returns the results to the user.
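The pipeline described above (update the plan, optimize it, then convert it to a physical plan) can be modeled as a chain of plan-to-plan transformations. This is a simplified, hypothetical sketch; Catalyst's real rule engine applies many rules to a rich tree of operators, and the node and operator names below are invented for illustration.

```python
# Toy model of the logical-to-physical planning pipeline.
# A plan is a nested tuple; each phase is a function from plan to plan.

def optimize(plan):
    """Stand-in for Catalyst optimization rules, e.g. collapsing two
    adjacent Filter nodes into one (a classic logical rewrite)."""
    if plan[0] == "Filter" and plan[2][0] == "Filter":
        combined = f"({plan[1]}) AND ({plan[2][1]})"
        return optimize(("Filter", combined, plan[2][2]))
    return plan

def to_physical(plan):
    """Stand-in for physical planning: pick a concrete operator per node."""
    mapping = {"Filter": "FilterExec", "Scan": "FileScanExec"}
    op = mapping[plan[0]]
    if plan[0] == "Scan":
        return (op, plan[1])
    return (op, plan[1], to_physical(plan[2]))

logical = ("Filter", "a > 1", ("Filter", "b < 9", ("Scan", "t")))
physical = to_physical(optimize(logical))
print(physical)
```

The modification described in this application slots into this chain as one more plan-to-plan transformation applied before the optimize step.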
In the data skew processing method provided by this embodiment of the present invention, a query node is added before the logical execution plan is optimized, and the queried skewed Keys are split apart from the non-skewed Keys, so that when Spark executes the physical execution plan, the first to-be-joined sub-tables corresponding to the split skewed Keys can be processed with a Map Join, which greatly saves running time. This avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention. As shown in FIG. 10, the data skew processing device 100 includes a query module 1001, a modification module 1002, and a generation module 1003.
The query module 1001 is configured to query the to-be-joined data tables through the query node in the logical execution plan to obtain the skewed Keys.
The modification module 1002 is configured to modify the logical execution plan according to the skewed Keys and the data skew strategy, so that the first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
The generation module 1003 is configured to generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
The data skew processing device provided by this embodiment of the present invention achieves implementation principles and technical effects similar to those of the foregoing method embodiments.
In a possible design, the query module is specifically configured to:
for each Key of the to-be-joined data tables, compare the data volume corresponding to the Key with a first preset threshold; and
if the data volume corresponding to the Key is greater than the first preset threshold, determine the Key to be a skewed Key.
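The per-Key threshold test applied by the query module can be sketched as follows. This is a hypothetical illustration; in practice the per-Key data volume would be computed by the query node over the to-be-joined tables, and the threshold value is made up.

```python
def find_skewed_keys(rows, first_threshold):
    """Count rows per Key and flag any Key whose data volume
    exceeds the first preset threshold as skewed."""
    counts = {}
    for key, _value in rows:
        counts[key] = counts.get(key, 0) + 1
    return sorted(k for k, n in counts.items() if n > first_threshold)

# Keys 001 and 002 carry far more rows than Key 004, so only they
# exceed the (illustrative) threshold and are reported as skewed.
rows = [("001", None)] * 5 + [("002", None)] * 4 + [("004", None)]
print(find_skewed_keys(rows, first_threshold=3))
```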
In a possible design, the modification module is specifically configured to:
add the following processing steps to the logical execution plan:
split the data of the to-be-joined data tables according to the skewed Keys to obtain the first to-be-joined sub-tables and second to-be-joined sub-tables corresponding to the non-skewed Keys; and
merge the joined first to-be-joined sub-tables and the joined second to-be-joined sub-tables to obtain the final data table.
In a possible design, the modification module is specifically configured to:
merge, through a Union operator, the joined first to-be-joined sub-tables and the joined second to-be-joined sub-tables to obtain the final data table.
In a possible design, the to-be-joined data tables include a first data table and a second data table, and the skewed Keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed Keys and multiple first to-be-joined sub-tables, with a one-to-one correspondence between the skewed Keys and the first to-be-joined sub-tables;
the modification module is specifically configured to: for each skewed Key, cause the first to-be-joined sub-table corresponding to that Key to undergo a Map Join when Spark executes the physical execution plan.
In a possible design, there are multiple skewed Keys and at least one first to-be-joined sub-table; the modification module is further configured to:
group the skewed Keys to obtain multiple groups, each group corresponding to one first to-be-joined sub-table; and
for each group, cause the first to-be-joined sub-table corresponding to that group to undergo a Map Join when Spark executes the physical execution plan.
In a possible design, the device further includes:
a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan;
a resolution module, configured to resolve the unresolved logical execution plan to obtain a logical execution plan; and
a creation module, configured to add the query node to the logical execution plan.
In a possible design, the generation module is specifically configured to:
update the modified logical execution plan to obtain an updated logical execution plan;
optimize the updated logical execution plan to obtain an optimized logical execution plan; and
convert the optimized logical execution plan into a physical execution plan.
In a possible design, the second to-be-joined sub-table undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a BroadcastHashJoin, and the Reduce Join is a SortMergeJoin.
The data skew processing device provided by this embodiment of the present invention can be used to execute the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
图11为本发明一实施例提供的数据倾斜处理设备的硬件结构示意图,该设备可以是计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Fig. 11 is a schematic diagram of the hardware structure of a data tilt processing device provided by an embodiment of the present invention. The device may be a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc. .
The device 110 may include one or more of the following components: a processing component 1101, a memory 1102, a power supply component 1103, an input/output (I/O) interface 1104, and a communication component 1106.
The processing component 1101 generally controls the overall operation of the device 110, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 1101 may include one or more processors 1105 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 1101 may include one or more modules to facilitate interaction between the processing component 1101 and other components. For example, the processing component 1101 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 1101.
The memory 1102 is configured to store various types of data to support operation at the device 110. Examples of such data include instructions for any application or method operating on the device 110, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1102 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
The power supply component 1103 provides power to the various components of the device 110. The power supply component 1103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 110.
The I/O interface 1104 provides an interface between the processing component 1101 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The communication component 1106 is configured to facilitate wired or wireless communication between the device 110 and other devices. The device 110 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1106 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1106 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 110 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above methods.
The present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data skew processing method performed by the above data skew processing device.
The above computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc. The readable storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). Of course, the processor and the readable storage medium may also exist in the device as discrete components.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disk, or optical disc.
An embodiment of the present invention further provides a computer program product including a computer program which, when executed by a processor, implements the data skew processing method performed by the above data skew processing device.
An embodiment of the present invention further provides a chip for running instructions. The chip includes a memory and a processor; code and data are stored in the memory; the memory is coupled to the processor; and the processor runs the code in the memory so that the chip performs the data skew processing method performed by the above data skew processing device.
An embodiment of the present invention further provides a computer program which, when executed by a processor, performs the data skew processing method performed by the above data skew processing device.
Finally, it should be noted that the above embodiments are merely intended to illustrate, rather than limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some or all of the technical features therein, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (27)

  1. A data skew processing method, comprising:
    querying to-be-joined data tables through a query node in a logical execution plan to obtain skewed Keys;
    modifying the logical execution plan according to the skewed Keys and a data skew strategy, so that first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan; and
    generating the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
  2. 根据权利要求1所述的方法,其特征在于,所述通过逻辑执行计划中的查询节点对待连接数据表进行查询,获得数据倾斜的Key,包括:The method according to claim 1, wherein the step of querying the data table to be connected through the query node in the logic execution plan to obtain the Key of data skew includes:
    针对所述待连接数据表的每个Key,将所述Key对应的数据量与第一预设阈值进行对比;For each Key of the data table to be connected, comparing the amount of data corresponding to the Key with a first preset threshold;
    若所述Key对应的数据量大于所述第一预设阈值,则将所述Key确定为数据倾斜的Key。If the amount of data corresponding to the Key is greater than the first preset threshold, the Key is determined as a Key with skewed data.
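Outside Spark, the per-key threshold test of claim 2 reduces to a plain count-and-compare. The sketch below is a minimal single-machine illustration; the sample table, the key accessor, and the threshold value are invented for the example and not taken from the patent:

```python
from collections import Counter

def find_skewed_keys(rows, key_fn, threshold):
    """Return the keys whose row count exceeds `threshold`, i.e. the
    'data-skewed' Keys in the sense of the claim's first preset threshold."""
    counts = Counter(key_fn(row) for row in rows)
    return {key for key, n in counts.items() if n > threshold}

# Hypothetical table of (user_id, event) rows: user 42 dominates.
table = [(42, i) for i in range(1000)] + [(7, 0), (8, 0), (9, 0)]
skewed = find_skewed_keys(table, key_fn=lambda row: row[0], threshold=100)
print(sorted(skewed))  # [42]
```

In a real deployment the per-key volumes would come from the query node's scan of the tables to be joined rather than from an in-memory list.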
  3. The method according to claim 1 or 2, wherein modifying the logical execution plan according to the data-skewed Keys and the data skew strategy comprises:
    adding the following processing steps to the logical execution plan:
    splitting the data of the data tables to be joined according to the data-skewed Keys, to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to non-skewed Keys;
    merging the joined first sub-table and the joined second sub-table to obtain a final data table.
  4. The method according to claim 3, wherein merging the joined first sub-table and the joined second sub-table comprises:
    merging the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
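Claims 3 and 4 amount to: split one table by the skewed Keys, join each part against the other table, then Union the two joined results. A minimal in-memory sketch follows; for simplicity both parts use the same hash join here, whereas in Spark the skewed part would be a map-side join and the rest a reduce-side join, and all table contents are invented for illustration:

```python
def hash_join(left, right):
    """Inner equi-join of two lists of (key, value) pairs on the key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def split_join_union(left, right, skewed_keys):
    """Split `left` into a skewed and a non-skewed sub-table, join each
    against `right`, then merge the two joined results (the Union step)."""
    first = [row for row in left if row[0] in skewed_keys]       # skewed Keys
    second = [row for row in left if row[0] not in skewed_keys]  # other Keys
    return hash_join(first, right) + hash_join(second, right)

left = [(1, "a"), (1, "b"), (2, "c")]
right = [(1, "X"), (2, "Y")]
print(split_join_union(left, right, skewed_keys={1}))
# [(1, 'a', 'X'), (1, 'b', 'X'), (2, 'c', 'Y')]
```

The split changes only how the work is partitioned, not the result: the Union of the two joined sub-tables equals the join of the unsplit tables.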
  5. The method according to claim 3, wherein the data tables to be joined comprise a first data table and a second data table, and the data-skewed Keys come from the first data table and/or the second data table.
  6. The method according to claim 5, wherein there are multiple data-skewed Keys and multiple first sub-tables to be joined, the data-skewed Keys corresponding one-to-one to the first sub-tables;
    wherein the first sub-table corresponding to the data-skewed Keys, split from the data tables to be joined, undergoing a Map Join when Spark executes the physical execution plan comprises:
    for each data-skewed Key, performing a Map Join on the first sub-table corresponding to the data-skewed Key when Spark executes the physical execution plan.
  7. The method according to claim 5, wherein there are multiple data-skewed Keys and at least one first sub-table to be joined;
    wherein modifying the logical execution plan according to the data-skewed Keys and the data skew strategy further comprises:
    grouping the data-skewed Keys to obtain multiple groups, the total data volume of each group being less than a second preset threshold;
    each group corresponding to one first sub-table to be joined;
    and wherein the first sub-table corresponding to the data-skewed Keys, split from the data tables to be joined, undergoing a Map Join when Spark executes the physical execution plan comprises:
    for each group, performing a Map Join on the first sub-table corresponding to the group when Spark executes the physical execution plan.
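The grouping step of claim 7 is a bin-packing problem: collect skewed Keys into groups whose total data volume stays below the second preset threshold. The patent does not prescribe a packing algorithm, so the first-fit-decreasing heuristic and the per-key sizes below are assumptions made for the sketch:

```python
def group_skewed_keys(key_sizes, max_group_size):
    """Pack skewed keys into groups so each group's total data volume is
    less than `max_group_size` (first-fit decreasing). Assumes every
    individual key's volume is itself below the threshold."""
    groups = []  # each entry: [list_of_keys, total_size]
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if group[1] + size < max_group_size:
                group[0].append(key)
                group[1] += size
                break
        else:
            groups.append([[key], size])
    return [group[0] for group in groups]

sizes = {"a": 50, "b": 40, "c": 30, "d": 20}  # hypothetical per-key volumes
print(group_skewed_keys(sizes, max_group_size=100))  # [['a', 'b'], ['c', 'd']]
```

Each resulting group then becomes one first sub-table to be joined, so a single map-side join can handle several moderately skewed Keys instead of one join per Key.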
  8. The method according to any one of claims 1 to 7, further comprising, before querying the data tables to be joined through the query node in the logical execution plan:
    parsing structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan; resolving the unresolved logical execution plan to obtain the logical execution plan;
    adding the query node to the logical execution plan.
  9. The method according to any one of claims 1 to 8, wherein generating the physical execution plan according to the modified logical execution plan comprises:
    updating the modified logical execution plan to obtain an updated logical execution plan;
    optimizing the updated logical execution plan to obtain an optimized logical execution plan;
    converting the optimized logical execution plan into the physical execution plan.
  10. The method according to any one of claims 3 to 7, wherein the second sub-table to be joined undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
  11. The method according to claim 10, wherein the Map Join is a broadcast hash join (BroadcastHashJoin) and the Reduce Join is a sort merge join (SortMergeJoin).
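The two join flavours named in claims 10 and 11 can be sketched without Spark: a broadcast hash join builds a hash index over the small (broadcast) side and probes it while streaming the other side, avoiding a shuffle of the large table, while a sort merge join sorts both sides by key and merges them with two cursors. Both functions below are simplified single-machine illustrations of the ideas, not Spark's implementations:

```python
def broadcast_hash_join(stream_side, build_side):
    """Map-side join: hash the small side, probe it per streamed row."""
    index = {}
    for key, value in build_side:
        index.setdefault(key, []).append(value)
    return [(k, sv, bv) for k, sv in stream_side for bv in index.get(k, [])]

def sort_merge_join(a, b):
    """Reduce-side join: sort both sides by key, merge with two cursors."""
    a, b = sorted(a), sorted(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            key, ai, bj = a[i][0], i, j
            while i < len(a) and a[i][0] == key:
                i += 1
            while j < len(b) and b[j][0] == key:
                j += 1
            # Cross-product of the matching runs on both sides.
            out.extend((key, av, bv) for _, av in a[ai:i] for _, bv in b[bj:j])
    return out

big = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
small = [(2, "X"), (3, "Y")]
assert sorted(broadcast_hash_join(big, small)) == sorted(sort_merge_join(big, small))
```

The trade-off motivating the claims: the broadcast join needs no shuffle but requires one side to fit in memory, so it suits the small per-Key sub-tables carved out for skewed Keys, while the sort merge join scales to two large sides and suits the remaining non-skewed data.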
  12. A data skew processing device, comprising:
    a query module, configured to query, through a query node in a logical execution plan, data tables to be joined, to obtain keys (Keys) exhibiting data skew;
    a modification module, configured to modify the logical execution plan according to the data-skewed Keys and a data skew strategy, so that a first sub-table to be joined, split from the data tables to be joined and corresponding to the data-skewed Keys, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
    a generation module, configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  13. The device according to claim 12, wherein the query module is specifically configured to:
    for each Key of the data tables to be joined, compare the amount of data corresponding to the Key with a first preset threshold;
    if the amount of data corresponding to the Key is greater than the first preset threshold, determine the Key as a data-skewed Key.
  14. The device according to claim 12 or 13, wherein the modification module is specifically configured to:
    add the following processing steps to the logical execution plan:
    splitting the data of the data tables to be joined according to the data-skewed Keys, to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to non-skewed Keys;
    merging the joined first sub-table and the joined second sub-table to obtain a final data table.
  15. The device according to claim 14, wherein the modification module is specifically configured to:
    merge the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
  16. The device according to claim 14, wherein the data tables to be joined comprise a first data table and a second data table, and the data-skewed Keys come from the first data table and/or the second data table.
  17. The device according to claim 16, wherein there are multiple data-skewed Keys and multiple first sub-tables to be joined, the data-skewed Keys corresponding one-to-one to the first sub-tables;
    the modification module being specifically configured to: for each data-skewed Key, perform a Map Join on the first sub-table corresponding to the data-skewed Key when Spark executes the physical execution plan.
  18. The device according to claim 16, wherein there are multiple data-skewed Keys and at least one first sub-table to be joined;
    the modification module being further configured to:
    group the data-skewed Keys to obtain multiple groups, the total data volume of each group being less than a second preset threshold, each group corresponding to one first sub-table to be joined;
    for each group, perform a Map Join on the first sub-table corresponding to the group when Spark executes the physical execution plan.
  19. The device according to any one of claims 12 to 18, further comprising:
    a syntax analysis module, configured to parse structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
    a resolution module, configured to resolve the unresolved logical execution plan to obtain the logical execution plan;
    a creation module, configured to add the query node to the logical execution plan.
  20. The device according to any one of claims 12 to 19, wherein the generation module is specifically configured to:
    update the modified logical execution plan to obtain an updated logical execution plan;
    optimize the updated logical execution plan to obtain an optimized logical execution plan;
    convert the optimized logical execution plan into the physical execution plan.
  21. The device according to any one of claims 14 to 18, wherein the second sub-table to be joined undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
  22. The device according to claim 21, wherein the Map Join is a broadcast hash join (BroadcastHashJoin) and the Reduce Join is a sort merge join (SortMergeJoin).
  23. A data skew processing device, comprising: at least one processor and a memory;
    the memory storing computer-executable instructions;
    the at least one processor executing the computer-executable instructions stored in the memory, so that the at least one processor performs the data skew processing method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the data skew processing method according to any one of claims 1 to 11.
  25. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data skew processing method according to any one of claims 1 to 11.
  26. A chip for running instructions, wherein the chip includes a memory and a processor, the memory stores code and data and is coupled to the processor, and the processor runs the code in the memory so that the chip performs the data skew processing method according to any one of claims 1 to 11.
  27. A computer program which, when executed by a processor, performs the data skew processing method according to any one of claims 1 to 11.
PCT/CN2022/084642 2021-09-27 2022-03-31 Data skew processing method, device, storage medium, and program product WO2023045295A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111139049.1 2021-09-27
CN202111139049.1A CN113821541A (en) 2021-09-27 2021-09-27 Data skew processing method, apparatus, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2023045295A1 true WO2023045295A1 (en) 2023-03-30

Family

ID=78921369

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084642 WO2023045295A1 (en) 2021-09-27 2022-03-31 Data skew processing method, device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN113821541A (en)
WO (1) WO2023045295A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009094A (en) * 2023-10-07 2023-11-07 联通在线信息科技有限公司 Data oblique scattering method and device, electronic equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data skew processing method, apparatus, storage medium, and program product
CN117149717A (en) * 2023-08-31 2023-12-01 中电云计算技术有限公司 Table connection processing method, apparatus, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930479A (en) * 2016-04-28 2016-09-07 乐视控股(北京)有限公司 Data skew processing method and apparatus
CN105975463A (en) * 2015-09-25 2016-09-28 武汉安天信息技术有限责任公司 Method and system for identifying and optimizing data skewness based on MapReduce
CN106874322A (en) * 2016-06-27 2017-06-20 阿里巴巴集团控股有限公司 A kind of data table correlation method and device
CN107066612A (en) * 2017-05-05 2017-08-18 郑州云海信息技术有限公司 A kind of self-adapting data oblique regulating method operated based on SparkJoin
CN109299131A (en) * 2018-11-14 2019-02-01 百度在线网络技术(北京)有限公司 A kind of spark querying method that supporting trust computing and system
CN110673794A (en) * 2019-09-18 2020-01-10 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
US10691597B1 (en) * 2019-08-10 2020-06-23 MIFrontiers Corporation Method and system for processing big data
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data skew processing method, apparatus, storage medium, and program product


Also Published As

Publication number Publication date
CN113821541A (en) 2021-12-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871360

Country of ref document: EP

Kind code of ref document: A1