WO2023045295A1 - Data skew processing method, device, storage medium, and program product - Google Patents

Data skew processing method, device, storage medium, and program product

Info

Publication number
WO2023045295A1
WO2023045295A1 · PCT/CN2022/084642 · CN2022084642W
Authority
WO
WIPO (PCT)
Prior art keywords
data
execution plan
key
skew
sub
Prior art date
Application number
PCT/CN2022/084642
Other languages
French (fr)
Chinese (zh)
Inventor
魏秀利 (Wei Xiuli)
Original Assignee
北京沃东天骏信息技术有限公司 (Beijing Wodong Tianjun Information Technology Co., Ltd.)
北京京东世纪贸易有限公司 (Beijing Jingdong Century Trading Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2023045295A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the embodiments of the present invention relate to the field of computer technology, and in particular, to a data skew processing method, device, storage medium, and program product.
  • For a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. Often the data is not sufficiently dispersed, so a large amount of data becomes concentrated on one or a few service nodes; this is called data skew.
  • Take the distributed computing engine Spark as an example. When the Spark computing engine performs a shuffle, it needs to pull the data for the same key (Key) on each node to a single task on some node for processing. The running progress of the entire Spark job is determined by the task with the longest running time, so data skew on some Keys reduces the overall computing efficiency of Spark.
  • Adaptive Query Execution (AQE) technology can be introduced at the engine-core level.
  • For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution: it dynamically detects the amount of skewed data and divides skewed partitions into smaller sub-partitions for processing.
  • However, the prior art has at least the following problems: optimizing data skew through AQE technology depends on the accuracy of the statistical information, and it supports only some scenarios, for example only a single join within the same stage, which is a significant limitation.
  • Embodiments of the present invention provide a data skew processing method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of data skew processing.
  • an embodiment of the present invention provides a data skew processing method, including:
  • the data table to be connected is queried through the query node in the logical execution plan to obtain the Key of data skew, including:
  • the Key is determined as a Key with skewed data.
  • modifying the logical execution plan according to the data-skewed Key and the data skew strategy includes:
  • merging the joined first sub-table to be joined and the joined second sub-table to be joined includes:
  • the to-be-connected data tables include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table.
  • performing Map Join, when Spark executes the physical execution plan, on the first sub-table to be joined that corresponds to the data-skewed Key and is obtained by splitting the data table to be joined includes:
  • Map Join is performed on the first sub-table to be joined corresponding to the data-skewed Key when Spark executes the physical execution plan.
  • modifying the logical execution plan according to the data-skewed Key and the data skew policy further includes:
  • Each group corresponds to one of the first to-be-connected sub-tables
  • performing Map Join, when Spark executes the physical execution plan, on the first sub-table to be joined that corresponds to the data-skewed Key and is obtained by splitting the data table to be joined includes:
  • Map Join is performed on the first subtable to be joined corresponding to the group when Spark executes the physical execution plan.
  • the query node is added to the logical execution plan.
  • the generating the physical execution plan according to the modified logical execution plan includes:
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • an embodiment of the present invention provides a data skew processing device, including:
  • the query module is used to query the data table to be connected through the query node in the logical execution plan to obtain the key Key of the data skew;
  • a modifying module configured to modify the logical execution plan according to the data-skewed Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and split from the data table to be joined, undergoes Map Join when the distributed computing engine Spark executes the physical execution plan;
  • the generating module is configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  • the query module is specifically used for:
  • the Key is determined as a Key with skewed data.
  • the modification module is specifically used for:
  • the modification module is specifically used for:
  • the to-be-connected data tables include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table.
  • the modification module is specifically used for: for each data-skewed Key, performing Map Join on the first sub-table to be joined corresponding to that Key when Spark executes the physical execution plan.
  • the modification module is also used for:
  • Map Join is performed on the first subtable to be joined corresponding to the group when Spark executes the physical execution plan.
  • the device further includes:
  • a syntax analysis module, used to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
  • a parsing module configured to parse the unresolved logical execution plan to obtain the logical execution plan;
  • a creation module is used for adding the query node in the logical execution plan.
  • the generating module is specifically used for:
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • an embodiment of the present invention provides a data skew processing device, including: at least one processor and a memory;
  • the memory stores computer-executable instructions
  • the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the first aspect above and its various possible designs.
  • an embodiment of the present invention provides a computer-readable storage medium, where computer-executable instructions are stored; when a processor executes the computer-executable instructions, the method described in the first aspect above and its various possible designs is implemented.
  • an embodiment of the present invention provides a computer program product, including a computer program.
  • when the computer program is executed by a processor, the method described in the first aspect above and its various possible designs is implemented.
  • an embodiment of the present invention provides a chip for running instructions; the chip includes a memory and a processor, codes and data are stored in the memory, the memory is coupled to the processor, and the processor runs the codes in the memory to enable the chip to execute the method described in the first aspect above and its various possible designs.
  • an embodiment of the present invention provides a computer program, which is used to execute the method described in the first aspect and various possible designs of the first aspect when the computer program is executed by a processor.
  • The data skew processing method, device, storage medium, and program product provided in this embodiment query the data table to be joined through the query node in the logical execution plan to obtain the data-skewed Key; modify the logical execution plan according to that Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and generate the physical execution plan from the modified logical execution plan, so that it can be executed through Spark.
  • The data skew processing method provided in this embodiment changes the processing of skewed Keys from Reduce Join to Map Join by modifying the logical execution plan, which avoids many problems caused by data skew at the root, reduces scenario restrictions, reduces the dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • Fig. 1 is a schematic flow chart of linking two data tables provided by an embodiment of the present invention
  • Fig. 2 is a schematic flow diagram of a join operation in the prior art that splits the skewed Key;
  • FIG. 3 is a schematic flow chart of a data skew processing method provided by an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention.
  • Fig. 5 is a schematic flow chart of linking two data tables provided by another embodiment of the present invention.
  • Fig. 6 is a directed acyclic graph of a join operation on two data tables in the prior art;
  • Fig. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention;
  • Fig. 8 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention;
  • FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a hardware structure of a data skew processing device provided by an embodiment of the present invention.
  • For a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. Often the data is not sufficiently dispersed, resulting in a large amount of data being concentrated on one or a few service nodes, which is called data skew.
  • When the distributed computing engine Spark performs a shuffle, it needs to pull the data for the same Key on each node to a task on some node for processing, for example to perform aggregation or a join according to the Key. If the amount of data corresponding to a certain Key is particularly large, data skew occurs. For example, if most Keys correspond to 10 records but an individual Key corresponds to 1 million records, then most tasks are allocated only 10 records and finish in a few seconds, while an individual task may be allocated 1 million records and run for an hour or two. The running progress of the entire Spark job is therefore determined by the task with the longest running time.
  • Fig. 1 is a schematic flow diagram of a join operation on two data tables provided by an embodiment of the present invention. As shown in Fig. 1, Table 1-1 shows the students' mathematics competition results: the first column contains the student ID numbers, and the second column contains the mathematics competition results corresponding to the different IDs.
  • Table 1-2 shows the students' English competition results.
  • The first column of data includes the student ID number, and the second column is the English competition results corresponding to different IDs. From the two tables, it can be seen that the student with ID 001 in Table 1-1 has a large number of grade records. When the two tables are joined, each ID is equivalent to a Key; obviously the amount of data corresponding to Key001 is relatively large, and when the task corresponding to Key001 is processed, it takes more time than the other Keys. Therefore, it can be said that Key001 has data skew. Of course, this is just an example for a more vivid understanding of data skew; in actual applications, the conditions on the data volume of skewed Keys can be set as needed.
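The skew check described above can be sketched in plain Python (an illustrative model only, not part of the patented method or of Spark itself; the table contents and the threshold are made up for the example): count the rows per Key and flag any Key whose row count exceeds a preset threshold.

```python
from collections import Counter

# Hypothetical stand-in for Table 1-1: (student ID, maths result) rows,
# with ID 001 deliberately over-represented.
table_1_1 = [("001", 90), ("001", 85), ("001", 88), ("001", 92),
             ("002", 75), ("003", 80)]

def skewed_keys(rows, threshold):
    # Count rows per Key and keep the Keys above the first preset threshold.
    counts = Counter(key for key, _ in rows)
    return {key for key, n in counts.items() if n > threshold}

skew = skewed_keys(table_1_1, 2)   # ID 001 has 4 rows, the others 1 each
```

In a real deployment the "amount of data" could equally be bytes rather than a record count, as the description notes below.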
  • The Spark kernel introduces Adaptive Query Execution (AQE) technology. For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution, dynamically finds the skewed data, and divides the skewed partitions into smaller sub-partitions for processing.
  • On the one hand, AQE technology relies heavily on runtime statistical information; if the statistics are inaccurate, data skew will be misjudged or missed. On the other hand, AQE supports only the single-join scenario, not the multi-table join scenario, and the AQE data skew optimization logic is not triggered when a shuffle follows the join.
  • In addition, AQE-based data skew processing governs skew at the partition granularity; if the skewed data is produced by the same Mapper, the skew cannot be resolved.
  • Moreover, AQE technology is an optimization at the level of the physical execution plan (Physical Plan).
  • The overall execution plan of Spark SQL mainly includes the logical execution plan (Logical Plan) and the physical execution plan (Physical Plan).
  • The latter is obtained by transforming the former, which means that if we start from the logical execution plan and optimize for data skew, we can solve the data skew problem at the root and avoid defects at the application layer or the physical-execution-plan level. Based on this, the embodiment of the present invention provides a data skew processing method.
  • FIG. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention. As shown in Figure 3, the method includes:
  • The data tables to be joined may include at least two data tables, for example a first data table and a second data table, which can be queried separately through the query node in the logical execution plan; there may also be three or more data tables, such as a first, second, and third data table, which can be queried in sequence. In this embodiment, the number of data tables to be joined and the query order are not limited.
  • Querying the data table to be joined through the query node in the logical execution plan to obtain the data-skewed Key includes: for each Key of the data table to be joined, comparing the data amount corresponding to the Key with a first preset threshold; if the data amount corresponding to the Key is greater than the first preset threshold, determining the Key as a data-skewed Key.
  • The query can be performed Key by Key. As shown in Figure 1, assuming Table 1-1 and Table 1-2 are the data tables to be joined, one feasible way is to first query Key001 in Table 1-1, then query the other Keys such as Key002 and Key003 in turn, then query Key001 in Table 2-1, and then query its other Keys such as Key002 and Key003 in turn. This approach helps find the skewed Keys as early as possible, and if a Key's data volume is too small for it to become a skewed Key, the query can be stopped in time to save computation.
  • the specific query mode to be used may be determined according to actual needs, which is not limited in this embodiment.
  • the amount of data corresponding to the Key may be compared with a first preset threshold.
  • The amount of data here may be the size of the data, such as how many megabytes or gigabytes, or the number of records.
  • the first preset threshold may be a fixed value determined empirically.
  • a map-side join (Map Join) is performed.
  • The logical execution plan is modified based on the data-skewed Key and the data skew strategy, so that when the Spark computing engine subsequently executes the physical execution plan, it can perform a map-side join on the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined.
  • A map-side Map Join or a reduce-side Reduce Join can be used to join the data tables to be joined, that is, to merge data from different data sources.
  • Reduce Join tags the data in the Map stage and merges the data in the Reduce stage.
  • Map Join merges the data directly in the Map stage, with no Reduce stage.
  • Take Table 1-1 and Table 2-1 in Figure 1 as an example to illustrate the Reduce Join process.
  • the input data will be uniformly encapsulated into a Bean.
  • This Bean contains all common and non-common attributes of Table 1-1 and Table 2-1, which is equivalent to a full outer join, and adds a new attribute, the file name, to distinguish whether the data comes from Table 1-1 or Table 2-1, which is convenient for data processing in the Reduce phase. The Key output by the Map is the student ID, and the Value is the Bean.
  • In the Reduce phase, the Beans are sorted by ID; all data with the same ID are aggregated under the same Key and sent to the same Reduce task, where the rows are merged according to their source, whether Table 1-1 or Table 2-1.
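The Reduce Join flow just described can be modelled in plain Python (an illustrative sketch, not Spark code; a dict plays the role of the Bean, with a "source" tag standing in for the file-name attribute):

```python
from collections import defaultdict

def reduce_join(table_a, table_b):
    # "Map stage": tag every record with its source table, keyed by the ID.
    tagged = defaultdict(list)
    for key, value in table_a:
        tagged[key].append(("table_a", value))
    for key, value in table_b:
        tagged[key].append(("table_b", value))
    # "Reduce stage": per Key, merge values coming from the two sources.
    joined = []
    for key, beans in tagged.items():
        a_vals = [v for src, v in beans if src == "table_a"]
        b_vals = [v for src, v in beans if src == "table_b"]
        for a in a_vals:
            for b in b_vals:
                joined.append((key, a, b))
    return joined
```

The grouping of all records for one Key into `tagged[key]` models the shuffle that sends every record with the same ID to the same Reduce task.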
  • If Map Join is performed on Table 1-1 and Table 2-1 instead, there is no Reduce process; all work is completed in the Map phase, which greatly reduces the cost of network transmission and input/output.
  • For example, the first sub-table to be joined includes a first sub-data table corresponding to Key001 from Table 1-1 and a second sub-data table corresponding to Key001 from Table 2-1.
  • Performing a map-side join (Map Join) on the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, when the distributed computing engine Spark executes the physical execution plan can be implemented in Spark by map-joining the first sub-data table and the second sub-data table: specifically, the second sub-data table is pre-cached on each Map task node, and when the data of the first sub-data table arrives, the pre-stored data of the second sub-data table is joined directly with the data of the first sub-data table and output.
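The map-side variant can be sketched the same way (again a plain-Python illustration, not Spark's BroadcastHashJoin implementation): the small table is materialized as an in-memory lookup, standing in for the table pre-cached on each Map task node, so rows of the large table are joined as they stream through, with no Reduce stage.

```python
def map_join(large_table, small_table):
    # "Broadcast" the small table: build a per-Key lookup held in memory.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    # Stream the large table and merge directly in the Map stage.
    return [(key, a, b)
            for key, a in large_table
            for b in lookup.get(key, [])]
```

Because nothing is shuffled by Key, the skewed Key's records never pile up on a single reducer.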
  • Spark's execution of the API in the code mainly includes the following steps: first, write the DataFrame/Dataset/SQL code; second, if the written code has no errors, Spark converts the code into a logical execution plan; third, Spark performs a series of optimizations on the generated logical execution plan and converts the optimized logical execution plan into a physical execution plan; finally, Spark executes the physical execution plan, that is, performs a series of operations on Resilient Distributed Datasets (RDDs).
  • a physical execution plan may be generated based on the modified logical execution plan.
  • the logical execution plan to be modified may be an optimized logical execution plan or an unoptimized logical execution plan, which is not limited in this embodiment.
  • In this embodiment, the data table to be joined is queried through the query node in the logical execution plan to obtain the data-skewed Key; the logical execution plan is modified according to the data-skewed Key and the data skew strategy, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and the physical execution plan is generated from the modified logical execution plan, so that it can be executed through Spark.
  • The data skew processing method provided in this embodiment changes the processing of skewed Keys from Reduce Join to Map Join by modifying the logical execution plan, which avoids many problems caused by data skew at the root, reduces scenario restrictions, reduces the dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • FIG. 4 is a schematic flow chart of a data skew processing method provided by another embodiment of the present invention. As shown in FIG. 4, on the basis of the above embodiments, for example the embodiment shown in FIG. 3, this embodiment describes in detail how to modify the logical execution plan; the method includes:
  • step 401 is similar to step 301 in the foregoing embodiment, and will not be repeated here.
  • a processing step for the sub-tables to be joined, so that the first sub-table to be joined, corresponding to the data-skewed Key and obtained by splitting the data table to be joined, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
  • the data tables to be connected include a first data table and a second data table, and the key of the data skew comes from the first data table and/or the second data table. That is to say, any data table in the data table to be connected may be determined to have a key with data skew.
  • only one of the two data tables has a key with skewed data, that is, Key001 in Table 1-1.
  • There are data-skewed Keys in both data tables, that is, the data-skewed Key in Table 1-1 is 001 and the data-skewed Key in Table 4-1 is 003.
  • The join process of the data tables to be joined, Table 1-1 and Table 4-1, is illustrated below in conjunction with FIG. 5.
  • the key of the data skew in Table 1-1 is 001
  • The data-skewed Key in Table 4-1 is 003. Therefore, after Table 1-1 is split, Table 1-1-1 containing the data corresponding to the skewed Keys 001 and 003, and Table 1-1-2 containing the non-skewed Key 002, are obtained; after Table 4-1 is split, Table 4-1-1 containing the data corresponding to the skewed Keys 001 and 003, and Table 4-1-2 containing the non-skewed Key 002, are obtained. That is, after Table 1-1 and Table 4-1 are split according to the data-skewed Keys, the first sub-table to be joined, consisting of Table 1-1-1 and Table 4-1-1, and the second sub-table to be joined, consisting of Table 1-1-2 and Table 4-1-2, are obtained.
  • Table 1-1 and Table 4-1 in Figure 5 are just examples, and only the three Keys 001 to 003 are shown in order to describe the data table join process. In an actual data table, the number of Keys can reach tens of thousands or even tens of millions.
  • Table 1-1-1 and Table 4-1-1 in the first sub-table to be joined are both small tables on which Map Join can be performed when Spark executes the physical execution plan, while Table 1-1-2 and Table 4-1-2 in the second sub-table to be joined are large tables on which Reduce Join can be performed when Spark executes the physical execution plan.
  • the implementation process of Map Join and Reduce Join can refer to the description of step 302, and will not be repeated here.
  • Merging the joined first sub-table to be joined and the joined second sub-table to be joined includes: using a Union operator to merge the joined first sub-table to be joined with the joined second sub-table to be joined to obtain the final data table.
  • the Map Join is a broadcast hash join (BroadcastHashJoin);
  • the Reduce Join is a sort-merge join (SortMergeJoin).
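The split / join / union pipeline described above can be sketched end to end in plain Python (an illustrative stand-in only: `_hash_join` models BroadcastHashJoin, `_sort_merge_join` models SortMergeJoin, and the function and table names are hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def _hash_join(left, right):
    # Models BroadcastHashJoin: build an in-memory lookup from one side.
    lookup = {}
    for k, v in right:
        lookup.setdefault(k, []).append(v)
    return [(k, a, b) for k, a in left for b in lookup.get(k, [])]

def _sort_merge_join(left, right):
    # Models SortMergeJoin: sort both sides, group by Key, emit matches.
    lg = {k: [v for _, v in g] for k, g in groupby(sorted(left), key=itemgetter(0))}
    rg = {k: [v for _, v in g] for k, g in groupby(sorted(right), key=itemgetter(0))}
    return [(k, a, b) for k in sorted(lg.keys() & rg.keys())
            for a in lg[k] for b in rg[k]]

def skew_aware_join(table_a, table_b, skew_keys):
    # Split each table into a skewed first sub-table and a non-skewed second sub-table.
    a_skew = [r for r in table_a if r[0] in skew_keys]
    a_rest = [r for r in table_a if r[0] not in skew_keys]
    b_skew = [r for r in table_b if r[0] in skew_keys]
    b_rest = [r for r in table_b if r[0] not in skew_keys]
    # Map Join the skewed sub-tables, Reduce Join the rest, then Union the results.
    return _hash_join(a_skew, b_skew) + _sort_merge_join(a_rest, b_rest)
```

The final concatenation plays the role of the Union operator that merges the two joined sub-tables into the final data table.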
  • In some embodiments, there are multiple data-skewed Keys and multiple first sub-tables to be joined, and the data-skewed Keys correspond one-to-one with the first sub-tables to be joined.
  • Performing Map Join, when Spark executes the physical execution plan, on the first sub-tables to be joined that correspond to the data-skewed Keys and are obtained by splitting the data table to be joined then includes: for each data-skewed Key, performing Map Join on the first sub-table to be joined corresponding to that Key.
  • In this case, each data-skewed Key corresponds to one first sub-table to be joined; that is, when Table 1-1 is split, two first sub-tables to be joined are obtained: one contains only the data of Key001, while the other contains only the data of Key003.
  • Stage1 and Stage2 target one first sub-table to be joined, and Stage3 and Stage4 target the other first sub-table to be joined; specifically, both realize the Map Join of the data-skewed Keys through BroadcastHashJoin.
  • Stage5 and Stage6 are for the second sub-table to be joined.
  • The Reduce Join of the non-skewed Keys is realized through SortMergeJoin.
  • In Stage7, the two first sub-tables to be joined processed by BroadcastHashJoin and the second sub-table to be joined processed by SortMergeJoin are merged by the Union operator to obtain the final data table.
  • In some embodiments, modifying the logical execution plan according to the data-skewed Keys and the data skew strategy further comprises: grouping the data-skewed Keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold and each group corresponds to one first sub-table to be joined; performing Map Join when Spark executes the physical execution plan on the first sub-tables to be joined obtained by splitting the data table to be joined then includes: for each group, performing Map Join on the first sub-table to be joined corresponding to that group.
  • For example, the skewed Keys 001 and 003 can be grouped based on the second preset threshold: assuming the sum of the data volumes corresponding to 001 and 003 is less than the second preset threshold, 001 and 003 can be placed in one group as shown in FIG. 5, and the data table to be joined then has only this one group. As shown in Figure 7, for this group one first sub-table to be joined is obtained; after the processing of Stage1 and Stage2, BroadcastHashJoin can be performed on it to realize the Map Join of the first sub-table to be joined.
  • Stage3 and Stage4 are for the second sub-table to be joined.
  • The Reduce Join of the non-skewed Keys is realized through SortMergeJoin.
  • In Stage5, the first sub-table to be joined processed by BroadcastHashJoin and the second sub-table to be joined processed by SortMergeJoin are merged by the Union operator to obtain the final data table.
  • The second preset threshold may be set according to experience, which is not limited in this embodiment.
  • For example, the 100 data-skewed Keys can be grouped based on the second preset threshold in various ways.
  • In one way, the data-skewed Keys can be sorted by Key number, and the data volume of the first Key in the ordering is compared against the second preset threshold. If it is less than the threshold, the total data volume of the first and second Keys is compared against the threshold; if that is still less, the total of the first, second, and third Keys is compared, and so on, until at the N-th Key the total exceeds the second preset threshold. The Keys before the N-th are then placed into one group, and the above judgment continues from the N-th Key.
  • In another way, the data-skewed Keys may be sorted by the data volume corresponding to each Key, and the sorted Keys are then grouped based on the second preset threshold. This is not limited in this embodiment and can be selected according to actual needs.
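The first grouping strategy above can be sketched as a greedy pass in plain Python (illustrative only; the Key numbers, data volumes, and threshold below are hypothetical): walk the sorted skewed Keys, accumulating their data volumes, and start a new group whenever adding the next Key would exceed the second preset threshold.

```python
def group_keys(key_sizes, threshold):
    # key_sizes: {key: data volume}; threshold: the second preset threshold.
    groups, current, total = [], [], 0
    for key, size in sorted(key_sizes.items()):
        if current and total + size > threshold:
            groups.append(current)       # close the group before it overflows
            current, total = [], 0
        current.append(key)
        total += size
    if current:
        groups.append(current)
    return groups
```

With a threshold of 100 and volumes {001: 40, 003: 50}, the two Keys land in one group, matching the single-group case of FIG. 5; with a threshold of 60 each Key gets its own group.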
  • FIG. 6 is a directed acyclic graph of a join operation on two data tables in the prior art.
  • Fig. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention.
  • Fig. 8 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention.
  • Stage1 and Stage2 each include the following steps in sequence: range (Range), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and sort (Sort); Stage3 includes the sort-merge join (SortMergeJoin).
  • Stage1 includes the following steps: range (Range), filter (Filter), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and broadcast exchange (BroadcastExchange);
  • Stage2 includes the following steps: range (Range), filter (Filter), projection (Project), exchange (Exchange), and custom shuffle read (CustomShuffleReader);
  • Stage3 and Stage4 each include range (Range), filter (Filter), projection (Project), exchange (Exchange), custom shuffle read (CustomShuffleReader), and sort (Sort);
  • Stage5 includes the following steps: broadcast hash join (BroadcastHashJoin), sort-merge join (SortMergeJoin), union (Union), and adaptive Spark execution plan (AdaptiveSparkPlan).
  • Table 1-1 and Table 4-1 shown in Figure 5 are joined: usually, Table 1-1 is processed by Stage1 and Table 4-1 by Stage2, after which SortMergeJoin is performed in Stage3 to realize a Reduce Join.
  • Table 1-1-1 is split into multiple next-level sub-tables, and correspondingly Table 4-1-2 is also split into multiple sub-tables. Then, for each group of sub-tables (a next-level sub-table of Table 1-1-1 and the corresponding next-level sub-table of Table 4-1-2), after processing by Stage1 and Stage2, BroadcastHashJoin is performed to realize multiple Map Joins. Finally, through the processing of Stage7, the multiple tables obtained by BroadcastHashJoin and the table obtained by SortMergeJoin are merged, and then the adaptive Spark execution plan is executed.
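The overall split strategy in this figure description (skewed keys joined via broadcast-style Map Joins, the remaining keys via a sort-based Reduce Join, with the results unioned at the end) can be illustrated with a small, self-contained Python simulation. Tables are modeled as lists of (key, value) rows; this is a sketch of the idea, not Spark's BroadcastHashJoin or SortMergeJoin code.

```python
def skew_aware_join(left, right, skewed_keys):
    """Split both tables by skewed keys, join the skewed slice with a
    broadcast-style hash map (stand-in for Map Join/BroadcastHashJoin),
    join the rest after sorting (stand-in for SortMergeJoin), then
    union the two results, mirroring the final merge stage."""
    skewed = set(skewed_keys)
    left_skew = [r for r in left if r[0] in skewed]
    left_rest = [r for r in left if r[0] not in skewed]
    right_skew = [r for r in right if r[0] in skewed]
    right_rest = [r for r in right if r[0] not in skewed]

    # "Map Join": hash the (small) skewed slice of the right table.
    broadcast = {}
    for k, v in right_skew:
        broadcast.setdefault(k, []).append(v)
    map_joined = [(k, lv, rv) for k, lv in left_skew
                  for rv in broadcast.get(k, [])]

    # "Reduce Join": sort both non-skewed slices, then merge by key.
    index = {}
    for k, v in sorted(right_rest):
        index.setdefault(k, []).append(v)
    reduce_joined = [(k, lv, rv) for k, lv in sorted(left_rest)
                     for rv in index.get(k, [])]

    # Union of the two joined halves gives the final table.
    return map_joined + reduce_joined
```

In the actual plan the two halves run as separate stages, so the heavy skewed keys no longer gate the shuffle-based join.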
  • In the prior-art processing mode, the Spark task running time is 5.5 minutes; in the embodiment of the present invention, that is, in the optimized processing mode shown in Figure 7 or Figure 8, the Spark task running time is 2.1 minutes, and the overall performance is improved by 60%.
  • Step 404 in this embodiment is similar to step 303 in the above embodiment, and will not be repeated here.
  • the data skew processing method provided by the embodiment of the present invention separates the skewed keys from the non-skewed keys by modifying the logical execution plan, so that when Spark executes the physical execution plan, a Map Join is performed on the first sub-table to be joined corresponding to each skewed key, which greatly reduces the running time.
  • This avoids the situation where the long running time of a skewed key drags down the running efficiency of the entire Spark task. Moreover, this method has no scenario restrictions, which solves the data skew problem at its root.
  • FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention. As shown in FIG. 9, on the basis of the above embodiments, for example, the embodiment shown in FIG. 3, the generation process of the logical execution plan in this embodiment and the process of generating the physical execution plan from the modified logical execution plan are described in detail.
  • the method includes:
  • The logical execution plan is mainly a series of abstract conversions; no executors or drivers are involved, it simply translates the user's set of expressions into the most optimized version. Specifically, the user's code is first converted into an unresolved logical execution plan (Unresolved Logical Plan). It is called unresolved because it is not necessarily correct:
  • the table names or column names referenced by the plan may or may not exist. Spark then uses the catalog (Catalog), a metadata repository containing all data tables and DataFrames, in the analyzer (Analyzer) to resolve and verify the referenced table and column names.
  • After this, the resolved logical execution plan (Resolved Logical Plan) is obtained. A query node (Query Node) is added to the logical execution plan so that the data tables to be joined can be queried through it.
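The analysis step sketched above (checking an unresolved plan's table and column references against the catalog) can be illustrated as follows. This is a toy model with assumed names, not Spark's actual Analyzer or Catalog API:

```python
def resolve_plan(unresolved_refs, catalog):
    """unresolved_refs: list of (table, column) references from the
    unresolved logical plan; catalog: dict mapping table name to its
    set of column names. Returns the references once verified, or
    raises if a name does not exist (the plan stays unresolved)."""
    resolved = []
    for table, column in unresolved_refs:
        if table not in catalog:
            raise ValueError("unknown table: " + table)
        if column not in catalog[table]:
            raise ValueError("unknown column: %s.%s" % (table, column))
        resolved.append((table, column))
    return resolved
```

Only after every reference is verified against the catalog does the plan count as resolved, which is why the query node is added to the resolved plan.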
  • The logical execution plan is modified according to the skewed keys and the data skew policy, so that the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join).
  • Step 904 and step 905 in this embodiment are similar to step 301 and step 302 in the above embodiment, and will not be repeated here.
  • After the logical execution plan is modified, it needs to be updated so that subsequent steps can apply the updated logical execution plan. The modified logical execution plan is passed to the Catalyst Optimizer, which generates an optimized logical execution plan through a series of optimizations. Spark then translates this logical execution plan into a physical execution plan, checking for feasible optimization strategies along the way.
  • The physical execution plan determines how to execute the logical plan on the cluster by generating different physical execution alternatives and comparing them through a cost model.
  • Once Spark chooses a physical plan, it runs all code on Spark's underlying programming interface, the RDD. At runtime Spark performs further optimizations, generates native Java bytecode that can optimize tasks or stages during execution, and finally returns the results to the user.
  • When Spark executes the physical execution plan, a Map Join can be performed on the first sub-table to be joined corresponding to the split skewed keys, which can greatly save running time. This avoids many problems caused by data skew at the root, reduces scenario restrictions, avoids dependence on statistical information, and improves the comprehensiveness and accuracy of data skew processing.
  • FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention.
  • the data skew processing device 100 includes: a query module 1001 , a modification module 1002 and a generation module 1003 .
  • the query module 1001 is configured to query the data table to be joined through the query node in the logical execution plan to obtain the skewed keys (Key).
  • a modification module 1002, configured to modify the logical execution plan according to the skewed keys and the data skew policy, so that the first sub-table to be joined corresponding to the skewed keys performs a map-side join (Map Join).
  • the generating module 1003 is configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  • the query module is specifically configured to: for each key of the data table to be joined, compare the amount of data corresponding to the key with a first preset threshold; if the amount of data corresponding to the key is greater than the first preset threshold, determine the key as a skewed key.
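A minimal sketch of this per-key threshold check (hypothetical function name; the real query module would work from table statistics rather than raw rows):

```python
from collections import Counter

def find_skewed_keys(rows, first_threshold):
    """rows: iterable of (key, value) pairs from the table to be joined.
    A key whose row count exceeds the first preset threshold is
    determined to be a skewed key."""
    counts = Counter(k for k, _ in rows)
    return {k for k, n in counts.items() if n > first_threshold}
```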
  • the modification module is specifically configured to:
  • the modification module is specifically configured to:
  • the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
  • the modification module is specifically configured to: for each skewed key, cause the first sub-table to be joined corresponding to that skewed key to perform a Map Join when Spark executes the physical execution plan.
  • the modification module is further configured to:
  • perform a Map Join on the first sub-table to be joined corresponding to each group when Spark executes the physical execution plan.
  • the device further includes:
  • a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan.
  • a parsing module, configured to parse the unresolved logical execution plan to obtain the logical execution plan.
  • a creation module, configured to add the query node to the logical execution plan.
  • the generation module is specifically configured to:
  • the Map Join is a broadcast hash join (BroadcastHashJoin)
  • the Reduce Join is a sort-merge join (SortMergeJoin).
  • the data skew processing device provided by the embodiment of the present invention can be used to execute the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
  • FIG. 11 is a schematic diagram of the hardware structure of a data skew processing device provided by an embodiment of the present invention.
  • the device may be a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • Device 110 may include one or more of the following components: processing component 1101 , memory 1102 , power supply component 1103 , input/output (I/O) interface 1104 , and communication component 1106 .
  • Processing component 1101 generally controls the overall operations of device 110, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 1101 may include one or more processors 1105 to execute instructions to complete all or part of the steps of the above method. Additionally, the processing component 1101 may include one or more modules to facilitate interaction between the processing component 1101 and other components. For example, the processing component 1101 may include a multimedia module to facilitate interaction between a multimedia component and the processing component 1101.
  • Memory 1102 is configured to store various types of data to support operations at device 110 . Examples of such data include instructions for any application or method operating on device 110, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 1102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 1103 provides power to various components of the device 110 .
  • Power components 1103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 110 .
  • the I/O interface 1104 provides an interface between the processing component 1101 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
  • Communication component 1106 is configured to facilitate wired or wireless communications between device 110 and other devices.
  • the device 110 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 1106 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 1106 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • device 110 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.
  • the present application also provides a computer-readable storage medium, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data skew processing method performed by the above data skew processing device is realized.
  • the above-mentioned computer-readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • A readable storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.
  • An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium.
  • the readable storage medium can also be a component of the processor.
  • the processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC).
  • the processor and the readable storage medium can also exist in the device as discrete components.
  • the aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the above-mentioned method embodiments; the aforementioned storage medium includes: ROM, RAM, magnetic disk, optical disk, and other media that can store program code.
  • An embodiment of the present invention also provides a computer program product, including a computer program.
  • When the computer program is executed by a processor, the data skew processing method performed by the above data skew processing device is implemented.
  • the embodiment of the present invention also provides a chip for running instructions.
  • the chip includes a memory and a processor. Codes and data are stored in the memory.
  • the memory is coupled to the processor.
  • the processor runs the code in the memory, which enables the chip to execute the data skew processing method performed by the above data skew processing device.
  • An embodiment of the present invention further provides a computer program, which is used to execute the data skew processing method performed by the above data skew processing device when the computer program is executed by a processor.

Abstract

A data skew processing method, a device, a storage medium, and a program product, relating to the field of computer technology. The method comprises: querying, by means of a query node in a logical execution plan, a data table to be joined to obtain the skewed keys; modifying the logical execution plan according to the skewed keys and a data skew policy, so that a first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a Map Join when Spark executes a physical execution plan; and generating the physical execution plan according to the modified logical execution plan. By modifying the logical execution plan, the method prevents at source many problems caused by data skew, thereby reducing usage-scenario limitations, avoiding dependence on statistical information, and improving the comprehensiveness and accuracy of data skew processing.

Description

Data Skew Processing Method, Device, Storage Medium, and Program Product
This application claims priority to Chinese patent application No. 202111139049.1, entitled "Data skew processing method, device, storage medium and program product", filed with the China Patent Office on September 27, 2021, the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of computer technology, and in particular, to a data skew processing method, device, storage medium, and program product.
Background
In a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. The situation where data is insufficiently dispersed, causing a large amount of data to be concentrated on one or a few service nodes, is called data skew. Taking the distributed computing engine Spark as an example, when Spark performs a Shuffle, it needs to pull the records with the same key (Key) from all nodes to a single task on one node for processing. The running progress of the entire Spark job is determined by the longest-running task, so once some keys become skewed, Spark's overall computing efficiency is reduced.
In the prior art, adaptive query execution (Adaptive Query Execution, AQE) technology can be introduced at the engine kernel level. For the above data skew problem, AQE uses runtime statistics to automatically optimize query execution, dynamically discovering the amount of skewed data and splitting skewed partitions into smaller sub-partitions for processing.
However, in the process of implementing the present invention, the inventors found that the prior art has at least the following problems: optimizing data skew through AQE depends on the accuracy of statistical information and only supports some scenarios, for example, only the scenario where a Stage contains a single Join, and is therefore limited.
Summary of the Invention
Embodiments of the present invention provide a data skew processing method, device, storage medium, and program product, so as to improve the comprehensiveness and accuracy of data skew processing.
In a first aspect, an embodiment of the present invention provides a data skew processing method, including:
querying a data table to be joined through a query node in a logical execution plan to obtain the skewed keys (Key);
modifying the logical execution plan according to the skewed keys and a data skew policy, so that a first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
generating the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
In a possible design, querying the data table to be joined through the query node in the logical execution plan to obtain the skewed keys includes:
for each key of the data table to be joined, comparing the amount of data corresponding to the key with a first preset threshold;
if the amount of data corresponding to the key is greater than the first preset threshold, determining the key as a skewed key.
In a possible design, modifying the logical execution plan according to the skewed keys and the data skew policy includes adding the following processing steps to the logical execution plan:
splitting the data of the data table to be joined according to the skewed keys to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to the non-skewed keys;
merging the joined first sub-table and the joined second sub-table to obtain the final data table.
In a possible design, merging the joined first sub-table and the joined second sub-table includes:
merging the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
In a possible design, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed keys and multiple first sub-tables to be joined, with a one-to-one correspondence between the skewed keys and the first sub-tables to be joined;
performing a Map Join on the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, when Spark executes the physical execution plan includes:
for each skewed key, performing a Map Join on the first sub-table to be joined corresponding to that skewed key when Spark executes the physical execution plan.
In a possible design, there are multiple skewed keys, and there is at least one first sub-table to be joined;
modifying the logical execution plan according to the skewed keys and the data skew policy further includes:
grouping the skewed keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold;
each group corresponds to one first sub-table to be joined;
performing a Map Join on the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, when Spark executes the physical execution plan includes:
for each group, performing a Map Join on the first sub-table to be joined corresponding to the group when Spark executes the physical execution plan.
In a possible design, before querying the data table to be joined through the query node in the logical execution plan, the method further includes:
parsing the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan; parsing the unresolved logical execution plan to obtain the logical execution plan;
adding the query node to the logical execution plan.
In a possible design, generating the physical execution plan according to the modified logical execution plan includes:
updating the modified logical execution plan to obtain an updated logical execution plan;
optimizing the updated logical execution plan to obtain an optimized logical execution plan;
converting the optimized logical execution plan into the physical execution plan.
In a possible design, the second sub-table to be joined performs a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In a second aspect, an embodiment of the present invention provides a data skew processing device, including:
a query module, configured to query the data table to be joined through a query node in a logical execution plan to obtain the skewed keys (Key);
a modification module, configured to modify the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-table to be joined, corresponding to the skewed keys and split from the data table to be joined, performs a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
a generation module, configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
In a possible design, the query module is specifically configured to:
for each key of the data table to be joined, compare the amount of data corresponding to the key with the first preset threshold;
if the amount of data corresponding to the key is greater than the first preset threshold, determine the key as a skewed key.
In a possible design, the modification module is specifically configured to add the following processing steps to the logical execution plan:
splitting the data of the data table to be joined according to the skewed keys to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to the non-skewed keys;
merging the joined first sub-table and the joined second sub-table to obtain the final data table.
In a possible design, the modification module is specifically configured to:
merge the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
In a possible design, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed keys and multiple first sub-tables to be joined, with a one-to-one correspondence between the skewed keys and the first sub-tables to be joined; the modification module is specifically configured to: for each skewed key, perform a Map Join on the first sub-table to be joined corresponding to that skewed key when Spark executes the physical execution plan.
In a possible design, there are multiple skewed keys, and there is at least one first sub-table to be joined; the modification module is further configured to:
group the skewed keys to obtain multiple groups, where the total data volume of each group is less than the second preset threshold, and each group corresponds to one first sub-table to be joined;
for each group, perform a Map Join on the first sub-table to be joined corresponding to the group when Spark executes the physical execution plan.
In a possible design, the device further includes:
a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
a parsing module, configured to parse the unresolved logical execution plan to obtain the logical execution plan;
a creation module, configured to add the query node to the logical execution plan.
In a possible design, the generation module is specifically configured to:
update the modified logical execution plan to obtain an updated logical execution plan;
optimize the updated logical execution plan to obtain an optimized logical execution plan;
convert the optimized logical execution plan into the physical execution plan.
In a possible design, the second sub-table to be joined performs a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In a third aspect, an embodiment of the present invention provides a data skew processing device, including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method described in the first aspect and its various possible designs.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the first aspect and its various possible designs.
In a fifth aspect, an embodiment of the present invention provides a computer program product, including a computer program which, when executed by a processor, implements the method described in the first aspect and its various possible designs.
第六方面,本发明实施例提供了一种运行指令的芯片,所述芯片包括存储器、处理器,所述存储器中存储代码和数据,所述存储器与所述处理器耦合,所述处理器运行所述存储器中的代码使得所述芯片用于执行上述第一方面以及第一方面各种可能的设计所述的方法。In a sixth aspect, an embodiment of the present invention provides a chip for running instructions, the chip includes a memory and a processor, codes and data are stored in the memory, the memory is coupled to the processor, and the processor runs The codes in the memory enable the chip to execute the method described in the above first aspect and various possible designs of the first aspect.
In a seventh aspect, an embodiment of the present invention provides a computer program which, when executed by a processor, performs the method described in the first aspect above and its various possible designs.
In the data skew processing method, device, storage medium, and program product provided by this embodiment, the method queries the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys, and modifies the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-tables to be joined, which are split out of the data tables to be joined and correspond to the skewed keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; the physical execution plan is then generated from the modified logical execution plan and executed by Spark. By modifying the logical execution plan, the data skew processing method provided by this embodiment upgrades the handling of skewed keys from a Reduce Join to a Map Join, which avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a join operation on two data tables, provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a join operation performed by shattering a skewed key, provided by the prior art;
Fig. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention;
Fig. 5 is a schematic flowchart of a join operation on two data tables, provided by another embodiment of the present invention;
Fig. 6 is a directed acyclic graph of a join operation on two data tables, provided by the prior art;
Fig. 7 is a directed acyclic graph of a join operation on two data tables, provided by another embodiment of the present invention;
Fig. 8 is a directed acyclic graph of a join operation on two data tables, provided by yet another embodiment of the present invention;
Fig. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of the hardware structure of a data skew processing device provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In a distributed cluster system, different nodes are responsible for a certain range of data storage or data computation. When a large volume of data is insufficiently dispersed, large amounts of data become concentrated on one or a few service nodes; this is called data skew.
When the distributed computing engine Spark performs a shuffle, it needs to pull the records with the same key on each node to a single task on some node for processing, for example for key-based aggregation or join operations. If the amount of data corresponding to a certain key is particularly large, data skew occurs. For example, if most keys correspond to 10 records but an individual key corresponds to 1 million records, then most tasks are assigned only 10 records and finish in a few seconds, while the individual task assigned the 1 million records may need to run for an hour or two. The overall progress of a Spark job is therefore determined by its longest-running task.
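The effect described above can be sketched with a toy calculation (pure Python, not Spark code; the per-record cost is a hypothetical constant chosen only for illustration):

```python
# Toy model: each task's running time is proportional to the number of
# records it processes; the job finishes only when the slowest task does.

def job_time(records_per_task, cost_per_record=0.001):
    """Job completion time = running time of the slowest task."""
    return max(n * cost_per_record for n in records_per_task)

# 9 tasks with 10 records each, plus 1 task holding a skewed key's
# 1,000,000 records: the single skewed task dominates the whole job.
print(job_time([10] * 9 + [1_000_000], cost_per_record=1))
```

With a unit cost, the nine small tasks would each finish in 10 time units, yet the job takes 1,000,000 — exactly the imbalance the paragraph describes.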
Fig. 1 is a schematic flowchart of a join operation on two data tables, provided by an embodiment of the present invention. As shown in Fig. 1, Table 1-1 holds students' math competition scores: the first column contains student ID numbers and the second column contains the math competition score for each ID. Table 2-1 holds students' English competition scores: the first column contains student ID numbers and the second column contains the English competition score for each ID. It can be seen from the two tables that the student with ID 001 has many score records in Table 1-1. When the two tables are joined, each ID acts as a key, and the amount of data corresponding to Key 001 is obviously large, so processing the task corresponding to Key 001 takes more time than the other keys. Key 001 can therefore be said to suffer data skew. Of course, this is only an illustrative example for understanding data skew more concretely; in practical applications, the data-volume criterion for judging a skewed key can be set as needed.
In the prior art, two approaches are generally used to deal with the above data skew problem.
One approach handles it at the application layer: the skewed key is broken up using techniques such as Rand, i.e., a random suffix is appended to the skewed key so that the originally skewed data is shattered. As shown in Fig. 2, shattering Key 001 in Table 1-1 and Table 2-1 with the Rand technique yields the keys 001-1, 001-2, and 001-3 in Table 1-2 and Table 2-2, and the shattered Tables 1-2 and 2-2 are then joined on the shattered keys. However, this approach, on the one hand, breaks the original business logic and often complicates a simple problem; on the other hand, once a fetch failure (Fetch Failure) occurs and the data must be recomputed — taking Key 001 in Fig. 2 as an example — the shattering must be redone, and a record mapped to 001-1 in one round may become 001-5 in the next, so that the same record is assigned to different data partitions, ultimately causing data duplication.
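The application-layer salting just described can be sketched in pure Python (illustrative only — a real Spark job would express this with a `rand()`-based column; the fan-out of 3 mirrors the suffixes 001-1 to 001-3 of Fig. 2, and all row contents are hypothetical). The side with the skewed key gets a random suffix, and the other side must replicate each skewed-key row once per possible suffix so the join still matches:

```python
import random

SALT_BUCKETS = 3  # illustrative fan-out for the skewed key

def salt_large_side(rows, skewed_keys):
    """Append a random suffix to skewed keys (e.g. '001' -> '001-2')."""
    out = []
    for key, value in rows:
        if key in skewed_keys:
            key = f"{key}-{random.randrange(1, SALT_BUCKETS + 1)}"
        out.append((key, value))
    return out

def salt_small_side(rows, skewed_keys):
    """Replicate each skewed-key row once per suffix so joins still match."""
    out = []
    for key, value in rows:
        if key in skewed_keys:
            out.extend((f"{key}-{i}", value) for i in range(1, SALT_BUCKETS + 1))
        else:
            out.append((key, value))
    return out
```

The drawback the paragraph notes is visible here: `salt_large_side` draws a fresh random suffix on every call, so recomputing after a fetch failure can assign the same record a different suffix and hence a different partition.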
The other approach handles it at the engine kernel layer: the Spark kernel introduces Adaptive Query Execution (AQE), which uses runtime statistics to automatically optimize query execution for the above data skew problem, dynamically detecting the amount of skewed data and splitting skewed partitions into smaller sub-partitions for processing. However, with this approach, on the one hand, AQE strongly depends on runtime statistics, and inaccurate statistics lead to skew being misjudged or missed; on the other hand, AQE only supports the scenario of a single join within one stage — multi-table join scenarios are not supported, and scenarios with a shuffle after the join do not trigger AQE's skew-optimization logic either; furthermore, AQE-based skew handling is by design partition-granularity skew governance and cannot help when the skewed data is produced by the same Mapper.
It can thus be seen that both the application-layer optimization and the AQE-based optimization in the Spark kernel have their drawbacks. Regarding the above technical problems, the inventor found through research that AQE is an optimization of the physical execution plan (Physical Plan). The overall Spark SQL execution pipeline mainly includes a logical execution plan (Logical Plan) and a physical execution plan (Physical Plan), the latter being derived from the former. In other words, if data skew is optimized starting from the logical execution plan, the data skew problem can be solved at its root, avoiding the drawbacks of the application layer and the physical-execution-plan layer. Based on this, an embodiment of the present invention provides a data skew processing method that modifies the logical execution plan to partition the data effectively and upgrades skewed keys from a Reduce Join to a Map Join to improve data processing capability, thereby avoiding at the root the many problems caused by data skew, reducing scenario restrictions, removing the dependence on statistics, and improving the comprehensiveness and accuracy of data skew handling.
The technical solution of the present invention is described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a schematic flowchart of a data skew processing method provided by an embodiment of the present invention. As shown in Fig. 3, the method includes:
301. Query the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys.
In this embodiment, the data tables to be joined include at least two data tables. For example, they may include a first data table and a second data table, which are queried separately through the query node in the logical execution plan; they may also include three or more data tables, such as a first, second, and third data table, which can be queried in turn during the query process. This embodiment does not limit the number of data tables to be joined or the query order.
Optionally, in some embodiments, querying the data tables to be joined through the query node in the logical execution plan to obtain the skewed keys includes: for each key of the data tables to be joined, comparing the amount of data corresponding to the key with a first preset threshold; and if the amount of data corresponding to the key is greater than the first preset threshold, determining the key to be a skewed key.
Specifically, when querying each data table of the data tables to be joined through the query node in the logical execution plan, the query can be performed per key. As shown in Fig. 1, assume Table 1-1 and Table 2-1 are the data tables to be joined. In one implementation, Key 001 in Table 1-1 can be queried first, followed in turn by the remaining keys such as Key 002 and Key 003, and then Key 001 in Table 2-1, followed in turn by the remaining keys such as Key 002 and Key 003. In another implementation, the record counts of the different keys can first be sorted into a sequence, and the keys queried in that order from the largest count to the smallest. This helps find skewed keys as early as possible, and once the count becomes too small to qualify as a skewed key, the query can be terminated early to save computation. Which query method to use can be determined according to actual needs and is not limited in this embodiment.
When each key is queried, the amount of data corresponding to the key can be compared with the first preset threshold. If the amount of data corresponding to the key is greater than the first preset threshold, the key is determined to be a skewed key. The amount of data here may be the size of the data — e.g., how many megabytes or gigabytes — or the number of records. The first preset threshold may be a fixed value determined empirically.
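The per-key threshold check can be sketched in pure Python. The patent only specifies comparing each key's data volume against the first preset threshold; counting records with a hash-based counter is one assumed realization (byte size, the other measure mentioned, would work the same way):

```python
from collections import Counter

def find_skewed_keys(rows, first_threshold):
    """Return the keys whose record count exceeds the first preset threshold.

    `rows` is an iterable of (key, value) pairs; `first_threshold` plays
    the role of the first preset threshold in the text.
    """
    counts = Counter(key for key, _ in rows)
    return {key for key, n in counts.items() if n > first_threshold}
```

Mirroring Fig. 1, a table where Key 001 has five records while the other keys have one each would report only 001 as skewed for a threshold of, say, 3.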
302. Modify the logical execution plan according to the skewed keys and a data skew policy, so that the first sub-tables to be joined — split out of the data tables to be joined and corresponding to the skewed keys — undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
In this embodiment, the logical execution plan is modified based on the skewed keys and the data skew policy, so that the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed keys, can undergo a map-side join when the physical execution plan is subsequently executed by the Spark computing engine.
Both the map-side join (Map Join) and the reduce-side join (Reduce Join) are ways to perform the join of the data tables to be joined, i.e., a merge of data from different data sources. In a Reduce Join, the data is tagged in the Map phase and merged in the Reduce phase. In a Map Join, the data is merged directly in the Map phase, with no Reduce phase.
Specifically, taking Table 1-1 and Table 2-1 shown in Fig. 1 as an example of the Reduce Join process: in the Map phase of the Reduce Join, the input data is uniformly wrapped in a Bean that contains all the common and non-common attributes of Table 1-1 and Table 2-1 — equivalent to a full outer join — plus one new attribute, the file name, to distinguish whether a record comes from Table 1-1 or Table 2-1, which facilitates processing in the Reduce phase; the Key output by the Map is the student ID and the Value is the Bean. In the Shuffle phase, the Beans are sorted by ID, and all records with the same ID are aggregated under the same key and sent to the same Reduce task. In the Reduce phase, for all the Beans under the same ID, the source — Table 1-1 or Table 2-1 — must first be distinguished. If, instead, a Map Join is performed on Table 1-1 and Table 2-1, there is no Reduce phase: all the work is done in the Map phase, which greatly reduces the cost of network transmission and I/O. In a concrete implementation, Table 1-1 or Table 2-1 — say Table 2-1 — can be cached in advance on each Map task node; when the data of Table 1-1 arrives, it is joined directly against the pre-cached data of Table 2-1 and the result is output.
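The map-side join just described — the small table pre-cached on each map task and joined without any shuffle or reduce phase — can be sketched in pure Python (an illustrative stand-in for the Spark mechanism, not Spark code; all table contents are hypothetical):

```python
def map_join(stream_rows, cached_small_table):
    """Map-side join: the small table is pre-cached on each map task as a
    key -> [values] lookup, so joining the streamed-in large table needs
    no shuffle and no reduce phase."""
    return [
        (key, left_val, right_val)
        for key, left_val in stream_rows
        for right_val in cached_small_table.get(key, [])
    ]
```

For example, with the English-score table cached as `{"001": [85, 92]}`, streaming the math-score rows through it emits one joined row per matching pair for Key 001 and drops keys with no match, all within the map phase.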
In this embodiment, taking Table 1-1 and Table 2-1 in Fig. 1 as an example, if the skewed key is 001, the first sub-tables to be joined include a first sub-data-table corresponding to 001 from Table 1-1 and a second sub-data-table corresponding to 001 from Table 2-1. That the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed key, undergo a Map Join when the distributed computing engine Spark executes the physical execution plan may mean that, when Spark executes the physical execution plan, the first sub-data-table and the second sub-data-table are map-joined. Specifically, the first or second sub-data-table — say the second — can be cached in advance on each Map task node; when the data of the first sub-data-table arrives, it is joined directly against the pre-cached data of the second sub-data-table and the result is output.
303. Generate the physical execution plan from the modified logical execution plan, so that the physical execution plan is executed by Spark.
In this embodiment, Spark's execution of the APIs in the code mainly includes the following steps: first, DataFrame/Dataset/SQL code is written; second, if the written code has no errors, Spark converts the code into a logical execution plan; third, Spark applies a series of optimizations to the generated logical execution plan and converts the optimized logical execution plan into a physical execution plan; finally, Spark executes the physical execution plan, i.e., performs a series of operations on Resilient Distributed Datasets (RDDs) on the cluster.
In this embodiment, after the logical execution plan has been modified according to the skewed keys and the data skew policy, the physical execution plan can be generated from the modified logical execution plan. Of course, the logical execution plan to be modified may be the logical execution plan after optimization or before optimization, which is not limited in this embodiment.
In the data skew processing method provided by this embodiment, the data tables to be joined are queried through a query node in the logical execution plan to obtain the skewed keys; the logical execution plan is modified according to the skewed keys and a data skew policy, so that the first sub-tables to be joined, split out of the data tables to be joined and corresponding to the skewed keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan; and the physical execution plan is generated from the modified logical execution plan and executed by Spark. By modifying the logical execution plan, the data skew processing method provided by this embodiment upgrades the handling of skewed keys from a Reduce Join to a Map Join, which avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
Fig. 4 is a schematic flowchart of a data skew processing method provided by another embodiment of the present invention. As shown in Fig. 4, on the basis of the above embodiments — for example the embodiment shown in Fig. 3 — this embodiment describes in detail how the logical execution plan is modified. The method includes:
401. Query the data tables to be joined through a query node in the logical execution plan to obtain the skewed keys.
In this embodiment, step 401 is similar to step 301 in the above embodiment and is not repeated here.
402. Add to the logical execution plan a processing step of splitting the data of the data tables to be joined according to the skewed keys, to obtain the first sub-tables to be joined and the second sub-tables to be joined corresponding to the non-skewed keys, so that the first sub-tables to be joined — split out of the data tables to be joined and corresponding to the skewed keys — undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
403. Add to the logical execution plan a processing step of merging the joined first sub-tables and the joined second sub-tables to obtain the final data table, yielding the modified logical execution plan.
In some embodiments, the data tables to be joined include a first data table and a second data table, and the skewed keys come from the first data table and/or the second data table. That is, any data table among the data tables to be joined may be determined to contain a skewed key. As shown in Fig. 1, only one of the two data tables contains a skewed key, namely Key 001 in Table 1-1. As shown in Fig. 5, both data tables contain skewed keys: the skewed key in Table 1-1 is 001 and the skewed key in Table 4-1 is 003. This embodiment does not limit the source or the number of skewed keys.
To illustrate the modification of the logical execution plan in this embodiment more concretely, the process of joining the data tables to be joined, Table 1-1 and Table 4-1, is described below as an example with reference to Fig. 5.
As shown in Fig. 5, the skewed key in Table 1-1 is 001 and the skewed key in Table 4-1 is 003. Therefore, splitting Table 1-1 yields Table 1-1-1, which contains the data corresponding to the skewed keys 001 and 003, and Table 1-1-2, which contains the non-skewed key 002; splitting Table 4-1 yields Table 4-1-1, which contains the data corresponding to the skewed keys 001 and 003, and Table 4-1-2, which contains the non-skewed key 002. In other words, after the data tables to be joined, Table 1-1 and Table 4-1, are split according to the skewed keys, first sub-tables to be joined consisting of Table 1-1-1 and Table 4-1-1, and second sub-tables to be joined consisting of Table 1-1-2 and Table 4-1-2, are obtained. Table 1-1 and Table 4-1 in Fig. 5 are only illustrative: to explain the table-join process, only three keys, 001 to 003, are shown. In real data tables, the number of keys can reach the tens of thousands or tens of millions — that is, there will be many non-skewed keys while the skewed keys account for a very small fraction. In this case, Table 1-1-1 and Table 4-1-1 among the first sub-tables to be joined both act as small tables on which a Map Join can be performed when Spark executes the physical execution plan, while Table 1-1-2 and Table 4-1-2 among the second sub-tables to be joined both act as large tables on which a Reduce Join can be performed when Spark executes the physical execution plan. For the implementation of the Map Join and the Reduce Join, reference may be made to the description of step 302, which is not repeated here.
After the Map Join of the first sub-tables to be joined, Table 1-1-1 and Table 4-1-1, Table 5-1 is obtained; after the Reduce Join of the second sub-tables to be joined, Table 1-1-2 and Table 4-1-2, Table 5-2 is obtained; and merging Table 5-1 and Table 5-2 yields the final data table, Table 5. In some embodiments, merging the joined first sub-tables and the joined second sub-tables includes: merging the joined first sub-tables and the joined second sub-tables through a Union operator to obtain the final data table.
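The split-join-union plan of Fig. 5 can be sketched as a minimal pure-Python simulation (illustrative only — in Spark the two branches would run as BroadcastHashJoin and SortMergeJoin stages; here both are simulated with in-memory hash lookups, and the table contents are hypothetical):

```python
def skew_aware_join(left, right, skewed_keys):
    """Sketch of the modified plan: split both inputs on the skewed keys,
    map-join the skewed slice, join the rest with an ordinary hash join
    (standing in for the shuffle-based Reduce Join), then union."""
    left_skew = [(k, v) for k, v in left if k in skewed_keys]
    left_rest = [(k, v) for k, v in left if k not in skewed_keys]
    right_skew = [(k, v) for k, v in right if k in skewed_keys]
    right_rest = [(k, v) for k, v in right if k not in skewed_keys]

    # "Map Join" branch: build a lookup over the (small) skewed slice.
    lookup = {}
    for k, v in right_skew:
        lookup.setdefault(k, []).append(v)
    joined_skew = [(k, lv, rv) for k, lv in left_skew
                   for rv in lookup.get(k, [])]

    # "Reduce Join" branch, simulated here with the same hash-join logic.
    lookup = {}
    for k, v in right_rest:
        lookup.setdefault(k, []).append(v)
    joined_rest = [(k, lv, rv) for k, lv in left_rest
                   for rv in lookup.get(k, [])]

    return joined_skew + joined_rest  # the final Union
```

The key correctness property is that the union of the two branches equals a plain join of the full tables — the split changes only how the work is distributed, not the result.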
In some embodiments, the second sub-tables to be joined undergo a reduce-side join (Reduce Join) when Spark executes the physical execution plan. Optionally, the Map Join is a broadcast hash join (BroadcastHashJoin), and the Reduce Join is a sort-merge join (SortMergeJoin).
In practical applications, there are multiple ways for the first sub-tables to be joined to undergo a Map Join. After the logical execution plan has been modified in this embodiment, the ways in which the first sub-tables to be joined undergo a Map Join when Spark executes the physical execution plan are illustrated below with reference to Figs. 7 and 8, and the advantages before and after adopting the data skew processing method provided by this embodiment are compared with reference to Fig. 6.
To ensure that every first to-be-joined sub-table corresponding to a skewed Key can be processed with a Map Join, in some embodiments there are multiple skewed Keys and multiple first to-be-joined sub-tables, with a one-to-one correspondence between the skewed Keys and the first to-be-joined sub-tables. Performing a Map Join on the first to-be-joined sub-tables, split from the to-be-joined data tables and corresponding to the skewed Keys, when Spark executes the physical execution plan then includes: for each skewed Key, performing a Map Join on the first to-be-joined sub-table corresponding to that Key when Spark executes the physical execution plan. As shown in FIG. 8, taking the to-be-joined data tables Table 1-1-1 and Table 4-1-1 in FIG. 5, which include two skewed Keys, as an example, each skewed Key corresponds to one first to-be-joined sub-table. That is, when Table 1-1 is split, two first to-be-joined sub-tables are obtained: one contains only the data for Key001, and the other contains only the data for Key002.
In the subsequent Map Join process, as shown in FIG. 8, Stage1 and Stage2 process one first to-be-joined sub-table, and Stage3 and Stage4 process the other; in both cases the Map Join for the skewed Keys is realized through BroadcastHashJoin. In the directed acyclic graph shown in FIG. 8, Stage5 and Stage6 process the second to-be-joined sub-table; after the processing of Stage5 and Stage6, the Reduce Join for the non-skewed Keys is realized through SortMergeJoin. In Stage7, the two first to-be-joined sub-tables on which BroadcastHashJoin was performed and the second to-be-joined sub-table on which SortMergeJoin was performed are merged with a Union to obtain the final data table.
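The split, per-Key Map Join, Reduce Join, and final Union described above can be sketched in plain Python. This is a hypothetical toy model for illustration only, not Spark's actual implementation; the table contents and Key names are made up.

```python
# Toy model of the split / Map Join / Reduce Join / Union strategy.
# A table is a list of (key, value) rows.

def split_by_skew(rows, skewed_keys):
    """Split a table into one sub-table per skewed Key plus a remainder."""
    skew_tables = {k: [r for r in rows if r[0] == k] for k in skewed_keys}
    rest = [r for r in rows if r[0] not in skewed_keys]
    return skew_tables, rest

def broadcast_hash_join(big, small):
    """Map-side join: build a hash map of the small side, probe with the big side."""
    lookup = {}
    for k, v in small:
        lookup.setdefault(k, []).append(v)
    return [(k, v, w) for k, v in big for w in lookup.get(k, [])]

def sort_merge_join(left, right):
    """Reduce-side join stand-in for the non-skewed Keys (sorted inputs)."""
    left, right = sorted(left), sorted(right)
    out = []
    for k, v in left:
        out.extend((k, v, w) for kk, w in right if kk == k)
    return out

def join_with_skew_handling(left, right, skewed_keys):
    left_skew, left_rest = split_by_skew(left, skewed_keys)
    right_skew, right_rest = split_by_skew(right, skewed_keys)
    result = []
    for k in skewed_keys:  # one Map Join per skewed Key
        result += broadcast_hash_join(left_skew[k], right_skew[k])
    result += sort_merge_join(left_rest, right_rest)  # Reduce Join for the rest
    return result  # the concatenation plays the role of the Union

left = [("001", "a"), ("001", "b"), ("002", "c"), ("004", "d")]
right = [("001", "x"), ("002", "y"), ("004", "z")]
print(sorted(join_with_skew_handling(left, right, ["001", "002"])))
```

The key point the sketch captures is that the skewed rows never pass through the shuffle-based join path: each skewed Key is joined entirely on the map side before the two result sets are unioned.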
To reduce the computational cost of the merge operation and to reduce the number of compute nodes used to process the first to-be-joined sub-tables, in some embodiments there are multiple skewed Keys and at least one first to-be-joined sub-table. Modifying the logical execution plan according to the skewed Keys and the data skew strategy further includes: grouping the skewed Keys to obtain multiple groups, where the total data volume of each group is less than a second preset threshold and each group corresponds to one first to-be-joined sub-table. Performing a Map Join on the first to-be-joined sub-tables, split from the to-be-joined data tables and corresponding to the skewed Keys, when Spark executes the physical execution plan then includes: for each group, performing a Map Join on the first to-be-joined sub-table corresponding to that group when Spark executes the physical execution plan.
Taking the to-be-joined data tables Table 1-1-1 and Table 4-1-1 in FIG. 5, which include two skewed Keys, as an example, the skewed Keys 001 and 003 can be grouped based on the second preset threshold. For example, assuming the sum of the data volumes corresponding to 001 and 003 is less than the second preset threshold, 001 and 003 can be placed in one group as shown in FIG. 5, and the to-be-joined data table has exactly this one group. As shown in FIG. 7, one first to-be-joined sub-table is obtained for this group; after the processing of Stage1 and Stage2, a BroadcastHashJoin is performed on this sub-table to realize its Map Join. In the directed acyclic graph shown in FIG. 7, Stage3 and Stage4 process the second to-be-joined sub-table; after their processing, the Reduce Join for the non-skewed Keys is realized through SortMergeJoin. In Stage5, the first to-be-joined sub-table on which BroadcastHashJoin was performed and the second to-be-joined sub-table on which SortMergeJoin was performed are merged with a Union to obtain the final data table. In this embodiment, the second preset threshold may be set based on experience, which is not limited here.
It can be understood that if the to-be-joined data table contains multiple skewed Keys, for example 100, these 100 skewed Keys can be grouped based on the second preset threshold in various ways. In one implementation, the skewed Keys are sorted by Key number, and the data volume of the first Key in the ordering is compared against the second preset threshold. If it is below the threshold, the combined data volume of the first and second Keys is compared against the threshold; if that is still below the threshold, the combined data volume of the first, second, and third Keys is compared, and so on, until the threshold is exceeded, at which point the Keys before the N-th, i.e., up to the (N-1)-th, are placed into one group, and the same procedure continues from the N-th Key until all Keys in the ordering have been traversed. In another implementation, the skewed Keys may be sorted by the data volume corresponding to each Key and then grouped based on the second preset threshold. This is not limited in this embodiment and can be chosen according to actual needs.
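The first grouping scheme described above (accumulate Keys in order until the running total would reach the threshold, then start a new group) can be sketched as follows. The Key names, data volumes, and threshold are hypothetical illustrations.

```python
def group_skewed_keys(key_sizes, threshold):
    """Greedily group skewed Keys so that each group's total data volume
    stays below the threshold. key_sizes is a list of (key, data_volume)
    pairs, already sorted (e.g. by Key number). A single Key whose volume
    alone reaches the threshold still forms its own group."""
    groups, current, total = [], [], 0
    for key, size in key_sizes:
        if current and total + size >= threshold:
            groups.append(current)  # close the group before the N-th Key
            current, total = [], 0
        current.append(key)
        total += size
    if current:
        groups.append(current)
    return groups

# Two skewed Keys whose combined volume fits under the threshold
# end up in a single group, as in the FIG. 5 example.
print(group_skewed_keys([("001", 40), ("003", 30)], threshold=100))
```

Fewer groups mean fewer broadcast sub-tables and fewer Union inputs, which is exactly the cost reduction the grouping step is aiming at.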
FIG. 6 is a directed acyclic graph of a join operation on two data tables in the prior art. FIG. 7 is a directed acyclic graph of a join operation on two data tables provided by another embodiment of the present invention. FIG. 8 is a directed acyclic graph of a join operation on two data tables provided by yet another embodiment of the present invention. As shown in FIG. 6, Stage1 and Stage2 each include, in sequence, the following steps: Range, Project, Exchange, CustomShuffleReader, and Sort; Stage3 includes SortMergeJoin. As shown in FIG. 7, Stage1 includes the following steps: Range, Filter, Project, Exchange, CustomShuffleReader, and BroadcastExchange; Stage2 includes the following steps: Range, Filter, Project, Exchange, and CustomShuffleReader; Stage3 and Stage4 each include Range, Filter, Project, Exchange, CustomShuffleReader, and Sort; Stage5 includes the following steps: BroadcastHashJoin, SortMergeJoin, Union, and AdaptiveSparkPlan.
The following again uses Table 1-1 and Table 4-1 shown in FIG. 5 as an example to illustrate FIG. 6 to FIG. 8. As shown in FIG. 6, when two data tables are joined, Table 1-1 is usually processed through Stage1 and Table 4-1 through Stage2, after which SortMergeJoin is performed in Stage3 to realize a Reduce Join. As shown in FIG. 7, Table 1-1-2 is processed through Stage3 and Table 4-1-1 through Stage4, after which SortMergeJoin in Stage5 realizes the Reduce Join, yielding Table 5-2; one of Table 1-1-1 and Table 4-1-2 is processed through Stage1 and the other through Stage2, after which BroadcastHashJoin in Stage5 realizes the Map Join, yielding Table 5-1; finally, Table 5-1 and Table 5-2 are combined with a Union, and the adaptive Spark execution plan is executed. Compared with FIG. 7, FIG. 8 splits the skewed data into multiple groups of sub-tables: when there are too many skewed Keys, Table 1-1-1 is further split into multiple next-level sub-tables,
and Table 4-1-2 is correspondingly split into multiple next-level sub-tables. Then, for each pair of next-level sub-tables (a next-level sub-table of Table 1-1-1 together with the corresponding next-level sub-table of Table 4-1-2), the processing of Stage1 and Stage2 is performed followed by BroadcastHashJoin, realizing multiple Map Joins. Finally, through the processing of Stage7, the tables obtained from the multiple BroadcastHashJoins and the table obtained from SortMergeJoin are merged, and the adaptive Spark execution plan is executed.
Before adopting the embodiment of the present invention, that is, with the pre-optimization processing shown in FIG. 6, the Spark task running time was 5.5 minutes; after adopting the embodiment of the present invention, that is, with the post-optimization processing shown in FIG. 7 or FIG. 8, the Spark task running time was 2.1 minutes, an overall performance improvement of about 60%.
404. Generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
Step 404 in this embodiment is similar to step 303 in the foregoing embodiment and is not repeated here.
In the data skew processing method provided by this embodiment of the present invention, the logical execution plan is modified so that the skewed Keys and the non-skewed Keys are split apart. As a result, when Spark executes the physical execution plan, the first to-be-joined sub-tables corresponding to the skewed Keys can be processed with a Map Join, which greatly reduces the running time. This prevents skewed Keys with excessively long running times from degrading the running efficiency of the entire Spark task. Moreover, this approach has no scenario restrictions and solves the data skew problem at its root.
FIG. 9 is a schematic flowchart of a data skew processing method provided by yet another embodiment of the present invention. As shown in FIG. 9, on the basis of the foregoing embodiments, for example the embodiment shown in FIG. 3, this embodiment describes in detail the generation process of the logical execution plan and the process of generating the physical execution plan from the modified logical execution plan. The method includes:
901. Parse the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan.
902. Resolve the unresolved logical execution plan to obtain a logical execution plan.
903. Add the query node to the logical execution plan.
In this embodiment, the logical execution plan is essentially a series of abstract transformations. It involves no executor or driver; it merely converts the user's set of expressions into an optimal version. Specifically, the user's code is first converted into an unresolved logical execution plan (Unresolved Logical Plan). It is called unresolved because it is not necessarily valid: the table names or column names it references may or may not exist. Spark then uses the Catalog, a metadata repository containing all tables and DataFrames, in the Analyzer to resolve and validate the referenced table names and column names. If the unresolved logical execution plan passes validation, a resolved logical execution plan (Resolved Logical Plan) is obtained. A query node (Query Node) is added to the logical execution plan so that the to-be-joined data tables can be queried through it.
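The resolve-against-Catalog step can be illustrated with a minimal sketch. The table and column names below are hypothetical, and Spark's actual Analyzer is of course far more involved than this toy version.

```python
# Toy Analyzer: an unresolved plan references names that must be
# checked against a Catalog before the plan counts as resolved.

catalog = {"orders": ["order_id", "user_id", "amount"]}  # table -> columns

def analyze(unresolved_plan):
    """Validate the referenced table and columns against the Catalog."""
    table, columns = unresolved_plan["table"], unresolved_plan["columns"]
    if table not in catalog:
        raise ValueError(f"unresolved relation: {table}")
    missing = [c for c in columns if c not in catalog[table]]
    if missing:
        raise ValueError(f"unresolved columns: {missing}")
    return {**unresolved_plan, "resolved": True}

plan = analyze({"table": "orders", "columns": ["user_id", "amount"]})
print(plan["resolved"])
```

A plan referencing a table or column absent from the Catalog would raise an error here, which is the toy analogue of an unresolved plan failing validation.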
904. Query the to-be-joined data tables through the query node in the logical execution plan to obtain the skewed Keys.
905. Modify the logical execution plan according to the skewed Keys and the data skew strategy, so that the first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
Steps 904 and 905 in this embodiment are similar to steps 301 and 302 in the foregoing embodiment and are not repeated here.
906. Update the modified logical execution plan to obtain an updated logical execution plan.
907. Optimize the updated logical execution plan to obtain an optimized logical execution plan.
908. Convert the optimized logical execution plan into a physical execution plan.
Specifically, after the logical execution plan is modified, it needs to be updated so that subsequent steps apply the updated plan. The modified logical execution plan is passed to the Catalyst Optimizer and, after a series of optimizations, an optimized logical execution plan is generated. Spark converts this logical execution plan into a physical execution plan, checking feasible optimization strategies along the way. The physical planning phase generates different candidate physical operations and compares them with a cost model to determine how to execute the logical plan on the cluster. Once a physical plan is selected, Spark runs all the code on its underlying programming interface, the RDD. Spark performs further optimizations at runtime, generating native Java bytecode that can optimize tasks or stages during execution, and finally returns the results to the user.
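The pipeline described above (update the plan, optimize it, then convert it to a physical plan) can be modeled as a chain of plan-to-plan transformations. This is a simplified, hypothetical sketch; Catalyst's real rule engine applies many rules to a rich tree of operators, and the node and operator names below are invented for illustration.

```python
# Toy model of the logical-to-physical planning pipeline.
# A plan is a nested tuple; each phase is a function from plan to plan.

def optimize(plan):
    """Stand-in for Catalyst optimization rules, e.g. collapsing two
    adjacent Filter nodes into one (a classic logical rewrite)."""
    if plan[0] == "Filter" and plan[2][0] == "Filter":
        combined = f"({plan[1]}) AND ({plan[2][1]})"
        return optimize(("Filter", combined, plan[2][2]))
    return plan

def to_physical(plan):
    """Stand-in for physical planning: pick a concrete operator per node."""
    mapping = {"Filter": "FilterExec", "Scan": "FileScanExec"}
    op = mapping[plan[0]]
    if plan[0] == "Scan":
        return (op, plan[1])
    return (op, plan[1], to_physical(plan[2]))

logical = ("Filter", "a > 1", ("Filter", "b < 9", ("Scan", "t")))
physical = to_physical(optimize(logical))
print(physical)
```

The modification described in this application slots into this chain as one more plan-to-plan transformation applied before the optimize step.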
In the data skew processing method provided by this embodiment of the present invention, a query node is added before the logical execution plan is optimized, and the queried skewed Keys are split apart from the non-skewed Keys, so that when Spark executes the physical execution plan, the first to-be-joined sub-tables corresponding to the split skewed Keys can be processed with a Map Join, which greatly saves running time. This avoids at the root the many problems caused by data skew, reduces scenario restrictions, removes the dependence on statistics, and improves the comprehensiveness and accuracy of data skew handling.
FIG. 10 is a schematic structural diagram of a data skew processing device provided by an embodiment of the present invention. As shown in FIG. 10, the data skew processing device 100 includes a query module 1001, a modification module 1002, and a generation module 1003.
The query module 1001 is configured to query the to-be-joined data tables through the query node in the logical execution plan to obtain the skewed Keys.
The modification module 1002 is configured to modify the logical execution plan according to the skewed Keys and the data skew strategy, so that the first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes the physical execution plan.
The generation module 1003 is configured to generate the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
The data skew processing device provided by this embodiment of the present invention achieves implementation principles and technical effects similar to those of the foregoing method embodiments.
In a possible design, the query module is specifically configured to:
for each Key of the to-be-joined data tables, compare the data volume corresponding to the Key with a first preset threshold; and
if the data volume corresponding to the Key is greater than the first preset threshold, determine the Key to be a skewed Key.
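The per-Key threshold test applied by the query module can be sketched as follows. This is a hypothetical illustration; in practice the per-Key data volume would be computed by the query node over the to-be-joined tables, and the threshold value is made up.

```python
def find_skewed_keys(rows, first_threshold):
    """Count rows per Key and flag any Key whose data volume
    exceeds the first preset threshold as skewed."""
    counts = {}
    for key, _value in rows:
        counts[key] = counts.get(key, 0) + 1
    return sorted(k for k, n in counts.items() if n > first_threshold)

# Keys 001 and 002 carry far more rows than Key 004, so only they
# exceed the (illustrative) threshold and are reported as skewed.
rows = [("001", None)] * 5 + [("002", None)] * 4 + [("004", None)]
print(find_skewed_keys(rows, first_threshold=3))
```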
In a possible design, the modification module is specifically configured to:
add the following processing steps to the logical execution plan:
split the data of the to-be-joined data tables according to the skewed Keys to obtain the first to-be-joined sub-tables and second to-be-joined sub-tables corresponding to the non-skewed Keys; and
merge the joined first to-be-joined sub-tables and the joined second to-be-joined sub-tables to obtain the final data table.
In a possible design, the modification module is specifically configured to:
merge, through a Union operator, the joined first to-be-joined sub-tables and the joined second to-be-joined sub-tables to obtain the final data table.
In a possible design, the to-be-joined data tables include a first data table and a second data table, and the skewed Keys come from the first data table and/or the second data table.
In a possible design, there are multiple skewed Keys and multiple first to-be-joined sub-tables, with a one-to-one correspondence between the skewed Keys and the first to-be-joined sub-tables;
the modification module is specifically configured to: for each skewed Key, cause the first to-be-joined sub-table corresponding to that Key to undergo a Map Join when Spark executes the physical execution plan.
In a possible design, there are multiple skewed Keys and at least one first to-be-joined sub-table; the modification module is further configured to:
group the skewed Keys to obtain multiple groups, each group corresponding to one first to-be-joined sub-table; and
for each group, cause the first to-be-joined sub-table corresponding to that group to undergo a Map Join when Spark executes the physical execution plan.
In a possible design, the device further includes:
a syntax analysis module, configured to parse the structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan;
a resolution module, configured to resolve the unresolved logical execution plan to obtain a logical execution plan; and
a creation module, configured to add the query node to the logical execution plan.
In a possible design, the generation module is specifically configured to:
update the modified logical execution plan to obtain an updated logical execution plan;
optimize the updated logical execution plan to obtain an optimized logical execution plan; and
convert the optimized logical execution plan into a physical execution plan.
In a possible design, the second to-be-joined sub-table undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
In a possible design, the Map Join is a BroadcastHashJoin, and the Reduce Join is a SortMergeJoin.
The data skew processing device provided by this embodiment of the present invention can be used to execute the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
图11为本发明一实施例提供的数据倾斜处理设备的硬件结构示意图,该设备可以是计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Fig. 11 is a schematic diagram of the hardware structure of a data tilt processing device provided by an embodiment of the present invention. The device may be a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc. .
The device 110 may include one or more of the following components: a processing component 1101, a memory 1102, a power supply component 1103, an input/output (I/O) interface 1104, and a communication component 1106.
The processing component 1101 generally controls the overall operation of the device 110, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 1101 may include one or more processors 1105 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 1101 may include one or more modules to facilitate interaction between the processing component 1101 and other components. For example, the processing component 1101 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 1101.
The memory 1102 is configured to store various types of data to support operation at the device 110. Examples of such data include instructions for any application or method operating on the device 110, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1102 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
The power supply component 1103 provides power to the various components of the device 110. The power supply component 1103 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 110.
The I/O interface 1104 provides an interface between the processing component 1101 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The communication component 1106 is configured to facilitate wired or wireless communication between the device 110 and other devices. The device 110 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1106 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1106 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 110 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above methods.
The present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data skew processing method performed by the above data skew processing device.
The above computer-readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc. The readable storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). Of course, the processor and the readable storage medium may also exist in the device as discrete components.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disk, or optical disc.
An embodiment of the present invention further provides a computer program product including a computer program which, when executed by a processor, implements the data skew processing method performed by the above data skew processing device.
An embodiment of the present invention further provides a chip for running instructions. The chip includes a memory and a processor; code and data are stored in the memory; the memory is coupled to the processor; and the processor runs the code in the memory so that the chip performs the data skew processing method performed by the above data skew processing device.
An embodiment of the present invention further provides a computer program which, when executed by a processor, performs the data skew processing method performed by the above data skew processing device.
Finally, it should be noted that the above embodiments are merely intended to illustrate, rather than limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some or all of the technical features therein, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (27)

  1. A data skew processing method, comprising:
    querying to-be-joined data tables through a query node in a logical execution plan to obtain skewed Keys;
    modifying the logical execution plan according to the skewed Keys and a data skew strategy, so that first to-be-joined sub-tables split from the to-be-joined data tables, which correspond to the skewed Keys, undergo a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan; and
    generating the physical execution plan according to the modified logical execution plan, so that the physical execution plan is executed by Spark.
  2. 根据权利要求1所述的方法,其特征在于,所述通过逻辑执行计划中的查询节点对待连接数据表进行查询,获得数据倾斜的Key,包括:The method according to claim 1, wherein the step of querying the data table to be connected through the query node in the logic execution plan to obtain the Key of data skew includes:
    针对所述待连接数据表的每个Key,将所述Key对应的数据量与第一预设阈值进行对比;For each Key of the data table to be connected, comparing the amount of data corresponding to the Key with a first preset threshold;
    若所述Key对应的数据量大于所述第一预设阈值,则将所述Key确定为数据倾斜的Key。If the amount of data corresponding to the Key is greater than the first preset threshold, the Key is determined as a Key with skewed data.
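Outside Spark, the per-key threshold test of claim 2 reduces to a plain count-and-compare. The sketch below is a minimal single-machine illustration; the sample table, the key accessor, and the threshold value are invented for the example and not taken from the patent:

```python
from collections import Counter

def find_skewed_keys(rows, key_fn, threshold):
    """Return the keys whose row count exceeds `threshold`, i.e. the
    'data-skewed' Keys in the sense of the claim's first preset threshold."""
    counts = Counter(key_fn(row) for row in rows)
    return {key for key, n in counts.items() if n > threshold}

# Hypothetical table of (user_id, event) rows: user 42 dominates.
table = [(42, i) for i in range(1000)] + [(7, 0), (8, 0), (9, 0)]
skewed = find_skewed_keys(table, key_fn=lambda row: row[0], threshold=100)
print(sorted(skewed))  # [42]
```

In a real deployment the per-key volumes would come from the query node's scan of the tables to be joined rather than from an in-memory list.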
  3. The method according to claim 1 or 2, wherein modifying the logical execution plan according to the data-skewed Keys and the data skew strategy comprises:
    adding the following processing steps to the logical execution plan:
    splitting the data of the data tables to be joined according to the data-skewed Keys, to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to non-skewed Keys;
    merging the joined first sub-table and the joined second sub-table to obtain a final data table.
  4. The method according to claim 3, wherein merging the joined first sub-table and the joined second sub-table comprises:
    merging the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
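Claims 3 and 4 amount to: split one table by the skewed Keys, join each part against the other table, then Union the two joined results. A minimal in-memory sketch follows; for simplicity both parts use the same hash join here, whereas in Spark the skewed part would be a map-side join and the rest a reduce-side join, and all table contents are invented for illustration:

```python
def hash_join(left, right):
    """Inner equi-join of two lists of (key, value) pairs on the key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def split_join_union(left, right, skewed_keys):
    """Split `left` into a skewed and a non-skewed sub-table, join each
    against `right`, then merge the two joined results (the Union step)."""
    first = [row for row in left if row[0] in skewed_keys]       # skewed Keys
    second = [row for row in left if row[0] not in skewed_keys]  # other Keys
    return hash_join(first, right) + hash_join(second, right)

left = [(1, "a"), (1, "b"), (2, "c")]
right = [(1, "X"), (2, "Y")]
print(split_join_union(left, right, skewed_keys={1}))
# [(1, 'a', 'X'), (1, 'b', 'X'), (2, 'c', 'Y')]
```

The split changes only how the work is partitioned, not the result: the Union of the two joined sub-tables equals the join of the unsplit tables.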
  5. The method according to claim 3, wherein the data tables to be joined comprise a first data table and a second data table, and the data-skewed Keys come from the first data table and/or the second data table.
  6. The method according to claim 5, wherein there are multiple data-skewed Keys and multiple first sub-tables to be joined, the data-skewed Keys corresponding one-to-one to the first sub-tables;
    wherein the first sub-table corresponding to the data-skewed Keys, split from the data tables to be joined, undergoing a Map Join when Spark executes the physical execution plan comprises:
    for each data-skewed Key, performing a Map Join on the first sub-table corresponding to the data-skewed Key when Spark executes the physical execution plan.
  7. The method according to claim 5, wherein there are multiple data-skewed Keys and at least one first sub-table to be joined;
    wherein modifying the logical execution plan according to the data-skewed Keys and the data skew strategy further comprises:
    grouping the data-skewed Keys to obtain multiple groups, the total data volume of each group being less than a second preset threshold;
    each group corresponding to one first sub-table to be joined;
    and wherein the first sub-table corresponding to the data-skewed Keys, split from the data tables to be joined, undergoing a Map Join when Spark executes the physical execution plan comprises:
    for each group, performing a Map Join on the first sub-table corresponding to the group when Spark executes the physical execution plan.
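The grouping step of claim 7 is a bin-packing problem: collect skewed Keys into groups whose total data volume stays below the second preset threshold. The patent does not prescribe a packing algorithm, so the first-fit-decreasing heuristic and the per-key sizes below are assumptions made for the sketch:

```python
def group_skewed_keys(key_sizes, max_group_size):
    """Pack skewed keys into groups so each group's total data volume is
    less than `max_group_size` (first-fit decreasing). Assumes every
    individual key's volume is itself below the threshold."""
    groups = []  # each entry: [list_of_keys, total_size]
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if group[1] + size < max_group_size:
                group[0].append(key)
                group[1] += size
                break
        else:
            groups.append([[key], size])
    return [group[0] for group in groups]

sizes = {"a": 50, "b": 40, "c": 30, "d": 20}  # hypothetical per-key volumes
print(group_skewed_keys(sizes, max_group_size=100))  # [['a', 'b'], ['c', 'd']]
```

Each resulting group then becomes one first sub-table to be joined, so a single map-side join can handle several moderately skewed Keys instead of one join per Key.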
  8. The method according to any one of claims 1 to 7, further comprising, before querying the data tables to be joined through the query node in the logical execution plan:
    parsing structured query language (SQL) text into a syntax tree to generate an unresolved logical execution plan; resolving the unresolved logical execution plan to obtain the logical execution plan;
    adding the query node to the logical execution plan.
  9. The method according to any one of claims 1 to 8, wherein generating the physical execution plan according to the modified logical execution plan comprises:
    updating the modified logical execution plan to obtain an updated logical execution plan;
    optimizing the updated logical execution plan to obtain an optimized logical execution plan;
    converting the optimized logical execution plan into the physical execution plan.
  10. The method according to any one of claims 3 to 7, wherein the second sub-table to be joined undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
  11. The method according to claim 10, wherein the Map Join is a broadcast hash join (BroadcastHashJoin) and the Reduce Join is a sort merge join (SortMergeJoin).
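The two join flavours named in claims 10 and 11 can be sketched without Spark: a broadcast hash join builds a hash index over the small (broadcast) side and probes it while streaming the other side, avoiding a shuffle of the large table, while a sort merge join sorts both sides by key and merges them with two cursors. Both functions below are simplified single-machine illustrations of the ideas, not Spark's implementations:

```python
def broadcast_hash_join(stream_side, build_side):
    """Map-side join: hash the small side, probe it per streamed row."""
    index = {}
    for key, value in build_side:
        index.setdefault(key, []).append(value)
    return [(k, sv, bv) for k, sv in stream_side for bv in index.get(k, [])]

def sort_merge_join(a, b):
    """Reduce-side join: sort both sides by key, merge with two cursors."""
    a, b = sorted(a), sorted(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:
            key, ai, bj = a[i][0], i, j
            while i < len(a) and a[i][0] == key:
                i += 1
            while j < len(b) and b[j][0] == key:
                j += 1
            # Cross-product of the matching runs on both sides.
            out.extend((key, av, bv) for _, av in a[ai:i] for _, bv in b[bj:j])
    return out

big = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
small = [(2, "X"), (3, "Y")]
assert sorted(broadcast_hash_join(big, small)) == sorted(sort_merge_join(big, small))
```

The trade-off motivating the claims: the broadcast join needs no shuffle but requires one side to fit in memory, so it suits the small per-Key sub-tables carved out for skewed Keys, while the sort merge join scales to two large sides and suits the remaining non-skewed data.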
  12. A data skew processing device, comprising:
    a query module, configured to query, through a query node in a logical execution plan, data tables to be joined, to obtain keys (Keys) exhibiting data skew;
    a modification module, configured to modify the logical execution plan according to the data-skewed Keys and a data skew strategy, so that a first sub-table to be joined, split from the data tables to be joined and corresponding to the data-skewed Keys, undergoes a map-side join (Map Join) when the distributed computing engine Spark executes a physical execution plan;
    a generation module, configured to generate the physical execution plan according to the modified logical execution plan, so as to execute the physical execution plan through Spark.
  13. The device according to claim 12, wherein the query module is specifically configured to:
    for each Key of the data tables to be joined, compare the amount of data corresponding to the Key with a first preset threshold;
    if the amount of data corresponding to the Key is greater than the first preset threshold, determine the Key as a data-skewed Key.
  14. The device according to claim 12 or 13, wherein the modification module is specifically configured to:
    add the following processing steps to the logical execution plan:
    splitting the data of the data tables to be joined according to the data-skewed Keys, to obtain the first sub-table to be joined and a second sub-table to be joined corresponding to non-skewed Keys;
    merging the joined first sub-table and the joined second sub-table to obtain a final data table.
  15. The device according to claim 14, wherein the modification module is specifically configured to:
    merge the joined first sub-table and the joined second sub-table through a Union operator to obtain the final data table.
  16. The device according to claim 14, wherein the data tables to be joined comprise a first data table and a second data table, and the data-skewed Keys come from the first data table and/or the second data table.
  17. The device according to claim 16, wherein there are multiple data-skewed Keys and multiple first sub-tables to be joined, the data-skewed Keys corresponding one-to-one to the first sub-tables;
    the modification module being specifically configured to: for each data-skewed Key, perform a Map Join on the first sub-table corresponding to the data-skewed Key when Spark executes the physical execution plan.
  18. The device according to claim 16, wherein there are multiple data-skewed Keys and at least one first sub-table to be joined;
    the modification module being further configured to:
    group the data-skewed Keys to obtain multiple groups, the total data volume of each group being less than a second preset threshold, each group corresponding to one first sub-table to be joined;
    for each group, perform a Map Join on the first sub-table corresponding to the group when Spark executes the physical execution plan.
  19. The device according to any one of claims 12 to 18, further comprising:
    a syntax analysis module, configured to parse structured query language (SQL) text into a syntax tree and generate an unresolved logical execution plan;
    a resolution module, configured to resolve the unresolved logical execution plan to obtain the logical execution plan;
    a creation module, configured to add the query node to the logical execution plan.
  20. The device according to any one of claims 12 to 19, wherein the generation module is specifically configured to:
    update the modified logical execution plan to obtain an updated logical execution plan;
    optimize the updated logical execution plan to obtain an optimized logical execution plan;
    convert the optimized logical execution plan into the physical execution plan.
  21. The device according to any one of claims 14 to 18, wherein the second sub-table to be joined undergoes a reduce-side join (Reduce Join) when Spark executes the physical execution plan.
  22. The device according to claim 21, wherein the Map Join is a broadcast hash join (BroadcastHashJoin) and the Reduce Join is a sort merge join (SortMergeJoin).
  23. A data skew processing device, comprising: at least one processor and a memory;
    the memory storing computer-executable instructions;
    the at least one processor executing the computer-executable instructions stored in the memory, so that the at least one processor performs the data skew processing method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the data skew processing method according to any one of claims 1 to 11.
  25. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the data skew processing method according to any one of claims 1 to 11.
  26. A chip for running instructions, wherein the chip includes a memory and a processor, the memory stores code and data and is coupled to the processor, and the processor runs the code in the memory so that the chip performs the data skew processing method according to any one of claims 1 to 11.
  27. A computer program which, when executed by a processor, performs the data skew processing method according to any one of claims 1 to 11.
PCT/CN2022/084642 2021-09-27 2022-03-31 Data skew processing method, device, storage medium, and program product WO2023045295A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111139049.1 2021-09-27
CN202111139049.1A CN113821541A (en) 2021-09-27 2021-09-27 Data skew processing method, apparatus, storage medium, and program product

Publications (1)

Publication Number Publication Date
WO2023045295A1 true WO2023045295A1 (en) 2023-03-30

Family

ID=78921369

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084642 WO2023045295A1 (en) 2021-09-27 2022-03-31 Data skew processing method, device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN113821541A (en)
WO (1) WO2023045295A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009094A (en) * 2023-10-07 2023-11-07 联通在线信息科技有限公司 Data oblique scattering method and device, electronic equipment and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data skew processing method, apparatus, storage medium, and program product
CN117149717A (en) * 2023-08-31 2023-12-01 中电云计算技术有限公司 Table connection processing method, apparatus, device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930479A (en) * 2016-04-28 2016-09-07 乐视控股(北京)有限公司 Data skew processing method and apparatus
CN105975463A (en) * 2015-09-25 2016-09-28 武汉安天信息技术有限责任公司 Method and system for identifying and optimizing data skewness based on MapReduce
CN106874322A (en) * 2016-06-27 2017-06-20 阿里巴巴集团控股有限公司 A kind of data table correlation method and device
CN107066612A (en) * 2017-05-05 2017-08-18 郑州云海信息技术有限公司 A kind of self-adapting data oblique regulating method operated based on SparkJoin
CN109299131A (en) * 2018-11-14 2019-02-01 百度在线网络技术(北京)有限公司 A kind of spark querying method that supporting trust computing and system
CN110673794A (en) * 2019-09-18 2020-01-10 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
US10691597B1 (en) * 2019-08-10 2020-06-23 MIFrontiers Corporation Method and system for processing big data
CN113821541A (en) * 2021-09-27 2021-12-21 北京沃东天骏信息技术有限公司 Data skew processing method, apparatus, storage medium, and program product


Also Published As

Publication number Publication date
CN113821541A (en) 2021-12-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871360

Country of ref document: EP

Kind code of ref document: A1