CN117827881A - Spark SQL Shuffle task number optimizing system based on historical information - Google Patents

Spark SQL Shuffle task number optimizing system based on historical information

Info

Publication number
CN117827881A
Authority
CN
China
Prior art keywords
sql
module
hbo
information
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410013742.1A
Other languages
Chinese (zh)
Inventor
曹俊亮
赵智峰
龙怡霖
王晓东
常毅
夏军生
王勇强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fiberhome Telecommunication Technologies Co ltd
Original Assignee
Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fiberhome Telecommunication Technologies Co ltd filed Critical Nanjing Fiberhome Telecommunication Technologies Co ltd
Priority to CN202410013742.1A
Publication of CN117827881A
Legal status: Pending (Current)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Spark SQL Shuffle task number optimizing system based on historical information, relating to the fields of big data, databases and machine learning. The system comprises an SQL historical operation information extraction module, an SQL historical operation information pre-analysis module, an SQL similarity measurement module, an HBO parameter calculation module and an HBO parameter recommendation service module. A recommendation model based on historical information is introduced into the Spark SQL engine: by analyzing the Shuffle operation information of historical SQL and applying a machine learning algorithm, tuning parameters are calculated, the current SQL is guided to run more efficiently and robustly, a task number is recommended for each Shuffle stage, the Shuffle task number is set dynamically and adaptively, and a series of problems caused by a static Shuffle task number is avoided.

Description

Spark SQL Shuffle task number optimizing system based on historical information
Technical Field
The invention belongs to the technical fields of big data, databases and machine learning, and particularly relates to a Spark SQL Shuffle task number optimization system based on historical information.
Background
Distributed computing platforms make it convenient to process massive data efficiently, and Spark is widely used in industry by virtue of its memory-based computing. Spark SQL offers SQL-like syntax as a high-level data manipulation API, making it easy for practitioners familiar with relational database management systems to use the Spark computing engine.
Data shuffling (Shuffle) is an indispensable process in Spark and is triggered when an SQL statement contains operators such as group by, join or partition by. The number of Shuffle tasks is one of the important parameters affecting Spark SQL performance; its setting directly affects the SQL execution success rate, the number of output files and the utilization of cluster resources. Currently this parameter takes effect mainly in two ways: (1) a general static value is set when the Spark application is started; (2) on a resident Spark application, a hint carried in a comment is used to set a static value when a single SQL is submitted, which takes effect only for that SQL.
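By way of illustration only, a minimal PySpark sketch of these two existing approaches is shown below; spark.sql.shuffle.partitions is Spark's standard configuration key for the Shuffle task number, while the value 400 and the example table name are purely illustrative assumptions.
# Minimal sketch of the two static ways of fixing the Shuffle task number.
from pyspark.sql import SparkSession

# (1) A cluster-wide static value, supplied when the Spark application starts.
spark = (SparkSession.builder
         .appName("static-shuffle-partitions")
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())

# (2) On a resident application, override the value for SQL issued afterwards in
#     this session (comparable to the per-SQL comment hint described above).
spark.sql("SET spark.sql.shuffle.partitions=400")
spark.sql("SELECT dept, count(*) FROM employees GROUP BY dept").show()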
A static Shuffle task number cannot suit every SQL, nor can the partition number within a single SQL job be adjusted adaptively according to the data volume of each Shuffle stage. When the number of Shuffle tasks exceeds the actual demand, tasks are scheduled too frequently, computing resources are wasted, a large number of small files and the random read/write operations on them are generated, the IOPS load on the data nodes becomes excessive, and other SQL requests cannot be answered in time. When the number of Shuffle tasks is below the actual demand, the data volume handled by a single task is too large, concurrency drops, tasks run too long, and the computing nodes run out of memory (OOM).
Disclosure of Invention
In view of the deficiencies of the background art, the technical problem to be solved by the invention is to provide a Spark SQL Shuffle task number optimizing system based on historical information. A recommendation model based on historical information is introduced into the Spark SQL engine: by analyzing the Shuffle operation information of historical SQL in combination with a machine learning algorithm, tuning parameters are calculated, the current SQL is guided to run more efficiently and robustly, a task number is recommended for each Shuffle stage, the Shuffle task number is set dynamically and adaptively, and a series of problems caused by a static Shuffle task number is avoided.
The invention adopts the following technical scheme for solving the technical problems:
a Spark SQL Shuffle task number optimizing system based on historical information, which is characterized in that: the system comprises an SQL historical operation information extraction module, an SQL historical operation information pre-analysis module, an SQL similarity measurement module, an HBO parameter calculation module and an HBO parameter recommendation service module;
the SQL historical operation information extraction module is used for collecting and extracting relevant SQL operation information and metrics from the Spark event log and providing input data for the SQL historical operation information pre-analysis module and the SQL similarity measurement module;
the SQL historical operation information pre-analysis module is used for counting and analyzing the various operation metrics of historical SQL, taking the Stage-level operation information within the historical operation information as the main analysis object, and, after processing, forming task number calculation reference information at the granularity of the Stages corresponding to each SQL, which is used to calculate task numbers;
the SQL similarity measurement module is used for extracting SQL statements and ASTs from the historical SQL operation information, converting the SQL statements into feature vectors through attribute extraction and feature extraction, performing similarity measurement, and generating SQL cluster information and an SQL cluster recognition model, which are used for task number calculation and for the task number recommendation service respectively;
the HBO parameter calculation module is used for calculating the task number of each Shuffle-type Stage in each SQL cluster, using the Stage-level reference information output by the SQL historical operation information pre-analysis module and the SQL cluster information output by the SQL similarity measurement module, and forming a knowledge base for the parameter recommendation service;
the HBO parameter recommendation service module is used for providing a task number recommendation service for the SQL Core module; when a new SQL runs, the SQL Core module calls the task number recommendation service and passes in the SQL statement, the AST and other parameters, and the parameter recommendation service module recommends, in real time, task number parameters for the Shuffle-type Stages of the new SQL according to the task number knowledge base and the SQL cluster recognition model generated by the upstream modules, providing guidance for optimizing the running of the new SQL.
As a further preferable scheme of the Spark SQL Shuffle task number optimizing system based on the historical information, the SQL historical operation information extraction module is realized as follows:
the Spark event log is read to extract SQL historical operation information; the events related to SQL operation are extracted and stored in sub-tables by event type, and event groups that have a start/end relationship are associated to form information detail tables at each granularity.
As a further preferable scheme of the Spark SQL Shuffle task number optimizing system based on the historical information, the SQL similarity measuring module is realized as follows:
The valid Task detail table and the SQL and AST data recorded by the HBO parameter recommendation service module in the previous cycle are used as input; SQL involving Shuffle activity is selected, the attribute information of each SQL's AST syntax tree is extracted, the extracted text is converted into a numerical vector by a text feature extraction algorithm to serve as the main feature, feature construction is then completed by combining it with the structural features of the AST syntax tree, and SQL similarity measurement is performed after the features are normalized; a highly connected subgraph clustering algorithm is used to screen and extract SQL clusters, with each sample as a vertex and the similarity between samples as the edge weight, and all highly connected subgraphs are output as SQL cluster information based on graph connectivity, to serve as the input of the HBO parameter calculation module; the labeled samples are trained with a nearest neighbor algorithm to produce an SQL cluster recognition model, which is then used in the HBO parameter recommendation service module to recognize the SQL category.
As a further preferable scheme of the Spark SQL Shuffle task number optimizing system based on the historical information, the HBO parameter calculation module combines the Stage-level data table with the SQL cluster information, associating them to obtain the knowledge base data used by the HBO parameter recommendation service.
As a further preferable scheme of the Spark SQL Shuffle task number optimizing system based on the historical information, the HBO parameter recommending service module comprises the following implementation processes:
loading an HBO parameter recommendation knowledge base output by the HBO parameter calculation module and an SQL cluster recognition model output by the SQL similarity measurement module during initialization, and performing SQL cluster recognition and HBO parameter dynamic calculation on the new SQL;
after receiving an HBO parameter recommendation request, the service sequentially performs feature extraction, SQL cluster recognition and dynamic HBO parameter calculation on the input SQL and AST to obtain the HBO recommendation parameters, which are then merged into the physical plan to take effect.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. According to the historical execution of SQL, the invention adaptively allocates the task number of each Shuffle stage, which makes more reasonable use of CPU resources, improves the utilization of computing resources, avoids the frequent task scheduling and the random read/write operations on large numbers of small files caused by fragments that are too small, and lightens the IOPS load on the data nodes; it also avoids the reduced concurrency and overlong task running time caused by fragments that are too large, and prevents computing nodes from running out of memory (OOM);
2. The invention optimizes how the number of Shuffle tasks is set, reducing the dependence of this parameter setting on manual experience and the influence of that experience;
3. The invention improves the utilization of Spark cluster computing resources while increasing the SQL task throughput of the Spark cluster, so that more SQL tasks can be carried.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the inter-module architecture of the HBO scheme of the invention;
FIG. 2 is the main flow of updating the HBO parameter recommendation knowledge base of the invention;
FIG. 3 is the main flow of the HBO parameter recommendation service of the invention.
Description of the embodiments
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
the following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The objects and effects of the present invention will become more apparent from the following detailed description of the preferred embodiments and the accompanying drawings, it being understood that the specific embodiments described herein are merely illustrative of the invention and not limiting thereof.
Interpretation of the terms
The related concepts of the present invention are as follows:
(1) HBO, history Based Optimization, history-based optimization;
(2) IOPS, input/Output Operations Per Second, the number of read/write operations per second is a measurement for testing the performance of a computer storage device (e.g., a hard disk);
(3) OOM, out Of Memory, beyond Memory limits;
(4) AST, abstract Syntax Tree, abstract syntax tree;
(5) SQL, structured Query Language, a programming language for managing and operating relational databases;
(6) Spark, apache Spark, fast general purpose computing engines designed specifically for large-scale data processing;
(7) A process of mixing data distributed at different nodes by Spark according to a certain rule;
(8) Stage, computing Stage, wherein the task operation process of Spark is composed of a series of Stage;
(9) Task, calculating Task, spark executing minimum calculation unit of data operation, stage in Spark is formed from a group of Task;
(10) The Shuffle Stage contains Stage of Shuffle operation;
(11) The SQL Core module is a built-in module in the Spark for analyzing and processing SQL tasks.
In enterprise-level applications, Spark SQL generally provides query and analysis services over massive data to an upper-layer big data management platform in the form of a resident application. Many big data management platforms can therefore generate SQL automatically through simple drag-and-drop operations against solidified templates, and the similarity of business scenarios means the resulting SQL statements are highly similar. In addition, the accumulation of business data is periodic, so an optimization workflow that draws on historical information is highly feasible.
The aim of the invention is to introduce a recommendation model based on historical information into the Spark SQL engine: by analyzing the Shuffle operation information of historical SQL in combination with a machine learning algorithm, tuning parameters are calculated, the current SQL is guided to run more efficiently and robustly, a task number is recommended for each Shuffle stage, the Shuffle task number is set dynamically and adaptively, and a series of problems caused by a static Shuffle task number is avoided.
The HBO task number optimization scheme divides modules according to functional content; the hierarchical architecture of the module layers is shown in FIG. 1.
The SQL historical operation information extraction module is responsible for collecting and extracting relevant SQL operation information and metrics from the event logs of Spark SQL jobs and providing input data for the SQL historical operation information pre-analysis module and the SQL similarity measurement module.
The SQL historical operation information pre-analysis module counts and analyzes the various operation metrics of historical SQL, takes the Stage-level operation information within the historical operation information as the main analysis object, and, after processing, forms task number calculation reference information at the granularity of the Stages corresponding to each SQL, which is then used to calculate task numbers.
The SQL similarity measurement module extracts SQL statements and ASTs from the historical SQL operation information, converts the SQL statements into feature vectors through attribute extraction and feature extraction, performs similarity measurement, and generates SQL cluster information and an SQL cluster recognition model, used for task number calculation and for the task number recommendation service respectively.
The HBO parameter calculation module calculates the task number of the relevant Shuffle Stages in each SQL cluster, using the SQL Stage-level reference information output by the SQL historical operation information pre-analysis module and the SQL cluster information output by the SQL similarity measurement module, and forms a knowledge base for the parameter recommendation service.
The HBO parameter recommendation service module provides a task number recommendation service for the SQL Core. When a new SQL runs, the SQL Core calls the task number recommendation service and passes in the SQL statement, the AST and other parameters; the parameter recommendation service module then recommends, in real time, task number parameters for the Shuffle Stages of the new SQL according to the task number knowledge base and the SQL cluster recognition model generated by the upstream modules, providing guidance for optimizing the running of the new SQL.
In order to improve the quality of task number recommendation, the historical-information knowledge base and the SQL cluster recognition model need to be iterated over a growing body of history. This iterative updating is performed as a periodic task: a periodic task scheduling module updates the historical SQL operation information, the task number knowledge base and the SQL cluster recognition model used by each sub-module.
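As a rough illustration only, such a periodic refresh could be driven by a simple scheduler loop; the daily interval and the function name below are placeholders for the pipeline of FIG. 2, not part of the invention.
# Minimal sketch of a periodic update loop for the HBO artifacts.
import time

ONE_DAY_SECONDS = 24 * 3600

def refresh_hbo_artifacts():
    # Placeholder: extract history -> pre-analyse -> measure similarity -> compute parameters.
    ...

while True:
    refresh_hbo_artifacts()
    time.sleep(ONE_DAY_SECONDS)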
The invention completes the whole flow of SQL historical information extraction, SQL historical information pre-analysis, SQL similarity measurement, HBO parameter calculation and HBO parameter recommendation service through the HBO parameter recommendation knowledge base updating module and the HBO parameter recommendation service module.
The implementation process of the HBO parameter recommendation knowledge base update module is shown in FIG. 2:
The main flow of updating the HBO parameter recommendation knowledge base involves the SQL historical information extraction module, the SQL historical information pre-analysis module, the SQL similarity measurement module and the HBO parameter calculation module. The SQL historical information extraction module reads the Spark event log to extract SQL historical operation information: it extracts the events related to SQL operation, stores them in sub-tables by event type, and associates event groups that have a start/end relationship to form information detail tables at each granularity.
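By way of illustration, a minimal Python sketch of this extraction step is given below, assuming an uncompressed Spark 3.x event log with one JSON object per line; field names such as "Stage Info" and "Number of Tasks" should be checked against the log format of the Spark version actually used, and the record layout is a simplifying assumption.
# Minimal sketch: read a Spark event log, split events into per-type "sub-tables",
# and associate submitted/completed Stage events into one detail record per Stage.
import json
from collections import defaultdict

def extract_events(event_log_path):
    events_by_type = defaultdict(list)
    with open(event_log_path) as f:
        for line in f:
            event = json.loads(line)
            events_by_type[event["Event"]].append(event)
    return events_by_type

def build_stage_details(events_by_type):
    submitted = {e["Stage Info"]["Stage ID"]: e
                 for e in events_by_type.get("SparkListenerStageSubmitted", [])}
    details = []
    for e in events_by_type.get("SparkListenerStageCompleted", []):
        info = e["Stage Info"]
        details.append({
            "stage_id": info["Stage ID"],
            "num_tasks": info.get("Number of Tasks"),
            "has_submit_event": info["Stage ID"] in submitted,
        })
    return details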
The SQL historical information pre-analysis module performs cross-granularity association of the per-granularity information detail tables to form a Task-level wide detail table. Records are then filtered according to whether the SQL to which each Task belongs ran normally: the resulting valid Task detail table is used as input of the SQL similarity measurement module, and the filtered data is aggregated by Stage to compute data volume and metrics in other dimensions, forming a Stage-level data table used as input of the HBO parameter calculation module.
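A minimal pandas sketch of this pre-analysis step might look as follows; all column names and the join keys are illustrative assumptions rather than the invention's actual schema.
# Minimal sketch: join task/stage/SQL tables, keep valid Tasks, aggregate by Stage.
import pandas as pd

def pre_analyse(task_df: pd.DataFrame, stage_df: pd.DataFrame, sql_df: pd.DataFrame):
    # Cross-granularity association into a Task-level wide detail table.
    wide = (task_df
            .merge(stage_df, on="stage_id", how="left")
            .merge(sql_df, on="execution_id", how="left"))
    # Keep only Tasks whose SQL ran normally (the "valid Task detail table").
    valid = wide[wide["sql_status"] == "SUCCEEDED"]
    # Aggregate by Stage to obtain per-Stage data volume and other metrics.
    stage_level = (valid
                   .groupby(["execution_id", "stage_id"], as_index=False)
                   .agg(shuffle_read_bytes=("shuffle_read_bytes", "sum"),
                        shuffle_write_bytes=("shuffle_write_bytes", "sum"),
                        task_count=("task_id", "count"),
                        avg_task_time_ms=("duration_ms", "mean")))
    return valid, stage_level  # inputs for the similarity and HBO calculation modules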
The SQL similarity measurement module takes as input the valid Task detail table and the SQL and AST data recorded by the HBO parameter recommendation service module in the previous cycle. It selects the SQL that involves Shuffle activity, extracts the attribute information of each SQL's AST syntax tree, converts the extracted text into a numerical vector with a text feature extraction algorithm to serve as the main feature, combines it with the structural features of the AST syntax tree to complete feature construction, and performs SQL similarity measurement after the features are normalized. A highly connected subgraph clustering algorithm is then used to screen and extract SQL clusters: each sample is a vertex, the similarity between samples is the edge weight, and all highly connected subgraphs are output as SQL cluster information based on graph connectivity, to serve as input of the HBO parameter calculation module. Finally, the labeled samples are trained with a nearest neighbor algorithm to produce an SQL cluster recognition model, which is then used in the HBO parameter recommendation service module to recognize the category of an SQL.
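A minimal Python sketch of this step is shown below, assuming scikit-learn and networkx are available; the similarity threshold, the two AST structural features used, and the replacement of the full highly-connected-subgraph algorithm by thresholded connected components are simplifying assumptions, not the patent's exact procedure.
# Minimal sketch: text + AST features, similarity graph, clusters, and a kNN recogniser.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def build_features(sql_texts, asts):
    # Main feature: text feature extraction over the SQL / AST attribute strings.
    tfidf = TfidfVectorizer(analyzer="word", token_pattern=r"[A-Za-z_][A-Za-z0-9_.]*")
    text_vecs = tfidf.fit_transform(sql_texts).toarray()
    # Structural features of the AST tree (node count and depth are illustrative).
    struct = np.array([[ast["node_count"], ast["depth"]] for ast in asts], dtype=float)
    feats = np.hstack([text_vecs, StandardScaler().fit_transform(struct)])
    return feats, tfidf

def cluster_sql(feats, threshold=0.9):
    sim = cosine_similarity(feats)               # similarity values become edge weights
    g = nx.Graph()
    g.add_nodes_from(range(len(feats)))
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            if sim[i, j] >= threshold:
                g.add_edge(i, j, weight=sim[i, j])
    # Highly connected groups approximated here by connected components.
    return [sorted(c) for c in nx.connected_components(g)]

def train_cluster_model(feats, cluster_labels, k=3):
    model = KNeighborsClassifier(n_neighbors=k)  # nearest-neighbor recogniser
    model.fit(feats, cluster_labels)
    return model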
The HBO parameter calculation module combines the Stage-level data table with the SQL cluster information, associating them to obtain the knowledge base data used by the HBO parameter recommendation service.
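A minimal pandas sketch of this knowledge-base computation follows; the 128 MB target volume per task and the use of the per-cluster median shuffle volume are illustrative assumptions, not values prescribed by the invention.
# Minimal sketch: join Stage metrics with cluster membership, derive task numbers.
import pandas as pd

TARGET_BYTES_PER_TASK = 128 * 1024 * 1024

def build_knowledge_base(stage_level: pd.DataFrame, cluster_info: pd.DataFrame) -> pd.DataFrame:
    # Associate Stage-level metrics with the SQL cluster each execution belongs to.
    joined = stage_level.merge(cluster_info, on="execution_id", how="inner")
    shuffle_stages = joined[joined["shuffle_write_bytes"] > 0]
    kb = (shuffle_stages
          .groupby(["cluster_id", "stage_order"], as_index=False)
          .agg(median_shuffle_bytes=("shuffle_write_bytes", "median")))
    # Size the recommended task number so each task handles roughly the target volume.
    kb["recommended_tasks"] = (kb["median_shuffle_bytes"] / TARGET_BYTES_PER_TASK) \
        .clip(lower=1).round().astype(int)
    return kb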
The implementation process of the HBO parameter recommendation service module is shown in FIG. 3:
During initialization, the HBO parameter recommendation service loads the HBO parameter recommendation knowledge base output by the HBO parameter calculation module and the SQL cluster recognition model output by the SQL similarity measurement module, which are used for SQL cluster recognition and dynamic HBO parameter calculation on new SQL. After receiving an HBO parameter recommendation request, the service sequentially performs feature extraction, SQL cluster recognition and dynamic HBO parameter calculation on the input SQL and AST to obtain the HBO recommendation parameters, which are then merged into the physical plan to take effect.
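A minimal sketch of this serving step, reusing the artifacts built above, is given below; extract_features() is a placeholder standing in for the same feature construction used during training, and the cluster_id / stage_order / recommended_tasks column names are illustrative assumptions.
# Minimal sketch: identify the SQL cluster, then look up per-Shuffle-Stage task numbers.
def recommend_shuffle_tasks(sql_text, ast, cluster_model, knowledge_base):
    feats = extract_features(sql_text, ast)               # placeholder feature builder
    cluster_id = int(cluster_model.predict([feats])[0])   # SQL cluster recognition
    rows = knowledge_base[knowledge_base["cluster_id"] == cluster_id]
    if rows.empty:
        return {}  # unseen SQL shape: caller keeps the static default task number
    # One recommended task number per Shuffle Stage, keyed by its order in the plan.
    return dict(zip(rows["stage_order"], rows["recommended_tasks"]))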
In the process of converting the original logical plan into the physical plan, a new execution plan is updated according to the current execution results each time a Stage finishes executing, and the sequence of Stages within the job is computed by the runtime Shuffle task number dynamic matching algorithm for SQL.
Based on the correspondence between each Stage in this sequence and its task number, each Shuffle Stage task number recommended by HBO can be associated with the corresponding Shuffle Stage, and the Shuffle task number of each Shuffle Stage can be adjusted, thereby realizing dynamic adjustment of the Shuffle task number.
The pseudo code form of the SQL run-time Shuffle task number dynamic matching algorithm is as follows:
targetStagePointer ← 0
newStages ← createQueryStages(currentPhysicalPlan)
initialOffset ← newStages.length
(finishedStageCount, activeStageCount, pendingStageCount) ← getCurrentSQLStatus()
targetStagePointer ← initialOffset + finishedStageCount + activeStageCount - pendingStageCount
hboRepartition(newStages, targetStagePointer)
while ( !allStagesMaterialized(newStages) ) do
    (finishedStageCount, activeStageCount, pendingStageCountStart) ← getCurrentSQLStatus()
    for stage in newStages do
        if ( !materialized(stage) )
            materialize(stage)
        end if
    end for
    (finishedStageCount, activeStageCount, pendingStageCountEnd) ← getCurrentSQLStatus()
    pendingStageCount ← pendingStageCountEnd - pendingStageCountStart
    targetStagePointer ← initialOffset + finishedStageCount + activeStageCount - pendingStageCount
    hboRepartition(newStages, targetStagePointer)
end while
Meanwhile, to allow the HBO parameter recommendation model to be iterated and optimized periodically, the incoming SQL and AST records need to be saved; and to verify and evaluate the accuracy and effect of HBO parameter recommendation, the HBO parameter recommendation results also need to be recorded.
Applying the above scheme, nearly one week of Spark SQL logs from a service cluster were selected for experimental analysis; the test results are as follows:
In terms of task count, the total number of tasks decreased by 9.09% and the total number of Shuffle tasks decreased by 92.94%. In terms of task runtime, the total runtime decreased by 24.25% on average and the Shuffle task runtime decreased by 66.94% on average. In terms of spill volume, memory spill decreased by 57.32% on average, disk spill decreased by 16.81% on average, and the probability of spilling dropped from 75.56% to 33.33%. In terms of output files, the total number of output files decreased by 90%.
It will be appreciated by persons skilled in the art that the foregoing describes preferred embodiments of the invention and is not intended to limit the invention to the specific embodiments described; those skilled in the art may modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All technical features of the embodiments may be freely combined according to actual needs within the scope of the invention.
Finally, it should be noted that the foregoing description is only illustrative of preferred embodiments of the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the described embodiments or substitute equivalents for elements thereof; any modifications, equivalents, improvements or changes made without departing from the spirit and principles of the invention fall within its scope of protection.

Claims (6)

1. A Spark SQL Shuffle task number optimizing system based on historical information, which is characterized in that: the system comprises an SQL historical operation information extraction module, an SQL historical operation information pre-analysis module, an SQL similarity measurement module, an HBO parameter calculation module and an HBO parameter recommendation service module;
the SQL historical operation information extraction module is used for collecting and extracting relevant SQL operation information and metrics from the Spark event log and providing input data for the SQL historical operation information pre-analysis module and the SQL similarity measurement module;
the SQL historical operation information pre-analysis module is used for counting and analyzing the various operation metrics of historical SQL, taking the Stage-level operation information within the historical operation information as the main analysis object, and, after processing, forming task number calculation reference information at the granularity of the Stages corresponding to each SQL, which is used to calculate task numbers;
the SQL similarity measurement module is used for extracting SQL statements and ASTs from the historical SQL operation information, converting the SQL statements into feature vectors through attribute extraction and feature extraction, performing similarity measurement, and generating SQL cluster information and an SQL cluster recognition model, which are used for task number calculation and for the task number recommendation service respectively;
the HBO parameter calculation module is used for calculating the task number of each Shuffle-type Stage in each SQL cluster, using the Stage-level reference information output by the SQL historical operation information pre-analysis module and the SQL cluster information output by the SQL similarity measurement module, and forming a knowledge base for the parameter recommendation service;
the HBO parameter recommendation service module is used for providing a task number recommendation service for the SQL Core module; when a new SQL runs, the SQL Core module calls the task number recommendation service and passes in the SQL statement, the AST and other parameters, and the parameter recommendation service module recommends, in real time, task number parameters for the Shuffle-type Stages of the new SQL according to the task number knowledge base and the SQL cluster recognition model generated by the upstream modules, providing guidance for optimizing the running of the new SQL.
2. The Spark SQL Shuffle task number optimizing system based on historical information of claim 1, wherein: the SQL historical operation information extraction module is realized as follows:
the Spark event log is read to extract SQL historical operation information; the events related to SQL operation are extracted and stored in sub-tables by event type, and event groups that have a start/end relationship are associated to form information detail tables at each granularity.
3. The Spark SQL Shuffle task number optimizing system based on historical information of claim 1, wherein: the SQL historical operation information pre-analysis module is realized as follows:
cross-granularity association is performed on the per-granularity information detail tables to form a Task-level wide detail table; records are filtered according to whether the SQL to which each Task belongs ran normally, the resulting valid Task detail table is used as input of the SQL similarity measurement module, and the filtered data is aggregated by Stage to compute data volume and metrics in other dimensions, forming a Stage-level data table used as input of the HBO parameter calculation module.
4. The Spark SQL Shuffle task number optimizing system based on historical information of claim 1, wherein: the SQL similarity measurement module is realized as follows:
the valid Task detail table and the SQL and AST data recorded by the HBO parameter recommendation service module in the previous cycle are used as input; SQL involving Shuffle activity is selected, the attribute information of each SQL's AST syntax tree is extracted, the extracted text is converted into a numerical vector by a text feature extraction algorithm to serve as the main feature, feature construction is then completed by combining it with the structural features of the AST syntax tree, and SQL similarity measurement is performed after the features are normalized;
a highly connected subgraph clustering algorithm is used to screen and extract SQL clusters, with each sample as a vertex and the similarity between samples as the edge weight, and all highly connected subgraphs are output as SQL cluster information based on graph connectivity, to serve as the input of the HBO parameter calculation module;
the labeled samples are trained with a nearest neighbor algorithm to produce an SQL cluster recognition model, which is then used in the HBO parameter recommendation service module to recognize the SQL category.
5. The Spark SQL Shuffle task number optimizing system based on historical information of claim 1, wherein: the HBO parameter calculation module combines the Stage-level data table with the SQL cluster information, associating them to obtain the knowledge base data used by the HBO parameter recommendation service.
6. The Spark SQL Shuffle task number optimizing system based on historical information of claim 1, wherein: the HBO parameter recommendation service module comprises the following implementation processes:
loading an HBO parameter recommendation knowledge base output by the HBO parameter calculation module and an SQL cluster recognition model output by the SQL similarity measurement module during initialization, and performing SQL cluster recognition and HBO parameter dynamic calculation on the new SQL;
after receiving an HBO parameter recommendation request, the service sequentially performs feature extraction, SQL cluster recognition and dynamic HBO parameter calculation on the input SQL and AST to obtain the HBO recommendation parameters, which are then merged into the physical plan to take effect.
CN202410013742.1A 2024-01-04 2024-01-04 Spark SQL Shuffle task number optimizing system based on historical information Pending CN117827881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410013742.1A CN117827881A (en) 2024-01-04 2024-01-04 Spark SQL Shuffle task number optimizing system based on historical information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410013742.1A CN117827881A (en) 2024-01-04 2024-01-04 Spark SQL Shuffle task number optimizing system based on historical information

Publications (1)

Publication Number Publication Date
CN117827881A true CN117827881A (en) 2024-04-05

Family

ID=90504029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410013742.1A Pending CN117827881A (en) 2024-01-04 2024-01-04 Spark SQL Shuffle task number optimizing system based on historical information

Country Status (1)

Country Link
CN (1) CN117827881A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination