CA3131725C

CA3131725C - Sql optimization method and device, computer equipment and storage medium

Info

Publication number: CA3131725C
Application number: CA3131725A
Authority: CA
Inventors: Shijin Gao
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2020-09-23
Filing date: 2021-09-23
Publication date: 2023-02-07
Anticipated expiration: 2041-09-23
Also published as: CN112434003B; CA3131725A1; CN112434003A

Abstract

Disclosed in the present invention are the SQL optimization method, device, computer apparatus and storage medium. The method comprises parsing SQL statements, identifying primary and secondary tables for the SQL statements; on a target platform, acquiring the record volume of pending SQL scripts corresponding to the primary and secondary tables for executing the SQL statements; classifying the primary and secondary tables according to the record volume pre-set classification rules to obtain classification results; optimizing pending SQL scripts according to pre-set optimization rules and classification results, then obtaining optimized SQL scripts. The method and the device can improve the readability of scripts by analyzing the execution plan of pending SQL scripts, intelligently reassembling and generating optimized SQL scripts, and formatting optimized SQL scripts with standard specifications. On the other hand, by saving large amounts of time and energy of the developers and obtaining high-quality SQL statements, labor costs is greatly reduced.

Description

SQL OPTIMIZATION METHOD AND DEVICE, COMPUTER EQUIPMENT AND
STORAGE MEDIUM
Technical Field [0001] The present invention relates to the field of database technologies, in particular, to a SQL
optimization method, a device, a computer apparatus and a storage medium.
Background

[0002] Currently, the growing amount of experience share and strategies for optimizing HIVE
SQL are available online. However, there lacks an automatic optimization tool for HIVE SQL.

[0003] Due to different levels of comprehension and professionalism by individual developer, the execution plans of HIVE SQL by different develops exhibit the great discrepancy. On the one hand, although many different SQL statements are valid under the same conditions, where some SQL statements may greatly affect the performance. On the other hand, different features of the codes by different developers may greatly affect the readability of the codes.

[0004] Therefore, it is necessary to provide a new SQL optimization method.
Summary

[0005] Accordingly, aiming at the forementioned technical problems, the present invention provides a SQL optimization method, a device, a computer apparatus and a storage medium, to overcome the problems of nonstandard SQL statements in the current techniques leading to compromised performance and less readability of the SQL scripts.

[0006] Aiming at the forementioned one or more technical problems, the present invention provides the associated technical proposals. In particular:

[0007] From the first perspective, a SQL optimization method is provided, wherein comprises:

[0008] parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

Date Recue/Date Received 2021-09-23

[0009] on a target platform, acquiring the record volume of a pending SQL
script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL statements;

[0010] classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and

[0011] optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

[0012] Furthermore, the described primary tables and the described secondary tables are classified according to the described record volume and pre-set classification rules, and the classification result is obtained by:

[0013] determining if the record volume of the described primary tables and the described secondary tables exceeds the pre-set threshold that: where if the described record volume exceeds the described threshold, identifying the described secondary tables as large tables, otherwise, identifying the described secondary tables as direct small tables; and

[0014] using subqueries for duplication removal of tables related to the described secondary tables identified as large tables according to a joining KEY and defining the deduplicated tables as being internally joined to the described secondary tables identified as a large tables with collecting post-join record volume, then determining if the described post-join record volume exceeds the pre-set threshold, where if the described post-join record volume remains less than the pre-set threshold, identifying the described secondary tables as indirect small tables.

[0015] In particular, the described process, wherein the described pending SQL script is optimized according to the described pre-set optimization rule and the described classification result to obtain the optimized SQL script, comprises at least:

[0016] using subqueries for duplication removal of tables related to the described indirect small tables, defining the deduplicated tables as being internally joined to the described indirect small tables, and generating a first temporary table.

Date Recue/Date Received 2021-09-23

[0017] In particular, the described process, wherein the described pending SQL script is optimized according to the described pre-set optimization rule and the described classification result to obtain the optimized SQL script, comprises at least:

[0018] identifying the target large tables scanned repeatedly from the described large tables, acquiring a target key and filtering options according to the parsing result of the described SQL
statements, obtaining the target data from the described target large table according to the described the target key and the described filtering options, generating a second temporary table so as to replace the target large table with the second temporary table for further uses.

[0019] In particular, the described process, wherein the described pending SQL script is optimized according to the described pre-set optimization rule and the described classification result to obtain the optimized SQL script, comprises at least:

[0020] combining the described large tables containing the same granularity and the same joining KEYs as the primary table, with the described secondary temporary table, and generating a first result table.

[0021] In particular, the described process, wherein the described pending SQL script is optimized according to the described pre-set optimization rule and the described classification result to obtain the optimized SQL script, comprises at least:

[0022] combining the described large tables containing the different granularity and the different joining KEYs from the primary table, the described secondary temporary table, and the described first result table, to generate a second result table; and

[0023] joining the described second result table containing the described direct small tables and the described indirect small tables, to generate a target table for the optimized SQL script.

[0024] Furthermore, the forementioned method also comprises:

[0025] acquiring specifications of the described target platform, then adjusting parameters of the pending SQL script according to the described platform parameter and the pre-set optimization rules.

Date Recue/Date Received 2021-09-23

[0026] From the second perspective, a SQL optimization device is provided, wherein comprises:

[0027] a parsing module configured to parse SQL statements and identify primary tables and the secondary tables of the described SQL statements;

[0028] a collecting module configured to acquire the record volume of the pending SQL script corresponding to the described primary tables and the described secondary tables for executing the described SQL statements on the target platform;

[0029] a classifying module configured to classify the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and

[0030] an optimizing module configured to optimize the described pending SQL
script according to the pre-set optimization rules and the described classification result, then obtain the optimized SQL script

[0031] From the third prospective, a computer apparatus is provided, comprising a memory unit, a processor, and computer programs stored in the memory unit being executable on the processor.
Particularly, when the described processor executes the described computer programs, the following procedures are performed:

[0032] parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

[0033] on a target platform, acquiring the record volume of a pending SQL
script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL statements;

[0034] classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and

[0035]
optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

Date Recue/Date Received 2021-09-23

[0036] From the fouth perspective, a readable computer storage medium carrying computer programs is provided. Particularly, when the described computer programs are executed on the described processor, the following procedures are performed:

[0037] parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

[0038] on a target platform, acquiring the record volume of a pending SQL
script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL statements;

[0039] classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and

[0040]
optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

[0041] The technical strategies in the present invention result in the benefits of:

[0042] 1. The SQL optimization method, the device, the computer apparatus and the storage medium in the present invention comprise: parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements; on a target platform, acquiring the record volume of a pending SQL script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL
statements; classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script. On the one hand, the method and the device can improve the readability of the script by analyzing the execution plan of the pending SQL script, intelligently reassembling and generating the optimized SQL script, then formatting the optimized SQL script with the standard output specifications. On the other hand, a large amount of time and energy of developers will be saved by obtaining high-quality professional SQL
statements to greatly reduce the labor cost.
Date Recue/Date Received 2021-09-23

[0043] 2. The SQL optimization method, the device, the computer apparatus and the storage medium in the present invention can improve the readability of the scripts by acquiring specifications of the described target platform, and adjusting the parameters of the pending SQL
script according to the described platform parameters and the pre-set optimization rules.

[0044] Any of the products in the present invention does not necessarily carry all the forementioned benefits.
Brief descriptions of the drawings

[0045] In order to make the technical strategies of the present invention clearer, the accompany drawings for the present invention will be briefly introduced below.
Obviously, the following drawings in the descriptions are only a portion of embodiments of the present invention. Those skilled in the art are able to generate other configurations according to the provided drawings without requiring any creative works.

[0046] Fig. 1 is a flow diagram of the SQL optimization method in an embodiment of the present invention.

[0047] Fig. 2 is a structure diagram of the SQL optimization device in an embodiment of the present invention.

[0048] Fig. 3 is an internal structure diagram of the SQL optimization apparatus in an embodiment of the present invention.
Detailed descriptions

[0049] In order to make the objective, the technical scheme, and the advantages of the present invention clearer, the present invention will be explained further in detail precisely below with references to the accompany drawings. Obviously, the embodiments described below are only a portion of embodiments of the present invention and cannot represent all possible embodiments.
Based on the embodiments in the present invention, the other applications by those skilled in the art without any creative works are falling within the scope of the present invention.

[0050] As mentioned in background information, aiming at the forementioned technical problems, the present invention provides a SQL optimization method. By extracting the platform parameters of the target platform (e.g., HVIE platform) and the most recent executing record volume of the Date Recue/Date Received 2021-09-23 tables involved in the SQL scripts (including the primary tables and the secondary tables), analyzing the execution plan of the pending SQL script, intelligently reassembling and generating the optimized SQL script, adjusting the SET parameter settings of the pending scripts, and formatting the HIVE SQL script with the standard output specifications, the readability of the scripts is improved. In the meanwhile, a large amount of time and energy of the developers are saved to greatly reduce the labor cost.

[0051] Fig. 1 is a flow diagram of the SQL optimization method in an embodiment of the present invention. The mentioned method in Fig. 1 comprises the following procedures:

[0052] 51: parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements.

[0053] In particular, after parsing the original SQL statements, the primary tables and secondary tables associated with the mentioned SQL statements are identified based on the output target table and keywords (e.g., left join, inner join, join, right join, etc.). To clarify, in the embodiment of the present invention, the primary tables imply the core output table, wherein have the same granularity to the target tables. The secondary tables imply the tables used to generate dimension keys. Taking the left join as an example, the secondary table is the table following left join.

[0054] S2: on the target platform, acquiring the record volume of a pending SQL script to be optimized, wherein corresponds to the primary tables and the secondary tables for executing the SQL statements;

[0055]
In particular, in the embodiment of the present invention, the pending SQL
script is corresponding to the SQL statements in step 51. After identifying a target platform of the pending SQL scripts (e.g., HIVE platform), the record volume of the pending SQL
script, corresponding to the primary tables and the secondary tables for executing the SQL statements, is acquired. In detail, the record volume in the most recent execution of each table (including the primary tables and the secondary tables) can be extracted according to log information of the target platform. In terms of the tables (including the primary tables and the secondary tables) not available in the log information, the record volume of the table can be collected with the count (*) method.

[0056] S3: classifying the primary tables and the secondary tables according to the record volume and pre-set classification rules to obtain a classification result.

Date Recue/Date Received 2021-09-23

[0057] In particular, in the embodiment of the present invention, an optimization rule is pre-set according to the record volume. Then the primary tables and the secondary tables are classified according to the record volume and the pre-set classification rules to obtain a classification result.

[0058] S4: optimizing the pending SQL script according to the pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

[0059] In particular, in the embodiment of the present invention, an optimization rule is pre-set according to practical requirements and empirical optimization experiences.
The mentioned optimization rule performs different optimizing procedures for different types of tables from the classification result, then reassemble the scripts based on the optimizing result of different types of tables to generate the optimized SQL scripts.

[0060] In a preferred embodiment of the present invention, the described method of classifying the primary tables and the secondary tables according to the record volume and the pre-set classification rules allows the attainment of the classification result by at least:

[0061]
determining if the record volume of the described primary tables and the described secondary tables exceeds the pre-set threshold that: where if the described record volume exceeds the described threshold, identifying the described secondary tables as large tables, otherwise, identifying the described secondary tables as direct small tables; and

[0062] using subqueries for duplication removal of tables related to the described secondary tables identified as large tables according to the joining KEY and defining the deduplicated tables as being internally joined to the described secondary tables identified as a large tables with collecting post-join record volume, then determining if the described post-join record volume exceeds the pre-set threshold, where if the described post-join record volume remains less than the pre-set threshold, identifying the described secondary tables as indirect small tables.

[0063] In particular, in the embodiment of the present invention, the classification results include, but not limited to, large tables, direct small tables, in-direct small tables, etc. The detailed classification process follows the procedures of:

[0064] after comparing the record volumes of the primary tables and the secondary tables in the forementioned steps and determining if the record volume exceeds the pre-set threshold, Date Recue/Date Received 2021-09-23 classifying the primary tables or secondary tables with the record volume greater than the threshold as large tables (COMMON JOIN), while the primary tables or the secondary tables with the record volume less than the threshold are classified as direct small tables (MAPJOIN). Particularly, in order to reduce the calculation volume in the following steps, in the embodiment of the present invention, first identifying the secondary tables classified as large tables (build table) is necessary.
Subqueries are used for duplication removal of tables related to the described secondary tables identified as large tables according to joining KEYs, and the deduplicated tables are defined as being internally joined to the described secondary tables identified as large tables with collecting post-join record volume. By determining if the described post-join record volume exceeds the pre-set threshold, the described secondary tables are identified as large table if the post-join record volume is still greater than the threshold. Otherwise, the described secondary tables are identified as indirect small tables. To clarify, the pre-set threshold in the embodiments in the present invention can be revised based on practical requirements, and are not limited by the descriptions.

[0065] In a preferred embodiment of the present invention, wherein the pending SQL script is optimized according to the pre-set optimization rule and the classification result, the optimized SQL script is obtained by at least:

[0066] using subqueries for duplication removal of tables related to the described indirect small tables, defining the deduplicated tables as being internally joined to the described indirect small tables, and generating a first temporary table.

[0067] In particular, the temporary tables are introduced herein for processing the join of the small tables. In details, by using subqueries for duplication removal of tables related to the described indirect small tables and defining the deduplicated tables as being internally joined to the described indirect small tables, a first temporary table is generated. By a single-step process, the join of the small tables is performed independently, and parameter settings are optimized properly. The MAPJOIN commands are processed in a combined manner to reduce the repeated combination of the MAPJOIN command, and consequently improve the efficiency.

[0068] In a preferred embodiment of the present invention, wherein the pending SQL script is optimized according to the pre-set optimization rule and the classification result, the optimized SQL script is obtained by at least:

Date Recue/Date Received 2021-09-23

[0069] identifying the target large tables that are scanned repeatedly from the described large tables, acquiring a target key and filtering options according to the parsing result of the described SQL statements, obtaining the target data from the described target large table according to the described the target key and the described filtering options, generating a second temporary table so as to replace the target large table with the second temporary table for further uses..

[0070] In particular, when the pending SQL scripts are optimized according to the pre-set optimization rule and the classification result, for the large tables obtained in the classification result of the forementioned procedures, the existence of the repeatedly scanned tables can be determined in the first place. In detail, by parsing the pending SQL scripts and counting the use times of each table (including the primary tables and the secondary tables), the existence of the repeatedly scanned tables are determined based on the counted use times. To clarify, the steps of parsing the pending SQL scripts and counting the use times of each table can be performed either before S3 or after S3, and are not limited. If the forementioned steps are performed after S3, counting the use times of the primary tables and secondary tables identified as large tables is sufficient.

[0071] The identified repeatedly scanned large tables can be processed with the temporary tables.
In detail, the required keys and filtering options is acquired according to the parsing result of the described SQL statements, noted as the target key. Then, the target data is obtained from the large tables based on the target key and the filtering options synchronously to generate a second temporary table so as to replace the target large table with the second temporary table for further uses. In other words, the table scanning times and the TO pressure are reduced.

[0072] In a preferred embodiment of the present invention, wherein the pending SQL script is optimized according to the pre-set optimization rule and the classification result, the optimized SQL script is obtained by at least:

[0073] combining the described large tables containing the same granularity and the same joining KEYs as the primary table with the described secondary temporary table, and generating a first result table.

[0074] Particularly, in the embodiments in the present invention, the join of the large tables containing the same granularity as the primary tables is processed by combining the large tables Date Recue/Date Received 2021-09-23 containing the same granularity and same joining KEY as the primary tables with the second temporary table synchronously. The subqueries are used to process the table joins and to ensure the completion of joining one MR with the same joining KEY to the core tables.

[0075] In a preferred embodiment of the present invention, wherein the pending SQL script is optimized according to the pre-set optimization rule and the classification result, the optimized SQL script is obtained by at least:

[0076] combining the described large tables containing the different granularity and the different joining KEYs from the primary table, the described secondary temporary table, and the described first result table, to generate a second result table; and

[0077] joining the described second result table containing the described direct small tables and the described indirect small tables, to generate a target table for the optimized SQL script.

[0078] Particularly, in the embodiments in the present invention, the join of the large tables with different granularity from the primary tables is processed by combining the large tables containing the different granularity and different joining KEY from the primary tables with the second temporary table and the first result table synchronously. The same joining KEYs are placed in adjacent positions in the tables. In the process, having data dumping or a large number of null or default values in joining KEYs may possibly happen, and consequently, these data are concentrated in one section. The null or default values of the joining KEYs should be processed and distributed to different data sections for the join process. In the end, the left join of the second result table, the direct small table and the in-direct small tables is performed to generate the target table. The mentioned target table is the target table for the optimized SQL
scripts.

[0079] In a preferred embodiment of the present invention, the described method also includes:

[0080] acquiring parameters of the described target platform, and adjusting the parameters of the pending SQL script according to the described platform parameters and pre-set optimization rules.

[0081] In particular, in the embodiment of the present invention, the optimization process for optimizing parameter settings for the SQL scripts is also included. In detail, first the target platform (e.g., HIVE platform) is connected, followed by extracting the platform parameter information, wherein the platform parameter information includes, but not limited to, the related parameter Date Recue/Date Received 2021-09-23 information such as the number of platform nodes, MAP, REDUCE, etc. Then, the improper SET
parameter settings are adjusted in the parameter settings for optimizing the SQL scripts according to the platform parameter information and sizes of HDFS files. Consequently, the optimization of the parameter settings for the SQL scripts is achieved. The detailed adjustment rules can be revised based on practical requirements and empirical optimization experiences, and are not explained in detail herein.

[0082] In a preferred embodiment of the present invention, the described method also includes:

[0083] the optimization process for data dumping.

[0084]
In particular, the secondary tables classified as large tables containing the different granularity from the primary tables, the distribution of the record volumes of the primary table joining KEYs is measured. The proper solving strategies for data dumping are selected, for example:

[0085] Measure 1: when the joining KEY has a null value or the default value is dumped, the null value or the default value are processed and split with random numbers;

[0086] Measure 2: in the counted distribution record volume of the counted joining KEY, a temporary small table is generated by cropping along the descending sorted rows. The mentioned temporary small table is joined with the direct small table, in-direct small table, while the rest records are used to generate additional temporary large table. The large tables in various stages are joined based on recognition of same or different granularity of the large table to the primary table using the joining KEYs.

[0087] In a preferred embodiment of the present invention, wherein the pending SQL script is optimized according to the pre-set optimization rule and the classification result, the optimized SQL script is obtained by at least:

[0088] refining the key annotations according to the source tables and source information of the target key; and according to the standard layout format of the tool settings, reorganizing the layout of the HIVE SQL scripts to improve the readability of the resultant codes.

[0089] Fig. 2 is a structure diagram of the SQL optimization device in an embodiment of the present invention, wherein comprises:

Date Recue/Date Received 2021-09-23

[0090] a parsing module configured to parse SQL statements and identify primary tables and the secondary tables of the described SQL statements;

[0091] a collecting module configured to acquire the record volume of the pending SQL script corresponding to the described primary tables and the described secondary tables for executing the described SQL statements on the target platform;

[0092] a classifying module configured to classify the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and

[0093] an optimizing module configured to optimize the described pending SQL
script according to the pre-set optimization rules and the described classification result, then obtain the optimized SQL script

[0094] In a preferred embodiment of the present invention, the described classification module includes:

[0095]
the first classification unit, configured for determining if the record volume of the described primary tables and the described secondary tables exceeds the pre-set threshold that:
where if the described record volume exceeds the described threshold, the described secondary tables are identified as large tables, otherwise, the described secondary tables are identified as direct small tables.

[0096] the second classification unit, using subqueries for duplication removal of tables related to the described secondary tables identified as large tables according to joining KEY, and defining the deduplicated tables as being internally joined to the described secondary tables identified as a large tables with collecting post-join record volume, then determining if the described post-join record volume exceeds the pre-set threshold, where if the described post-join record volume remains less than the pre-set threshold the described secondary tables are identified as indirect small tables.

[0097] In a preferred embodiment of the present invention, the described optimization module is configured for:

Date Recue/Date Received 2021-09-23

[0098] using subqueries for duplication removal of tables related to the described indirect small tables, defining the deduplicated tables as being internally joined to the described indirect small tables, and generating the first temporary table.

[0099] In a preferred embodiment of the present invention, the described optimization module is configured for:

[0100] identifying the target large tables scanned repeatedly from the described large tables, acquiring a target key and filtering options according to the parsing result of the described SQL
statements, obtaining the target data from the described target large table according to the described the target key and the described filtering options, generating a second temporary table so as to replace the target large table with the second temporary table for further uses.

[0101] In a preferred embodiment of the present invention, the described optimization module is configured for:

[0102] combining the described large tables containing the same granularity and the same joining KEYs as the primary table with the described secondary temporary table, and generating a first result table.

[0103] In a preferred embodiment of the present invention, the described optimization module is configured for:

[0104] combining the described large tables containing the different granularity and the different joining KEYs from the primary table, the described secondary temporary table, and the described first result table, to generate a second result table; and

[0105] joining the described second result table containing the described direct small tables and the described indirect small tables, to generate a target table for the optimized SQL script.

[0106] In a preferred embodiment of the present invention, the described device also includes:

[0107] an adjusting module, configured for acquiring parameters of the described target platform, and adjusting the parameters of the pending SQL script according to the described platform parameters and the pre-set optimization rules.

Date Recue/Date Received 2021-09-23

[0108]
Fig. 3 is an internal structure diagram of the SQL optimization apparatus in an embodiment of the present invention. The computer apparatus shown in Fig. 3 comprises a processor, a memory unit, a network connection port wherein the described components are connected by system bus control. The processor of the mentioned computer apparatus is configured for providing calculation and control. The memory unit of the computer apparatus includes a nonvolatile storage medium and an internal memory. The operating system, computer programs, and databases are stored in the nonvolatile storage medium. The internal memory provides the operation environment for the execution of the operating system and the computer programs stored in the nonvolatile storage medium. The network connection port of the computer apparatus is configured for communication with the external terminals via network connection.
The operation of the described computer apparatus by the processor permits an optimization method for execution plans.

[0109] It is comprehensible for those skilled in the art that the structure shown in Fig. 3 represents only a portion of structure associated with the applications of the present invention. The computer apparatus associated with the applications of the present invention are not restricted or limited by the described structure. An exact computer apparatus may include more components or less components than that is shown in Fig. 3, possibly with combinations of some components or different component layouts.

[0110] In a preferred embodiment of the present invention, a computer apparatus is provided, comprising a memory unit, a processor, and computer programs stored in the memory unit and executable on the processor. When the processor executes the computer programs, the following steps are performed:

[0111] parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

[0112] on a target platform, acquiring the record volume of a pending SQL
script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL statements;

[0113] classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and Date Recue/Date Received 2021-09-23

[0114] optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

[0115] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0116] determining if the record volume of the described primary tables and the described secondary tables exceeds the pre-set threshold that: where if the described record volume exceeds the described threshold, identifying the described secondary tables as large tables, otherwise, identifying the described secondary tables as direct small tables; and

[0117] using subqueries for duplication removal of tables related to the described secondary tables identified as large tables according to a joining KEY and defining the deduplicated tables as being internally joined to the described secondary tables identified as a large tables with collecting post-join record volume, then determining if the described post-join record volume exceeds the pre-set threshold, where if the described post-join record volume remains less than the pre-set threshold, identifying the described secondary tables as indirect small tables.

[0118] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0119] using subqueries for duplication removal of tables related to the described indirect small tables, defining the deduplicated tables as being internally joined to the described indirect small tables, and generating the first temporary table.

[0120] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0121] identifying the target large tables that are scanned repeatedly from the described large tables, acquiring the target key and filtering options according to the parsing result of the described SQL statements, obtaining the target data from the described target large table according to the described the target key and the described filtering options, generating the second temporary table, and replacing the described target large table with the described second temporary table wherein will be explained in detail in the following sections.

Date Recue/Date Received 2021-09-23

[0122] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0123] combining the described large tables containing the same granularity and the same joining KEYs as the primary table with the described secondary temporary table, and generating the first result table.

[0124] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0125] combining the described large tables containing the different granularity and the different joining KEYs from the primary table, the described secondary temporary table, and the described first result table, to generate the second result table; and

[0126] joining the described second result table containing the described direct small tables and the described indirect small tables, to generate the target table for the optimized SQL script.

[0127] In a preferred embodiment of the present invention, when the processor executes the computer programs, the following steps are also performed:

[0128] acquiring parameters of the described target platform, and adjusting the parameters of the pending SQL script according to the described platform parameters and the pre-set optimization rules.

[0129] In an embodiment of the present invention, a computer readable storage media is provided, on wherein the computer program is stored. When the execution of the computer programs on the processor, the following steps are performed:

[0130] parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

[0131] on a target platform, acquiring the record volume of a pending SQL
script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL statements;

[0132] classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and Date Recue/Date Received 2021-09-23

[0133] optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

[0134] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

[0135] determining if the record volume of the described primary tables and the described secondary tables exceeds the pre-set threshold that: where if the described record volume exceeds the described threshold, identifying the described secondary tables as large tables, otherwise, identifying the described secondary tables as direct small tables; and

[0136] using subqueries for duplication removal of tables related to the described secondary tables identified as large tables according to a joining KEY and defining the deduplicated tables as being internally joined to the described secondary tables identified as a large tables with collecting post-join record volume, then determining if the described post-join record volume exceeds the pre-set threshold, where if the described post-join record volume remains less than the pre-set threshold, identifying the described secondary tables as indirect small tables.

[0137] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

[0138] using subqueries for duplication removal of tables related to the described indirect small tables, defining the deduplicated tables as being internally joined to the described indirect small tables, and generating a first temporary table.

[0139] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

[0140] identifying the target large tables scanned repeatedly from the described large tables, acquiring a target key and filtering options according to the parsing result of the described SQL
statements, obtaining the target data from the described target large table according to the described the target key and the described filtering options, generating a second temporary table so as to replace the target large table with the second temporary table for further uses.

[0141] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

Date Recue/Date Received 2021-09-23

[0142] combining the described large tables containing the same granularity and the same joining KEYs as the primary table with the described secondary temporary table, and generating a first result table.

[0143] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

[0144] combining the described large tables containing the different granularity and the different joining KEYs from the primary table, the described secondary temporary table, and the described first result table, to generate a second result table; and

[0145] joining the described second result table containing the described direct small tables and the described indirect small tables, to generate the target table for the optimized SQL script.

[0146] In a preferred embodiment of the present invention, when the computer programs are executed on the processor, the following steps are also performed:

[0147] acquiring parameters of the described target platform, and adjusting the parameters of the pending SQL script according to the described platform parameters and the pre-set optimization rules.

[0148] In summary, the technical strategies in the present invention result in the benefits of:

[0149] 1. The SQL optimization method, the device, the computer apparatus and the storage medium in the present invention comprise: parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements; on a target platform, acquiring the record volume of a pending SQL script to be optimized, wherein the record volume corresponds to the described primary tables and the described secondary tables for executing the described SQL
statements; classifying the described primary tables and the described secondary tables according to the described record volume and pre-set classification rules to obtain a classification result; and optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script. On the one hand, the method and the device can improve the readability of the script by analyzing the execution plan of the pending SQL script, intelligently reassembling and generating the optimized SQL script, then formatting the optimized SQL script with the standard output specifications. On the other hand, a large amount of time and Date Recue/Date Received 2021-09-23 energy of developers will be saved by obtaining high-quality professional SQL
statements to greatly reduce the labor cost.

[0150] 2. The SQL optimization method, the device, the computer apparatus and the storage medium in the present invention can improve the readability of the scripts by acquiring specifications of the described target platform, and adjusting the parameters of the pending SQL
script according to the described platform parameters and the pre-set optimization rules.

[0151]
To clarify, when the SQL optimization device provided in the forementioned embodiments performs an optimization service, only the configurations of each described module are explained as examples. In practical applications, the described functions can be assigned to different functional modules according to practical requirements. In other words, the internal structure of the device can be configured with different functional modules to perform all or the portions of the described functions. Furthermore, the SQL optimization device and the SQL
optimization method provided in the forementioned embodiments have the same conceptual design, implying that the described device is constructed based on the described SQL
optimization method.
The detailed specifications of the optimization process may be referred to the forementioned specifications of the SQL optimization method, and the details are not described herein.

[0152] Those skilled in the art can understand and achieve the forementioned all or portions of the procedures via hardware or via programs sending commands to the related hardware, wherein the described programs can be stored in a computer readable storage medium.
The described storage medium may be read-only memory, magnetic disks, CDs, etc.

[0153] The forementioned contents of preferred embodiments of the present invention, and shall not limit the applications of the present invention. Therefore, all alternations, modifications, equivalence, improvements of the present invention fall within the scope of the present invention.
Date Recue/Date Received 2021-09-23

Claims

Claims:

1. An SQL optimization method comprising:
parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;
on a target platfomi, acquiring a record volume of a pending SQL script to be optimized, wherein the record volume corresponds to the primary tables and the secondary tables for executing the SQL statements;
classifying the primary tables and the secondary tables according to the record volume and pre-set classification rules to obtain a classification result; and optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

2. The method of claim 1, further includes:
detemiining if the record volume of the primary tables and the secondary tables exceeds a pre-set threshold; and where if the record volume exceeds the pre-set threshold, identifying the secondary tables as large tables, otherwise, identifying the secondary tables as direct small tables;
using subqueries for duplication removal of tables related to the secondary tables identified as large tables according to joining KEY, and defining a deduplicated tables as being internally joined to the secondary tables identified as large tables with collecting post-join record volume;
detemiining if the post-join record volume exceeds the pre-set threshold; and where if the post-join record volume remains less than the pre-set threshold, identifying Date Recue/Date Received 2022-07-05 the secondary tables as indirect small tables.

3. The method of claim 2, further includes:
using subqueries for duplication removal of tables related to the indirect small tables, defining the deduplicated tables as being internally joined to the indirect small tables, and generating a first temporary table.

4. The method of claim 2, further includes:
identifying the target large tables that are scanned repeatedly from the large tables, acquiring a target key and filtering options according to a parsing result of SQL statements;
obtaining the target data from the target large table according to the target key and the filtering options, generating the second temporary table so as to replace the target large table with the second temporary table for further uses.

5. The method of claim 2, further includes:
combining the large tables containing the same granularity and the same joining KEYs as a primary table with a secondary temporary table, and generating a first result table.

6. The method of claim 5, further includes:
combining the large tables containing different granularity and different joining KEYs from the primary table, the secondary temporary table, and the first result table, to generate a second result table; and joining the second result table containing direct small tables and the indirect small tables, to generate a target table for an optimized SQL script.

7. The method of any one of claims 1 to 6, further includes:
acquiring parameters of the target platform, and adjusting the parameters of the pending SQL script according to the parameters of the target platform and the pre-set optimization rule.

Date Recue/Date Received 2022-07-05

8. The method of any one of claims 1 to 7, wherein the primary tables have the same granularity to target tables.

9. The method of any one of claims 1 to 8, wherein the secondary tables are a plurality of tables used to generate dimension keys.

10. The method of any one of claims 1 to 9, wherein the record volume is extracted in a most recent execution of each primary table according to log information of the target platform.

11. The method of claim 10, wherein the record volume is extracted in the most recent execution of each secondary table according to log information of the target platform.

12. The method of any one of claims 1 to 11, wherein the classification rules are pre-set according to the record volume.

13. The method of any one of claims 1 to 12, wherein the optimization rule is pre-set according to practical requirements and empirical optimization experiences.

14. The method of any one of claims 1 to 13, wherein the classification results include large tables.

15. The method of any one of claims 1 to 14, wherein the classification results include direct small tables.

16. The method of any one of claims 1 to 15, wherein the classification results include the indirect small tables.

17. The method of any one of claims 1 to 16, wherein the target platform includes a HIVE platform.

Date Recue/Date Received 2022-07-05

18. The method of claim 7, wherein the parameter of the target platform includes a number of platform nodes.

19. A SQL optimization device, comprising:
a parsing module configured to parse SQL statements, and identifying primary tables and secondary tables related to the SQL statements;
a collecting module configured to, on a target platform, acquire a record volume of a pending SQL script to be optimized, wherein the record volume corresponds to the primary tables and the secondary tables for executing the SQL statements;
a classifying module configured to classify the primary tables and the secondary tables according to the record volume and pre-set classification rules to obtain a classification result;
and an optimizing module optimize the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

20. The device of claim 19, further includes a first classification unit.

21. The device of claim 20, wherein the first classification unit is configured to:
detemiine if the record volume of the primary tables and the secondary tables exceeds a pre-set threshold; and where if the record volume exceeds the pre-set threshold, identifying the secondary tables as large tables, otherwise, identify the secondary tables as direct small tables.

22. The device of any one of claims 19 to 21, further includes a second classification unit.

Date Recue/Date Received 2022-07-05

23. The device of claim 22, wherein the second classification unit is configured to:
use subqueries for duplication removal of tables related to the secondary tables identified as large tables according to joining KEY, and define a deduplicated tables as being internally joined to the secondary tables identified as large tables with collecting post-join record volume;
deteimine if the post-join record volume exceeds the pre-set threshold; and where if the post-join record volume remains less than the pre-set threshold, identifying the secondary tables as indirect small tables.

24. The device of claim 23, wherein the optimization module is further configured to:
use subqueries for duplication removal of tables related to the indirect small tables, define the deduplicated tables as being internally joined to the indirect small tables, and generate a first temporary table.

25. The device of any one of claims 19 to 24, wherein the optimization module is further configured to:
identify the target large tables that are scanned repeatedly from the large tables, acquire a target key and filtering options according to a parsing result of SQL
statements;
obtain the target data from the target large table according to the target key and the filtering options, generate the second temporary table so as to replace the target large table with the second temporary table for further uses.

26. The device of any one of claims 19 to 25, wherein the optimization module is further configured to:
combine the large tables containing the same granularity and the same joining KEYs as a Date Recue/Date Received 2022-07-05 primary table with a secondary temporary table, and generate a first result table.

27. The device of any one of claims 19 to 26, wherein the optimization module is further configured to:
combine the large tables containing different granularity and different joining KEYs from the primary table, the secondary temporary table, and the first result table, to generate a second result table; and join the second result table containing direct small tables and the indirect small tables, to generate a target table for an optimized SQL script.

28. The device of any one of claims 19 to 27, further includes an adjusting module.

29. The device of claim 28, wherein the adjusting module is further configured to:
acquire parameters of the target platform, and adjust the parameters of a pending SQL
script according to the parameters of the target platform and a pre-set optimization rule.

30. The device of any one of claims 19 to 29, wherein the primary tables have the same granularity to target tables.

31. The device of any one of claims 19 to 29, wherein the secondary tables are a plurality of tables used to generate dimension keys.

32. The device of any one of claims 19 to 31, wherein the record volume is extracted in a most recent execution of each primary table according to log informration of the target platform.

Date Recue/Date Received 2022-07-05

33. The device of claim 32, wherein the record volume is extracted in the most recent execution of each secondary table according to log information of the target platform.

34. The device of any one of claims 19 to 33, wherein the classification rules are pre-set according to the record volume.

35. The device of any one of claims 19 to 34, wherein the optimization rule is pre-set according to practical requirements and empirical optimization experiences.

36. The device of any one of claims 19 to 35, wherein the classification results include large tables.

37. The device of any one of claims 19 to 36, wherein the classification results include direct small tables.

38. The device of any one of claims 19 to 37, wherein the classification results include the indirect small tables.

39. The device of any one of claims 19 to 38, wherein the target platform includes a HIVE platform.

40. The device of claim 29, wherein the parameter of the target platform includes a number of platform nodes.

41. A SQL optimization system, comprising:
a memory unit for storing processing data;
a processor for executing computer programs, wherein the computer programs including:
parsing SQL statements, and identifying primary tables and secondary tables related to the SQL statements;

Date Recue/Date Received 2022-07-05 on a target platfomi, acquiring a record volume of a pending SQL script to be optimized, wherein the record volume corresponds to the primary tables and the secondary tables for executing the SQL statements;
classifying the primary tables and the secondary tables according to the record volume and pre-set classification rules to obtain a classification result; and optimizing the pending SQL script according to a pre-set optimization rule and the classification result, then obtaining the optimized SQL script.

42. The system of claim 41, wherein the computer programs further include:
detemiining if the record volume of the primary tables and the secondary tables exceeds a pre-set threshold; and where if the record volume exceeds the pre-set threshold, identifying the secondary tables as large tables, otherwise, identifying the secondary tables as direct small tables;
using subqueries for duplication removal of tables related to the secondary tables identified as large tables according to joining KEY, and defining a deduplicated tables as being internally joined to the secondary tables identified as large tables with collecting post-join record volume;
detemiining if the post-join record volume exceeds the pre-set threshold; and where if the post-join record volume remains less than the pre-set threshold, identifying the secondary tables as indirect small tables.

43. The system of claim 41, wherein the computer programs further include:
using subqueries for duplication removal of tables related to the indirect small tables, Date Recue/Date Received 2022-07-05 defining the deduplicated tables as being internally joined to the indirect small tables, and generating a first temporary table.

44. The system of claim 41, wherein the computer programs further include:
identifying the target large tables that are scanned repeatedly from the large tables, acquiring a target key and filtering options according to a parsing result of SQL statements;
obtaining the target data from the target large table according to the target key and the filtering options, generating the second temporary table so as to replace the target large table with the second temporary table for further uses.

45. The system of claim 41, wherein the computer programs further include:
combining the large tables containing the same granularity and the same joining KEYs as a primary table with a secondary temporary table, and generating a first result table.

46. The system of claim 45, wherein the computer programs further include:
combining the large tables containing different granularity and different joining KEYs from the primary table, the secondary temporary table, and the first result table, to generate a second result table; and joining the second result table containing direct small tables and the indirect small tables, to generate a target table for an optimized SQL script.

47. The system of any one of claims 41 to 46, wherein the computer programs further include:
acquiring parameters of the target platform, and adjusting the parameters of the pending SQL script according to the parameters of the target platform and the pre-set optimization rule.

48. The system of any one of claims 41 to 47, wherein the primary tables have the same granularity to target tables.

Date Recue/Date Received 2022-07-05

49. The system of any one of claims 41 to 48, wherein the secondary tables are a plurality of tables used to generate dimension keys.

50. The system of any one of claims 41 to 49, wherein the record volume is extracted in a most recent execution of each primary table according to log informiation of the target platform.

51. The system of claim 50, wherein the record volume is extracted in the most recent execution of each secondary table according to log information of the target platform.

52. The system of any one of claims 41 to 51, wherein the classification rules are pre-set according to the record volume.

53. The system of any one of claims 41 to 52, wherein the optimization rule is pre-set according to practical requirements and empirical optimization experiences.

54. The system of any one of claims 41 to 53, wherein the classification results include large tables.

55. The system of any one of claims 41 to 54, wherein the classification results include direct small tables.

56. The system of any one of claims 41 to 55, wherein the classification results include the indirect small tables.

57. The system of any one of claims 41 to 56, wherein the target platform includes a HIVE platform.

58. The system of claim 47, wherein the parameter of the target platform includes a number of platform nodes.
Date Recue/Date Received 2022-07-05