
CN112307050B - Identification method and device for repeated correlation calculation and computer system

Info

Publication number: CN112307050B
Application number: CN202010973509.XA
Authority: CN
Inventors: 丁庆晏; 徐伟
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2020-09-16
Filing date: 2020-09-16
Publication date: 2022-11-15
Anticipated expiration: 2040-09-16
Also published as: CN112307050A; CA3130988A1

Abstract

The application discloses a method, an apparatus, and a computer system for identifying repeated association calculation. The method comprises: obtaining a first SQL statement and a second SQL statement to be identified; parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement; parsing the second SQL statement and identifying a second association query included in the second SQL statement; and, when the first association query and the second association query contain a repeated association calculation, determining that the first SQL statement and the second SQL statement contain a repeated association calculation. Identifying whether a plurality of SQL statements include repeated association calculation allows those statements to be optimized and adjusted subsequently, which improves the operating efficiency of the data platform.

Description

Identification method and device for repeated correlation calculation and computer system

Technical Field

The present invention relates to the field of data processing, and in particular to a method, an apparatus, and a computer system for identifying repeated association calculation.

Background

In data processing scenarios such as big data offline tasks, a large number of SQL statements need to be processed. While these SQL statements are executed, the same association calculation on the same two data tables is often performed repeatedly. Such repeated association calculation wastes a large amount of computing and storage resources, seriously affects the operating efficiency of the data platform, and increases its operating cost. A method capable of identifying the repeated association calculations included in a plurality of SQL statements is therefore needed.

Disclosure of Invention

To overcome the shortcomings of the prior art, the main object of the present invention is to provide a method, an apparatus, and a computer system for identifying the repeated association calculation included in SQL statements.

To achieve the above object, a first aspect of the present invention provides a method for identifying repeated association calculation, the method including:

acquiring a first SQL statement and a second SQL statement to be identified;

parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement;

parsing the second SQL statement and identifying a second association query included in the second SQL statement;

and, when the first association query and the second association query contain a repeated association calculation, determining that the first SQL statement and the second SQL statement contain a repeated association calculation.

In some embodiments, an association calculation includes corresponding data tables and an association relation keyword, where the association relation keyword describes the association calculation to be performed between the data tables, and the parsing the first SQL statement and identifying a first association query included in the first SQL statement includes:

parsing the first SQL statement, and identifying a first association relation keyword contained in the first SQL statement and a first data table and a second data table corresponding to the first association relation keyword;

and determining a first association calculation included in the first association query according to the first data table, the second data table, and the first association relation keyword.
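As a minimal sketch of how one association calculation could be represented, the following Python record holds a keyword and its two data tables. The class name `AssociationCalc`, the helper `make_calc`, and the decision to treat the table pair as unordered are illustrative assumptions, not details stated in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationCalc:
    """One association calculation: a keyword plus the two tables it relates."""
    keyword: str
    left_table: str
    right_table: str


def make_calc(keyword: str, table_a: str, table_b: str) -> AssociationCalc:
    # Assumption: the pair of tables is treated as unordered, so the two
    # names are sorted before being stored. Equal calculations then compare
    # equal regardless of the order in which the SQL statement named them.
    left, right = sorted((table_a, table_b))
    return AssociationCalc(keyword.upper(), left, right)


first_calc = make_calc("join", "T1.C", "T1.B")
```

Because the record is frozen it is hashable, so calculations can be stored in sets or used as dictionary keys when they are later compared across SQL statements.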

In some embodiments, the parsing the first SQL statement and identifying a first association query included in the first SQL statement includes:

parsing the first SQL statement to generate JSON data corresponding to the first SQL statement;

and identifying the first association query included in the first SQL statement according to the JSON data.
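The application does not give the layout of the JSON data. The sketch below assumes a hypothetical record with the fields `keyword`, `database`, `tables`, and `condition`, and shows how an association calculation might be read back from it. All field names and the sample condition are assumptions made for illustration.

```python
import json

# Hypothetical JSON record for one parsed association (field names assumed).
record = {
    "keyword": "LEFT JOIN",
    "database": "T1",
    "tables": ["B", "C"],
    "condition": "B.id = C.id",
}
payload = json.dumps(record, sort_keys=True)


def association_from_json(text):
    """Return (keyword family, sorted qualified table names) from a record."""
    data = json.loads(text)
    # Every JOIN variant is reduced to "JOIN"; anything with UNION to "UNION".
    family = "UNION" if "UNION" in data["keyword"].upper() else "JOIN"
    tables = tuple(sorted(f'{data["database"]}.{t}' for t in data["tables"]))
    return family, tables


calc = association_from_json(payload)
```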

In some embodiments, the second SQL statement includes a sub-query and an association query between the sub-query and a third data table, and the parsing the second SQL statement and identifying a second association query included in the second SQL statement includes:

parsing the second SQL statement, and identifying a second association relation keyword included in the sub-query and a fourth data table and a fifth data table corresponding to the second association relation keyword;

determining a second association calculation included in the second association query according to the second association relation keyword and the fourth data table and fifth data table corresponding to the second association relation keyword;

identifying a third association relation keyword included in the second SQL statement, and the third data table and the sub-query corresponding to the third association relation keyword;

determining a third association calculation included in the second association query according to the third association relation keyword, the third data table, and the fourth data table;

and determining a fourth association calculation included in the second association query according to the third association relation keyword, the third data table, and the fifth data table.
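The sub-query case above can be sketched as follows: the outer keyword that relates the third data table to the sub-query is expanded into one calculation per table inside the sub-query. The function name and the tuple layout are assumptions for illustration only.

```python
def calcs_for_nested_query(outer_keyword, third_table,
                           inner_keyword, fourth_table, fifth_table):
    """List the second, third, and fourth association calculations."""

    def calc(keyword, a, b):
        # Assumption: tables are stored as a sorted pair so order is ignored.
        return (keyword, tuple(sorted((a, b))))

    return [
        # Second calculation: the association inside the sub-query.
        calc(inner_keyword, fourth_table, fifth_table),
        # Third and fourth calculations: the outer keyword applied to the
        # third table and each table of the sub-query in turn.
        calc(outer_keyword, third_table, fourth_table),
        calc(outer_keyword, third_table, fifth_table),
    ]


calcs = calcs_for_nested_query("UNION", "A", "JOIN", "B", "C")
```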

In some embodiments, the first SQL statement and the second SQL statement comprise corresponding to-be-processed data tables, the method comprising:

and when the corresponding data table to be processed is a temporary table, replacing the data table to be processed with a corresponding entity table.
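A minimal sketch of the temporary-table replacement, assuming a mapping from temporary table names to the tables they were built from is already available. The mapping, the table names, and the function name are hypothetical; how the mapping is obtained is not described in the application.

```python
def resolve_table(name, temp_to_entity):
    """Follow temporary-table aliases until an entity table is reached."""
    seen = set()
    # The `seen` set guards against a cyclic mapping causing an endless loop.
    while name in temp_to_entity and name not in seen:
        seen.add(name)
        name = temp_to_entity[name]
    return name


# Hypothetical mapping: tmp_daily is derived from tmp_orders, which is
# derived from the entity table dw.orders.
temp_to_entity = {"tmp_orders": "dw.orders", "tmp_daily": "tmp_orders"}

resolved = [resolve_table(t, temp_to_entity) for t in ("tmp_daily", "dw.users")]
```

Resolving names before comparison means two statements that reach the same entity table through different temporary tables are still compared on the same table name.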

In some embodiments, the determining that the first SQL statement and the second SQL statement contain a repeated association calculation when the first association query and the second association query contain a repeated association calculation includes:

grouping the association calculations according to their association relation keywords;

and, when any group includes association calculations whose corresponding data tables are the same, determining that the first SQL statement and the second SQL statement contain a repeated association calculation.
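The grouping step can be sketched as below: calculations from each statement are grouped by keyword, and a repeat is reported when the same table pair appears in the same keyword group for both statements. The function names and the `(keyword, (table, table))` tuple layout are illustrative assumptions.

```python
from collections import defaultdict


def group_by_keyword(calcs):
    """Map each association relation keyword to the set of table pairs."""
    groups = defaultdict(set)
    for keyword, tables in calcs:
        # frozenset makes the pair order-independent and hashable.
        groups[keyword].add(frozenset(tables))
    return groups


def shared_calculations(first_calcs, second_calcs):
    """Return, per keyword, the table pairs present in both statements."""
    g1, g2 = group_by_keyword(first_calcs), group_by_keyword(second_calcs)
    shared = {}
    for keyword in g1.keys() & g2.keys():
        common = g1[keyword] & g2[keyword]
        if common:
            shared[keyword] = common
    return shared


first = [("UNION", ("A", "B")), ("UNION", ("A", "C")), ("JOIN", ("B", "C"))]
second = [("JOIN", ("C", "B"))]
shared = shared_calculations(first, second)
```

A non-empty result means the two statements contain a repeated association calculation; here the shared entry is the JOIN of tables B and C.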

In a second aspect, the present application provides an apparatus for identifying duplicate association calculations, the apparatus comprising:

the acquisition module is used for acquiring a first SQL statement and a second SQL statement to be identified;

the parsing module is used for parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement; and for parsing the second SQL statement and identifying a second association query included in the second SQL statement;

and the processing module is used for determining that the first SQL statement and the second SQL statement contain a repeated association calculation when the first association query and the second association query contain a repeated association calculation.

In some embodiments, the parsing module may be further configured to parse the first SQL statement, and identify a first association relation keyword included in the first SQL statement and a first data table and a second data table corresponding to the first association relation keyword; and to determine a first association calculation included in the first association query according to the first data table, the second data table, and the first association relation keyword.

In some embodiments, the second SQL statement includes a sub-query and an association query between the sub-query and a third data table, and the parsing module is further configured to: parse the second SQL statement, and identify a second association relation keyword included in the sub-query and a fourth data table and a fifth data table corresponding to the second association relation keyword; determine a second association calculation included in the second association query according to the second association relation keyword and the fourth data table and fifth data table corresponding to the second association relation keyword; identify a third association relation keyword included in the parsed second SQL statement, wherein the third association relation keyword describes the association query between the sub-query and the third data table; determine a third association calculation included in the second association query according to the third association relation keyword, the third data table, and the fourth data table; and determine a fourth association calculation included in the second association query according to the third association relation keyword, the third data table, and the fifth data table.

In a third aspect, the present application provides a computer system comprising:

one or more processors;

and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

acquiring a first SQL statement and a second SQL statement to be identified;

parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement;

parsing the second SQL statement and identifying a second association query included in the second SQL statement;

and, when the first association query and the second association query contain a repeated association calculation, determining that the first SQL statement and the second SQL statement contain a repeated association calculation.

The invention has the following beneficial effects:

The application provides a method for identifying repeated association calculation, which comprises: obtaining a first SQL statement and a second SQL statement to be identified; parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement; parsing the second SQL statement and identifying a second association query included in the second SQL statement; and, when the first association query and the second association query contain a repeated association calculation, determining that the first SQL statement and the second SQL statement contain a repeated association calculation. Identifying whether a plurality of SQL statements contain repeated association calculation allows those statements to be optimized and adjusted subsequently, which improves the operating efficiency of the data platform.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of identifying repeated association calculation provided by an embodiment of the present application;

FIG. 2 is a flow chart of identifying the repeated association calculation of tasks provided by an embodiment of the present application;

FIG. 3 is a flow chart of a method provided by an embodiment of the present application;

FIG. 4 is a block diagram of an apparatus provided by an embodiment of the present application;

FIG. 5 is a structural diagram of a computer system provided by an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Example one

To solve the problems described in the Background, the present application provides a method for identifying repeated association calculation. As shown in FIG. 1 and FIG. 2, identifying repeated association calculation with this method includes:

Step one: acquiring the SQL statements to be analyzed.

A UDF (user-defined function) can be used to extract the original SQL query tasks from Hive tasks and Spark SQL tasks, and then to extract the SQL statements to be analyzed from the original tasks.

Step two: parsing the SQL statements, and generating corresponding JSON data for each association query obtained by parsing.

The UDF used to parse the SQL statements can be developed with ANTLR.

Given an input SQL statement, the UDF can obtain by parsing the association relation keywords contained in the SQL statement, the to-be-queried data tables corresponding to each association relation keyword, the database in which those data tables are located, and the association conditions corresponding to each association relation keyword. It then generates the corresponding JSON data from all the data obtained by parsing.

When the to-be-queried data tables include a temporary table, the temporary table can be replaced with the corresponding entity table.

The association relation keywords can include JOIN and UNION. JOIN covers JOIN, LEFT JOIN, INNER JOIN, RIGHT JOIN, CROSS JOIN, FULL JOIN, and NOT_JOIN. To ensure identification accuracy, parsing rules can be preset for each type of SQL statement, where the types are: SQL statements containing JOIN, SQL statements containing UNION, and SQL statements containing both JOIN and UNION or containing sub-queries. Once the association relation keywords contained in an SQL statement have been obtained by parsing, the SQL statement is further parsed according to the corresponding parsing rule to identify the to-be-queried data tables corresponding to each keyword, the database in which those tables are located, the association conditions corresponding to each keyword, and so on.
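As a rough illustration of the keyword families and the three statement types, the sketch below uses regular expressions in place of the ANTLR-based parser the text mentions. It is a simplified stand-in under stated assumptions: it ignores string literals and comments, and the function names and type labels (`join`, `union`, `mixed`, `none`) are invented for this example.

```python
import re


def keyword_family(keyword):
    """Reduce a concrete keyword such as LEFT JOIN to its family."""
    upper = keyword.upper()
    if "UNION" in upper:
        return "UNION"
    if "JOIN" in upper:
        return "JOIN"
    raise ValueError(f"not an association relation keyword: {keyword!r}")


def statement_type(sql):
    """Pick which preset parsing rule would apply to a statement."""
    upper = sql.upper()
    has_join = re.search(r"\bJOIN\b", upper) is not None
    has_union = re.search(r"\bUNION\b", upper) is not None
    has_subquery = re.search(r"\(\s*SELECT\b", upper) is not None
    if has_subquery or (has_join and has_union):
        return "mixed"
    if has_join:
        return "join"
    if has_union:
        return "union"
    return "none"


kind = statement_type("SELECT * FROM b LEFT JOIN c ON b.id = c.id")
```

A production implementation would make this decision on the parse tree rather than on raw text, since a keyword inside a quoted string would mislead the regular expressions.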

Step three: determining the association queries contained in each SQL statement according to the JSON data corresponding to that SQL statement.

To identify an association query, the one or more association calculations it includes need to be identified. The association relation keywords and corresponding data tables included in the SQL statement can be determined from the JSON data; an association relation keyword together with its corresponding data tables forms one association calculation.

For example, according to the JSON data, the first SQL statement includes an association query on table A and a sub-query of the T1 database, and the sub-query includes an association query on table B and table C of the T1 database. The association relation keyword between table B and table C is LEFT JOIN; the association relation keyword between table A and the sub-query is UNION, and the association condition is a first association condition. It is therefore determined that the first association relation keyword contained in the SQL statement is JOIN, the corresponding to-be-queried data tables are table B and table C, and the database in which they are located is T1; the second association relation keyword is UNION, the corresponding to-be-queried data tables are table A and table B, and the association condition is the first association condition; the third association relation keyword is UNION, the corresponding to-be-queried data tables are table A and table C, and the association condition is the first association condition. The first SQL statement thus includes three association calculations: the UNION association calculation of table A and table B, the UNION association calculation of table A and table C, and the JOIN association calculation of table B and table C.

According to the JSON data, the second SQL statement includes an association query on table B and table C of the T1 database with the association relation keyword RIGHT JOIN, so the association calculation included in the second SQL statement is the JOIN association calculation of table B and table C.

Step four: grouping the JSON data by the association relation keywords they include, and counting, within each group, whether JSON data with the same to-be-processed data tables appear and how many times they appear.

Step five: when the number of occurrences is not less than a preset threshold, determining that the JSON data with the same to-be-processed data tables include a repeated association calculation.

Preferably, when the number of occurrences is not less than 1, it is determined that the JSON data with the same to-be-processed data tables include a repeated association calculation.

Since the first SQL statement and the second SQL statement both include the JOIN association calculation of table B and table C, it can be determined that the first SQL statement and the second SQL statement contain a repeated association calculation.
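The counting in steps four and five can be sketched with the table A, B, C example. One assumption needs stating plainly: the text does not define how an "occurrence" is counted, so this sketch counts the additional statements, beyond the first, that contain the same calculation, which makes a threshold of 1 mean "seen in at least two statements". The function name and data layout are also illustrative.

```python
from collections import Counter

# Association calculations from the worked example, as (keyword, table, table).
first_sql = [("UNION", "A", "B"), ("UNION", "A", "C"), ("JOIN", "B", "C")]
second_sql = [("JOIN", "B", "C")]


def find_repeated(statements, threshold=1):
    """Return calculations repeated across statements at least `threshold` times."""
    counts = Counter()
    for calcs in statements:
        # Normalize first, then count each calculation once per statement.
        unique = {(kw, frozenset((t1, t2))) for kw, t1, t2 in calcs}
        counts.update(unique)
    # Assumed convention: repeats = occurrences beyond the first statement.
    return {key for key, n in counts.items() if n - 1 >= threshold}


repeated = find_repeated([first_sql, second_sql])
```

Keying the counter on the keyword together with the table pair has the same effect as grouping by keyword first and then comparing table pairs within each group.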

A Hive table can be generated from the identified repeated association calculations and the corresponding SQL statements and provided to technical staff for reference, so that they can optimize and adjust the SQL statements and their query processes and improve the operating efficiency of the system.

Example two

Corresponding to the foregoing embodiments, the present application provides an identification method for repeated correlation calculation, as shown in fig. 3, the method includes:

310. Acquiring a first SQL statement and a second SQL statement to be identified;

Preferably, the first SQL statement and the second SQL statement include corresponding to-be-processed data tables, and the method includes:

311. When a to-be-processed data table is a temporary table, replacing the to-be-processed data table with the corresponding entity table.

320. Parsing the first SQL statement and identifying a first association query included in the first SQL statement, wherein an association query includes the association calculation between data tables that must be performed to execute the SQL statement;

Preferably, an association calculation includes corresponding data tables and an association relation keyword, where the association relation keyword describes the association calculation to be performed between the data tables, and the parsing the first SQL statement and identifying a first association query included in the first SQL statement includes:

321. Parsing the first SQL statement, and identifying a first association relation keyword contained in the first SQL statement and a first data table and a second data table corresponding to the first association relation keyword;

322. Determining a first association calculation included in the first association query according to the first data table, the second data table, and the first association relation keyword.

Preferably, the parsing the first SQL statement and identifying the first association query included in the first SQL statement includes:

323. Parsing the first SQL statement to generate JSON data corresponding to the first SQL statement;

324. Identifying the first association query included in the first SQL statement according to the JSON data.

330. Parsing the second SQL statement and identifying a second association query included in the second SQL statement;

Preferably, the second SQL statement includes a sub-query and an association query between the sub-query and a third data table, and the parsing the second SQL statement and identifying the second association query included in the second SQL statement includes:

331. Parsing the second SQL statement, and identifying a second association relation keyword included in the sub-query and a fourth data table and a fifth data table corresponding to the second association relation keyword;

332. Determining a second association calculation included in the second association query according to the second association relation keyword and the fourth data table and fifth data table corresponding to the second association relation keyword;

333. Identifying a third association relation keyword included in the second SQL statement, and the third data table and the sub-query corresponding to the third association relation keyword;

334. Determining a third association calculation included in the second association query according to the third association relation keyword, the third data table, and the fourth data table;

335. Determining a fourth association calculation included in the second association query according to the third association relation keyword, the third data table, and the fifth data table.

340. When the first association query and the second association query contain a repeated association calculation, determining that the first SQL statement and the second SQL statement contain a repeated association calculation.

Preferably, the determining that the first SQL statement and the second SQL statement contain a repeated association calculation when the first association query and the second association query contain a repeated association calculation includes:

341. Grouping the association calculations according to their association relation keywords;

342. When any group includes association calculations whose corresponding data tables are the same, determining that the first SQL statement and the second SQL statement contain a repeated association calculation.

Example three

In response to the above method, the present application proposes an apparatus for identifying duplicate association calculation, as shown in fig. 4, the apparatus including:

an obtaining module 410, configured to obtain a first SQL statement and a second SQL statement to be identified;

a parsing module 420, configured to parse the first SQL statement and identify a first association query included in the first SQL statement, where an association query includes the association calculation between data tables that must be performed to execute the SQL statement; and to parse the second SQL statement and identify a second association query included in the second SQL statement;

a processing module 430, configured to determine that the first SQL statement and the second SQL statement contain a repeated association calculation when the first association query and the second association query contain a repeated association calculation.

Preferably, the parsing module 420 may be further configured to parse the first SQL statement, and identify a first association relation keyword included in the first SQL statement and a first data table and a second data table corresponding to the first association relation keyword; and to determine a first association calculation included in the first association query according to the first data table, the second data table, and the first association relation keyword.

Preferably, the second SQL statement includes a sub-query and an association query between the sub-query and a third data table, and the parsing module 420 may be further configured to: parse the second SQL statement, and identify a second association relation keyword included in the sub-query and a fourth data table and a fifth data table corresponding to the second association relation keyword; determine a second association calculation included in the second association query according to the second association relation keyword and the fourth data table and fifth data table corresponding to the second association relation keyword; identify a third association relation keyword included in the parsed second SQL statement, wherein the third association relation keyword describes the association query between the sub-query and the third data table; determine a third association calculation included in the second association query according to the third association relation keyword, the third data table, and the fourth data table; and determine a fourth association calculation included in the second association query according to the third association relation keyword, the third data table, and the fifth data table.

Preferably, the parsing module 420 is further configured to parse the first SQL statement to generate JSON data corresponding to the first SQL statement, and to identify the first association query included in the first SQL statement according to the JSON data.

Preferably, the first SQL statement and the second SQL statement include corresponding to-be-processed data tables, and the obtaining module 410 is further configured to replace a to-be-processed data table with the corresponding entity table when the to-be-processed data table is a temporary table.

Preferably, the processing module 430 is further configured to group the association calculations according to their association relation keywords, and, when any group includes association calculations whose corresponding data tables are the same, to determine that the first SQL statement and the second SQL statement contain a repeated association calculation.

Example four

Corresponding to the above method, apparatus, and system, a fourth embodiment of the present application provides a computer system, including:

one or more processors; and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

acquiring a first SQL statement and a second SQL statement to be identified;

Fig. 5 illustrates an architecture of a computer system 1500 that may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected by a communication bus 1530.

The processor 1510 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the present application.

The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a basic input/output system (BIOS) 1522 for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like may also be stored. The icon font processing system 1525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.

The input/output interface 1513 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various sensors, etc., and the output devices may include a display, speaker, vibrator, indicator light, etc.

The network interface 1514 is used to connect a communication module (not shown) to enable communication between the present device and other devices. The communication module can communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).

The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.

In addition, the computer system 1500 may also obtain information of specific pickup conditions from the virtual resource object pickup condition information database 1541 for performing condition judgment, and the like.

It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the device may also include other components necessary for normal operation. In addition, it will be understood by those skilled in the art that the above-described apparatus may also include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in the figures.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, refer to the corresponding descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, so they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for identifying repeated association calculations, the method comprising:

acquiring a first SQL statement and a second SQL statement to be identified;

when repeated correlation calculation exists in the first correlation query and the second correlation query, determining that repeated correlation calculation exists between the first SQL statement and the second SQL statement;

the analyzing the second SQL statement and identifying the second associated query included in the second SQL statement includes:

determining a second association calculation included in the second association query according to the second association relation keyword and a fourth data table and a fifth data table corresponding to the second association relation keyword;

identifying a third association relation keyword included in the second SQL statement and a third data table and a sub-query corresponding to the third association relation keyword;

2. The method of claim 1, wherein the association calculation includes corresponding data tables and association keywords, the association keywords being used to describe the association calculations required to be performed between the data tables, and wherein the parsing of the first SQL statement and the identifying of a first association query included in the first SQL statement includes:

3. The method of claim 1, wherein parsing the first SQL statement and identifying a first associated query comprised by the first SQL statement comprises:

4. The method according to any of claims 1-3, wherein the first SQL statement and the second SQL statement include corresponding to-be-processed data tables, and the method includes:

5. The method of claim 2, wherein determining that there is a duplicate association calculation between the first SQL statement and the second SQL statement when there is a duplicate association calculation between the first association query and the second association query comprises:

6. An apparatus for identifying duplicate association calculations, the apparatus comprising:

the analysis module is used for analyzing the first SQL statement and identifying a first correlation query included in the first SQL statement, wherein the correlation query includes correlation calculation between data tables required by execution of the SQL statement; analyzing the second SQL statement and identifying a second associated query included in the second SQL statement;

the processing module is used for determining that repeated correlation calculation exists between the first SQL statement and the second SQL statement when repeated correlation calculation exists between the first correlation query and the second correlation query;

wherein the second SQL statement comprises a sub-query and an association query between the sub-query and a third data table, and the analysis module is further configured to: parse the second SQL statement and identify a second association relation keyword and a fourth data table and a fifth data table corresponding to the second association relation keyword, the second association relation keyword being included in the sub-query; determine a second association calculation included in the second association query according to the second association relation keyword and the fourth data table and the fifth data table corresponding to the second association relation keyword; identify a third association relation keyword included in the parsed second SQL statement, the third association relation keyword describing the association query between the sub-query and the third data table; determine a third association calculation included in the second association query according to the third association relation keyword, the third data table and the fourth data table; and determine a fourth association calculation included in the second association query according to the third association relation keyword, the third data table and the fifth data table.

7. The apparatus according to claim 6, wherein the analysis module is further configured to: parse the first SQL statement and identify a first association relation keyword included in the first SQL statement and a first data table and a second data table corresponding to the first association relation keyword; and determine a first association calculation included in the first association query according to the first data table, the second data table and the first association relation keyword.
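
By way of non-limiting illustration only, the expansion of an association with a sub-query into pairwise association calculations, and the comparison of two statements for repeated association calculations, may be sketched in Python as follows. The sketch assumes that each SQL statement has already been parsed into a small tree; the `Join` class, the function names and the table names (`orders`, `users`, `regions`) are hypothetical and do not appear in the present application. It further assumes that two association calculations count as repeated when their association keyword and both data tables match exactly, which is one possible criterion among others.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union

# Hypothetical parsed form of an association query: each operand is either
# a data table name or a nested Join standing for a sub-query.
@dataclass(frozen=True)
class Join:
    keyword: str                      # association relation keyword, e.g. "JOIN"
    left: Union[str, "Join"]
    right: Union[str, "Join"]

Node = Union[str, Join]
Calc = Tuple[str, str, str]           # (keyword, left table, right table)

def base_tables(node: Node) -> List[str]:
    """Return the data tables reachable from a table name or a sub-query."""
    if isinstance(node, str):
        return [node]
    return base_tables(node.left) + base_tables(node.right)

def association_calculations(node: Node) -> Set[Calc]:
    """Collect every pairwise association calculation implied by the tree.

    An association with a sub-query is expanded into one calculation per
    data table inside that sub-query, each using the outer keyword.
    """
    if isinstance(node, str):
        return set()
    calcs = association_calculations(node.left) | association_calculations(node.right)
    for left_table in base_tables(node.left):
        for right_table in base_tables(node.right):
            calcs.add((node.keyword, left_table, right_table))
    return calcs

def repeated_calculations(first: Node, second: Node) -> Set[Calc]:
    """Association calculations that appear in both statements (assumed criterion)."""
    return association_calculations(first) & association_calculations(second)

# First statement: orders JOIN users
first_query = Join("JOIN", "orders", "users")
# Second statement: orders JOIN (sub-query: users JOIN regions)
second_query = Join("JOIN", "orders", Join("JOIN", "users", "regions"))

repeated = repeated_calculations(first_query, second_query)
print(sorted(repeated))
```

In this sketch the second statement yields three association calculations: one inside the sub-query, and one between the outer data table and each data table of the sub-query. The two statements are reported as containing a repeated association calculation when the returned set is non-empty.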

8. A computer system, the system comprising:

one or more processors;

acquiring a first SQL statement and a second SQL statement to be identified;

when the first correlation query and the second correlation query have repeated correlation calculation, determining that the first SQL statement and the second SQL statement have repeated correlation calculation;