CN116795663B - Method for tracking and analyzing execution performance of trino engine - Google Patents

Method for tracking and analyzing execution performance of trino engine

Info

Publication number
CN116795663B
Authority
CN
China
Prior art keywords
sql
engine
json data
query
trino
Prior art date
Legal status
Active
Application number
CN202311072998.1A
Other languages
Chinese (zh)
Other versions
CN116795663A (en)
Inventor
韩雷
张琪浩
张自平
Current Assignee
Yunzhu Information Technology Chengdu Co ltd
Original Assignee
Yunzhu Information Technology Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Yunzhu Information Technology Chengdu Co ltd filed Critical Yunzhu Information Technology Chengdu Co ltd
Priority to CN202311072998.1A
Publication of CN116795663A
Application granted
Publication of CN116795663B
Active
Anticipated expiration


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data, and in particular to a method for tracking and analyzing the execution performance of a trino engine. The method comprises: collecting SQL query metadata through the event-listening mechanism of the trino engine and parsing the SQL query metadata into first JSON data; and clustering the first JSON data based on SQL features to generate second JSON data, which is automatically written into a Hive database. The method obtains SQL execution information through the event-listening mechanism provided by the trino engine and stores the information in an external Hive database, so that massive amounts of SQL information can be retained; this solves the problem that historical SQL execution records cannot be kept because of the memory pressure created by trino engine computation. Clustering based on SQL features makes it possible to identify SQL statements of the same type, with high classification accuracy and high classification speed.

Description

Method for tracking and analyzing execution performance of trino engine
Technical Field
The invention relates to the technical field of big data, in particular to a method for tracking and analyzing the execution performance of a trino engine.
Background
In the big data field, where trino serves as an OLAP engine, SQL execution speed reflects cluster performance and is one of the important OLAP indicators. In an actual production environment, however, tracking SQL performance becomes difficult. The usual practice is to set an execution-time threshold at the database level, label any query that exceeds it as slow SQL, and push the slow SQL to the business side through some alarm mechanism. Several problems arise here. First, the standard for slow SQL differs from one SQL statement to another, and a suitable execution-time threshold is hard to estimate. Second, the storage capacity of the trino engine is limited, and all historical SQL execution data cannot be kept in memory. Third, even for the same SQL statement, the query speed differs because the parameters supplied by daily queries differ, so the SQL workload cannot be aggregated for statistics, which costs both money and time. For these reasons, year-on-year and period-on-period comparisons are lacking in a big data environment that executes massive amounts of SQL, which makes it difficult to optimize trino performance.
Accordingly, the present invention provides a method for tracking and analyzing the execution performance of a trino engine, so as to solve at least some of the above technical problems.
Disclosure of Invention
The technical problem to be solved by the invention is: to provide a method for tracking and analyzing the execution performance of a trino engine, so as to solve at least some of the above technical problems.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method of tracking and analyzing the performance of a trino engine, comprising the steps of:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database.
Further, the step 1 includes: step 11, initializing a kafka client; step 12, capturing and listening for a query-created event and a query-completed event based on the event listener of the trino engine coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client.
Further, in the step 12, the SQL query metadata corresponding to the query-completed event includes, but is not limited to, the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query.
Further, in the step 13, the collected metadata is parsed into the first JSON data by the query-completed interface function of the event listener.
Further, the step 2 includes: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; and step 25, automatically writing the second JSON data into the Hive database.
Further, the step 24 includes: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data.
Further, in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine.
Further, in the step 21, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
Compared with the prior art, the invention has the following beneficial effects:
the method and the device acquire SQL execution information based on the event monitoring mode provided by the trino engine, store the information into the external Hive database, store massive SQL information, and solve the problem that historical SQL execution records cannot be stored due to the memory problem caused by the trino engine calculation. The invention performs clustering processing based on SQL features, can identify the same type of SQL, and has high classification accuracy and high classification speed.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
Technical terms:
the trino engine is an open-source distributed SQL query engine suitable for interactive analytical queries;
JSON is a lightweight data-interchange format;
Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams;
kafka is a distributed message engine and stream-processing platform;
kafka-client is the official Kafka client library used to interact with Kafka;
the Kafka Connect component is a tool for scalable and reliable streaming of data between Apache Kafka and other data systems;
the SQL Parser component is a tool that performs lexical and syntactic analysis on SQL statements and then supports deconstructing, processing, replacing and reassembling them;
the Hive database is a Hadoop-based data warehouse tool used for extracting, transforming and loading data;
the MD5 algorithm is one of the hash algorithms, known as the Message-Digest Algorithm.
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the big data field, the trino engine is generally used to run SQL queries against Hive data sources. An SQL query task is submitted to the Coordinator component through JDBC (Java Database Connectivity); the Coordinator component then parses the SQL to obtain the tables and fields used in the query, and distributes the decomposed tasks to worker nodes for concurrent execution. The execution results are finally returned to the Coordinator component and then to the client. The trino engine cannot keep all historical SQL execution data in memory and cannot identify SQL statements of the same type.
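For orientation only, the following Java sketch shows how a client might submit such a query to the Coordinator over JDBC. It assumes the Trino JDBC driver is on the classpath; the coordinator address, catalog, schema, user and query are placeholders, not values taken from the patent.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: submit an SQL query to the trino Coordinator over JDBC
// and read back the results produced by the worker nodes.
public class TrinoJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed coordinator address and Hive catalog/schema.
        String url = "jdbc:trino://coordinator-host:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM userInfo")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```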
To this end, the method for tracking and analyzing the execution performance of a trino engine provided by the invention, as shown in FIG. 1, comprises the following steps:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database.
Based on the event-listening mechanism provided by the trino engine, SQL execution information is obtained and stored in an external Hive database, so that massive amounts of SQL information can be retained; clustering based on SQL features makes it possible to identify SQL statements of the same type. The method thus solves the problem that historical SQL execution records cannot be kept because of the memory pressure created by existing trino engine computation, and it achieves high classification accuracy and high classification speed.
In some embodiments, the step 1 includes: step 11, initializing a kafka client, which is used to send the collected data to the kafka cluster; step 12, capturing and listening for a query-created event (QueryCreatedEvent) and a query-completed event (QueryCompletedEvent) based on the event listener (EventListener) of the trino engine Coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client. The event listener (EventListener) exposed by a Coordinator component node allows a user to capture the query-created event (QueryCreatedEvent) and the query-completed event (QueryCompletedEvent). To this end, the invention uses the EventListener interface to capture and listen for the QueryCreatedEvent and the QueryCompletedEvent, and then collects the SQL query metadata corresponding to the query-completed event in the trino engine, where the query metadata includes but is not limited to the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query. Preferably, the collected metadata is parsed into the first JSON data through the query-completed (queryCompleted) interface function of the event listener (EventListener). Finally, the first JSON data is sent into the kafka cluster with the kafka client for subsequent consumption by the SQL clustering service.
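A minimal sketch of such a listener is given below. It assumes Trino's EventListener SPI (accessor names follow recent Trino releases and may differ across versions) and the Apache Kafka producer client; the topic name, broker address and JSON field names are assumptions made for the example, and registration of the listener through an EventListenerFactory plugin is omitted.

```java
import io.trino.spi.eventlistener.EventListener;
import io.trino.spi.eventlistener.QueryCompletedEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Sketch of step 1: capture query-completed events and forward them to Kafka as JSON
// (the "first JSON data"). Registration via an EventListenerFactory plugin is omitted.
public class QueryMetadataListener implements EventListener {

    private static final String TOPIC = "trino-query-events";   // assumed topic name
    private final KafkaProducer<String, String> producer;       // step 11: kafka client

    public QueryMetadataListener() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");            // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {      // step 12: query-completed event
        // Step 13: build the first JSON data from the metadata named in the patent.
        String json = String.format(
                "{\"queryId\":\"%s\",\"user\":\"%s\",\"sqlText\":%s,"
                        + "\"createTime\":\"%s\",\"endTime\":\"%s\"}",
                event.getMetadata().getQueryId(),
                event.getContext().getUser(),
                quote(event.getMetadata().getQuery()),
                event.getCreateTime(),
                event.getEndTime());
        // Step 14: send the first JSON data into the kafka cluster; send() is asynchronous,
        // so the listener does not block query processing on the Coordinator.
        producer.send(new ProducerRecord<>(TOPIC, event.getMetadata().getQueryId(), json));
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```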
In some embodiments, the step 2 includes: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; and step 25, automatically writing the second JSON data into the Hive database.
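A rough Java sketch of steps 21 to 24 is shown below. It assumes the Flink DataStream API with the Kafka connector and Trino's SqlParser; the topic, consumer group and JSON field names are placeholders, and SqlFeatureExtractor is a hypothetical helper whose own sketch follows the description of step 24 below. The Hive sink of step 25 is sketched separately.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import io.trino.sql.parser.SqlParser;
import io.trino.sql.tree.Statement;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch of steps 21-24: ingest the first JSON data from Kafka, parse the SQL text with
// Trino's SqlParser inside a Map operator, and attach the SQL feature fingerprint.
public class SqlClusteringJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 21: Kafka source for the first JSON data (assumed topic and group id).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("trino-query-events")
                .setGroupId("sql-clustering")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> firstJson =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "trino-events");

        // Steps 22-24: Map operator holding one SqlParser instance per task.
        DataStream<String> secondJson = firstJson.map(new RichMapFunction<String, String>() {
            private transient SqlParser parser;
            private transient ObjectMapper mapper;

            @Override
            public void open(Configuration parameters) {
                parser = new SqlParser();                 // step 22: initialize the parser instance
                mapper = new ObjectMapper();
            }

            @Override
            public String map(String value) throws Exception {
                ObjectNode node = (ObjectNode) mapper.readTree(value);
                String sql = node.get("sqlText").asText();    // step 23: extract the query field
                // Older Trino versions require an additional ParsingOptions argument here.
                Statement ast = parser.createStatement(sql);
                // Step 24: feature fingerprint derived from the AST (hypothetical helper,
                // sketched after the description of step 24 below).
                node.put("sqlFeature", SqlFeatureExtractor.fingerprint(ast));
                return mapper.writeValueAsString(node);       // the second JSON data
            }
        });

        secondJson.print();   // the Hive sink of step 25 is sketched separately
        env.execute("trino-sql-clustering");
    }
}
```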
When an abstract syntax tree is parsed, every node corresponds to a node type, and each SQL engine maintains a type-mapping code table; the trino engine maintains mapping relations for hundreds of keywords. SQL features therefore need to be extracted based on the abstract syntax tree and the type mapping, after which the SQL features that represent variables are replaced and modified. For example, in the SQL query select age from user where name = 'xxxx', the value 'xxxx' is the SQL feature that represents a variable.
Preferably, the step 24 includes: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data. For example, consider the two SQL queries select username, age from userInfo where age > 20 and level > 5; and select username, age from userInfo where age > 30 and level > 10;. The content queried is the same, but the query conditions differ. For SQL clustering purposes the features of the two statements are identical, so after feature extraction both SQL queries are converted to the same 32-character feature string: 18aa7350edf82111e6d03abfd2fa1fba.
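The Java sketch below shows one possible realization of steps 241 to 243. The class name SqlFeatureExtractor, the set of literal node types treated as variables, and the use of Node.getChildren() for the depth-first traversal are illustrative assumptions, not the patent's exact implementation.

```java
import io.trino.sql.tree.DecimalLiteral;
import io.trino.sql.tree.DoubleLiteral;
import io.trino.sql.tree.Identifier;
import io.trino.sql.tree.LongLiteral;
import io.trino.sql.tree.Node;
import io.trino.sql.tree.StringLiteral;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Sketch of steps 241-243: walk the AST depth-first, replace literal values with "?",
// and hash the concatenated feature set into a 32-character MD5 fingerprint.
public final class SqlFeatureExtractor {

    public static String fingerprint(Node root) throws Exception {
        List<String> features = new ArrayList<>();    // step 241: feature set
        collect(root, features);                      // step 242: depth-first traversal
        String joined = String.join(" ", features);   // step 243: concatenate the features...
        return md5Hex(joined);                        // ...and hash them to 32 hex characters
    }

    private static void collect(Node node, List<String> features) {
        if (node instanceof StringLiteral
                || node instanceof LongLiteral
                || node instanceof DoubleLiteral
                || node instanceof DecimalLiteral) {
            features.add("?");                                   // literal value -> placeholder
        } else if (node instanceof Identifier) {
            features.add(((Identifier) node).getValue());        // keep table/column names
        } else {
            features.add(node.getClass().getSimpleName());       // keep the structural node type
        }
        for (Node child : node.getChildren()) {
            collect(child, features);
        }
    }

    private static String md5Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With this normalization the two example queries above differ only in their literal values, so they collapse to the same feature string and are counted as one SQL class.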
Preferably, in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine. Preferably, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
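One way to wire the kafka-connector and the Hive sink is through Flink's Table API, as in the hedged sketch below. It assumes a registered HiveCatalog, an intermediate Kafka topic (here trino-query-features) carrying the second JSON data, and that the target table hive.perf_db.trino_sql_history already exists in Hive; the topic, database and column names are placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

// Sketch of the kafka-connector of step 21 and the Hive sink of step 25, using Flink SQL:
// a Kafka source table for the JSON data and an INSERT INTO an existing Hive table.
public class HiveSinkJob {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumed Hive catalog; the conf directory must contain hive-site.xml.
        HiveCatalog hive = new HiveCatalog("hive", "perf_db", "/etc/hive/conf");
        tEnv.registerCatalog("hive", hive);

        // Kafka source table for the second JSON data (assumed intermediate topic).
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE trino_query_features ("
                        + "  queryId STRING, `user` STRING, sqlText STRING,"
                        + "  sqlFeature STRING, createTime STRING, endTime STRING"
                        + ") WITH ("
                        + "  'connector' = 'kafka',"
                        + "  'topic' = 'trino-query-features',"
                        + "  'properties.bootstrap.servers' = 'kafka:9092',"
                        + "  'properties.group.id' = 'sql-clustering-sink',"
                        + "  'scan.startup.mode' = 'earliest-offset',"
                        + "  'format' = 'json')");

        // Hive sink of step 25: continuously write into the assumed, pre-existing Hive table.
        tEnv.executeSql(
                "INSERT INTO hive.perf_db.trino_sql_history "
                        + "SELECT queryId, `user`, sqlText, sqlFeature, createTime, endTime "
                        + "FROM trino_query_features");
    }
}
```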
After the collected SQL query metadata is sent to kafka and clustered, the second JSON data is automatically written into the Hive database. The data can then be analyzed, for example by grouping the records on the feature string to compute the historical execution speed, execution count and similar statistics for each class of SQL, which provides data references for performance optimization to operation-and-maintenance engineers and research-and-development engineers.
Finally, it should be noted that the above embodiments are merely preferred embodiments used to illustrate the technical solution of the invention, not to limit its scope of protection. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the corresponding technical solutions. Even where the main design concept and spirit of the invention are modified in an insubstantial manner, as long as the technical problem solved remains consistent with that of the invention, the result falls within the scope of protection of the invention; likewise, applying the technical solution of the invention directly or indirectly to other related technical fields also falls within the scope of protection of the invention.

Claims (5)

1. A method for tracking and analyzing execution performance of a trino engine, comprising the steps of:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database;
the step 1 comprises: step 11, initializing a kafka client; step 12, capturing and listening for a query-created event and a query-completed event based on the event listener of the trino engine coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client;
the step 2 comprises: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; step 25, automatically writing the second JSON data into the Hive database;
the step 24 comprises: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data.
2. The method of claim 1, wherein in the step 12, the SQL query metadata corresponding to the query-completed event includes the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query.
3. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 13, the collected metadata is parsed into the first JSON data by the query-completed interface function of the event listener.
4. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine.
5. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 21, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
CN202311072998.1A 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine Active CN116795663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311072998.1A CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311072998.1A CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Publications (2)

Publication Number Publication Date
CN116795663A CN116795663A (en) 2023-09-22
CN116795663B true CN116795663B (en) 2023-12-08

Family

ID=88046206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311072998.1A Active CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Country Status (1)

Country Link
CN (1) CN116795663B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579627B2 (en) * 2016-01-08 2020-03-03 Microsoft Technology Licensing, Llc Database operation using metadata of data sources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544314A (en) * 2022-10-09 2022-12-30 武汉众邦银行股份有限公司 Unstructured database system
CN115934789A (en) * 2022-11-24 2023-04-07 深圳航天智慧城市系统技术研究院有限公司 AST-based distributed SQL query method, AST-based distributed SQL query device and AST-based distributed SQL query storage medium
CN115659314A (en) * 2022-12-13 2023-01-31 合肥喆塔科技有限公司 Data service method based on mixed data
CN115905298A (en) * 2023-01-04 2023-04-04 上海通联金融服务有限公司 Presto query engine based on http data source and netty

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Scheme and practice of IPTV user viewing-behavior analysis based on MapR; Fang Ai; Zhang Yuzhong; Xu Xiong; Jin Duo; Telecommunications Science (Issue 02); 143-148 *

Also Published As

Publication number Publication date
CN116795663A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN111526060B (en) Method and system for processing service log
US11120344B2 (en) Suggesting follow-up queries based on a follow-up recommendation machine learning model
US11914588B1 (en) Determining a user-specific approach for disambiguation based on an interaction recommendation machine learning model
US11941034B2 (en) Conversational database analysis
CN102609451B (en) SQL (structured query language) query plan generation method oriented to streaming data processing
CN109614432B (en) System and method for acquiring data blood relationship based on syntactic analysis
US11170016B2 (en) Navigating hierarchical components based on an expansion recommendation machine learning model
JP2017515180A (en) Processing data sets in big data repositories
CN104899314A (en) Pedigree analysis method and device of data warehouse
US20110307472A1 (en) System and method for managing metadata in a relational database
US11494395B2 (en) Creating dashboards for viewing data in a data storage system based on natural language requests
US11416477B2 (en) Systems and methods for database analysis
CN109710767B (en) Multilingual big data service platform
US20190034247A1 (en) Creating alerts associated with a data storage system based on natural language requests
CN114218472A (en) Intelligent search system based on knowledge graph
WO2022134363A1 (en) Service report integration method and apparatus, device, and storage medium
CN113326261B (en) Data blood relationship extraction method and device and electronic equipment
KR20150023973A (en) The system for collecting and analyzing of information of social network
CN110580170B (en) Method and device for identifying software performance risk
CN112182025A (en) Log analysis method, device, equipment and computer readable storage medium
CN116795663B (en) Method for tracking and analyzing execution performance of trino engine
US20220100742A1 (en) Lossless Switching Between Search Grammars
CN116126950A (en) Real-time materialized view system and method
CN113010483A (en) Mass log management method and system
Spanos et al. SensorStream: A semantic real–time stream management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant