CN116795663B - Method for tracking and analyzing execution performance of trino engine - Google Patents

Method for tracking and analyzing execution performance of trino engine

Info

Publication number
CN116795663B
Authority
CN
China
Prior art keywords
sql
engine
json data
query
trino
Prior art date
Legal status
Active
Application number
CN202311072998.1A
Other languages
Chinese (zh)
Other versions
CN116795663A (en)
Inventor
韩雷
张琪浩
张自平
Current Assignee
Yunzhu Information Technology Chengdu Co ltd
Original Assignee
Yunzhu Information Technology Chengdu Co ltd
Priority date
Filing date
Publication date
Application filed by Yunzhu Information Technology Chengdu Co ltd filed Critical Yunzhu Information Technology Chengdu Co ltd
Priority to CN202311072998.1A
Publication of CN116795663A
Application granted
Publication of CN116795663B
Active
Anticipated expiration


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of big data, and in particular to a method for tracking and analyzing the execution performance of a trino engine. The method comprises: collecting SQL query metadata through the event-listening mechanism of the trino engine and parsing the SQL query metadata into first JSON data; and clustering the first JSON data based on SQL features to generate second JSON data, which is automatically written into a Hive database. The method obtains SQL execution information through the event-listening mechanism provided by the trino engine and stores the information in an external Hive database, so that massive amounts of SQL information can be retained; this solves the problem that historical SQL execution records cannot be kept because of the memory pressure created by trino engine computation. Clustering based on SQL features makes it possible to identify SQL statements of the same type, with high classification accuracy and high classification speed.

Description

Method for tracking and analyzing execution performance of trino engine
Technical Field
The invention relates to the technical field of big data, in particular to a method for tracking and analyzing the execution performance of a trino engine.
Background
In the big data field, where trino serves as an OLAP engine, SQL execution speed reflects cluster performance and is one of the important OLAP indicators. In an actual production environment, however, tracking SQL performance becomes difficult. The usual practice is to set an execution-time threshold at the database level, label any query that exceeds it as slow SQL, and push the slow SQL to the business side through some alarm mechanism. Several problems arise here. First, the standard for slow SQL differs from one SQL statement to another, and a suitable execution-time threshold is hard to estimate. Second, the storage capacity of the trino engine is limited, and all historical SQL execution data cannot be kept in memory. Third, even for the same SQL statement, the query speed differs because the parameters supplied by daily queries differ, so the SQL workload cannot be aggregated for statistics, which costs both money and time. For these reasons, year-on-year and period-on-period comparisons are lacking in a big data environment that executes massive amounts of SQL, which makes it difficult to optimize trino performance.
Accordingly, the present invention provides a method for tracking and analyzing the execution performance of a trino engine, so as to solve at least some of the above technical problems.
Disclosure of Invention
The technical problem to be solved by the invention is: to provide a method for tracking and analyzing the execution performance of a trino engine, so as to solve at least some of the above technical problems.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method of tracking and analyzing the performance of a trino engine, comprising the steps of:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database.
Further, the step 1 includes: step 11, initializing a kafka client; step 12, capturing and listening for a query-created event and a query-completed event based on the event listener of the trino engine coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client.
Further, in the step 12, the SQL query metadata corresponding to the query-completed event includes, but is not limited to, the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query.
Further, in the step 13, the collected metadata is parsed into the first JSON data by the query-completed interface function of the event listener.
Further, the step 2 includes: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; and step 25, automatically writing the second JSON data into the Hive database.
Further, the step 24 includes: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data.
Further, in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine.
Further, in the step 21, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
Compared with the prior art, the invention has the following beneficial effects:
the method and the device acquire SQL execution information based on the event monitoring mode provided by the trino engine, store the information into the external Hive database, store massive SQL information, and solve the problem that historical SQL execution records cannot be stored due to the memory problem caused by the trino engine calculation. The invention performs clustering processing based on SQL features, can identify the same type of SQL, and has high classification accuracy and high classification speed.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
Technical terms:
the trino engine is an open-source distributed SQL query engine suitable for interactive analytical queries;
JSON is a lightweight data-interchange format;
Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams;
kafka is a distributed message engine and stream-processing platform;
kafka-client is the official Kafka client library used to interact with Kafka;
the Kafka Connect component is a tool for scalable and reliable streaming of data between Apache Kafka and other data systems;
the SQL Parser component is a tool that performs lexical and syntactic analysis on SQL statements and then supports deconstructing, processing, replacing and reassembling them;
the Hive database is a Hadoop-based data warehouse tool used for extracting, transforming and loading data;
the MD5 algorithm is one of the hash algorithms, known as the Message-Digest Algorithm.
The present invention will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the big data field, the trino engine is generally used to run SQL queries against Hive data sources. An SQL query task is submitted to the Coordinator component through JDBC (Java Database Connectivity); the Coordinator component then parses the SQL to obtain the tables and fields used in the query, and distributes the decomposed tasks to worker nodes for concurrent execution. The execution results are finally returned to the Coordinator component and then to the client. The trino engine cannot keep all historical SQL execution data in memory and cannot identify SQL statements of the same type.
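For orientation only, the following Java sketch shows how a client might submit such a query to the Coordinator over JDBC. It assumes the Trino JDBC driver is on the classpath; the coordinator address, catalog, schema, user and query are placeholders, not values taken from the patent.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: submit an SQL query to the trino Coordinator over JDBC
// and read back the results produced by the worker nodes.
public class TrinoJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumed coordinator address and Hive catalog/schema.
        String url = "jdbc:trino://coordinator-host:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM userInfo")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```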
To this end, the method for tracking and analyzing the execution performance of a trino engine provided by the invention, as shown in FIG. 1, comprises the following steps:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database.
Based on the event-listening mechanism provided by the trino engine, SQL execution information is obtained and stored in an external Hive database, so that massive amounts of SQL information can be retained; clustering based on SQL features makes it possible to identify SQL statements of the same type. The method thus solves the problem that historical SQL execution records cannot be kept because of the memory pressure created by existing trino engine computation, and it achieves high classification accuracy and high classification speed.
In some embodiments, the step 1 includes: step 11, initializing a kafka client, which is used to send the collected data to the kafka cluster; step 12, capturing and listening for a query-created event (QueryCreatedEvent) and a query-completed event (QueryCompletedEvent) based on the event listener (EventListener) of the trino engine Coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client. The event listener (EventListener) exposed by a Coordinator component node allows a user to capture the query-created event (QueryCreatedEvent) and the query-completed event (QueryCompletedEvent). To this end, the invention uses the EventListener interface to capture and listen for the QueryCreatedEvent and the QueryCompletedEvent, and then collects the SQL query metadata corresponding to the query-completed event in the trino engine, where the query metadata includes but is not limited to the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query. Preferably, the collected metadata is parsed into the first JSON data through the query-completed (queryCompleted) interface function of the event listener (EventListener). Finally, the first JSON data is sent into the kafka cluster with the kafka client for subsequent consumption by the SQL clustering service.
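A minimal sketch of such a listener is given below. It assumes Trino's EventListener SPI (accessor names follow recent Trino releases and may differ across versions) and the Apache Kafka producer client; the topic name, broker address and JSON field names are assumptions made for the example, and registration of the listener through an EventListenerFactory plugin is omitted.

```java
import io.trino.spi.eventlistener.EventListener;
import io.trino.spi.eventlistener.QueryCompletedEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Sketch of step 1: capture query-completed events and forward them to Kafka as JSON
// (the "first JSON data"). Registration via an EventListenerFactory plugin is omitted.
public class QueryMetadataListener implements EventListener {

    private static final String TOPIC = "trino-query-events";   // assumed topic name
    private final KafkaProducer<String, String> producer;       // step 11: kafka client

    public QueryMetadataListener() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");            // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {      // step 12: query-completed event
        // Step 13: build the first JSON data from the metadata named in the patent.
        String json = String.format(
                "{\"queryId\":\"%s\",\"user\":\"%s\",\"sqlText\":%s,"
                        + "\"createTime\":\"%s\",\"endTime\":\"%s\"}",
                event.getMetadata().getQueryId(),
                event.getContext().getUser(),
                quote(event.getMetadata().getQuery()),
                event.getCreateTime(),
                event.getEndTime());
        // Step 14: send the first JSON data into the kafka cluster; send() is asynchronous,
        // so the listener does not block query processing on the Coordinator.
        producer.send(new ProducerRecord<>(TOPIC, event.getMetadata().getQueryId(), json));
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```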
In some embodiments, the step 2 includes: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; and step 25, automatically writing the second JSON data into the Hive database.
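A rough Java sketch of steps 21 to 24 is shown below. It assumes the Flink DataStream API with the Kafka connector and Trino's SqlParser; the topic, consumer group and JSON field names are placeholders, and SqlFeatureExtractor is a hypothetical helper whose own sketch follows the description of step 24 below. The Hive sink of step 25 is sketched separately.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import io.trino.sql.parser.SqlParser;
import io.trino.sql.tree.Statement;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch of steps 21-24: ingest the first JSON data from Kafka, parse the SQL text with
// Trino's SqlParser inside a Map operator, and attach the SQL feature fingerprint.
public class SqlClusteringJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 21: Kafka source for the first JSON data (assumed topic and group id).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("trino-query-events")
                .setGroupId("sql-clustering")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> firstJson =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "trino-events");

        // Steps 22-24: Map operator holding one SqlParser instance per task.
        DataStream<String> secondJson = firstJson.map(new RichMapFunction<String, String>() {
            private transient SqlParser parser;
            private transient ObjectMapper mapper;

            @Override
            public void open(Configuration parameters) {
                parser = new SqlParser();                 // step 22: initialize the parser instance
                mapper = new ObjectMapper();
            }

            @Override
            public String map(String value) throws Exception {
                ObjectNode node = (ObjectNode) mapper.readTree(value);
                String sql = node.get("sqlText").asText();    // step 23: extract the query field
                // Older Trino versions require an additional ParsingOptions argument here.
                Statement ast = parser.createStatement(sql);
                // Step 24: feature fingerprint derived from the AST (hypothetical helper,
                // sketched after the description of step 24 below).
                node.put("sqlFeature", SqlFeatureExtractor.fingerprint(ast));
                return mapper.writeValueAsString(node);       // the second JSON data
            }
        });

        secondJson.print();   // the Hive sink of step 25 is sketched separately
        env.execute("trino-sql-clustering");
    }
}
```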
When an abstract syntax tree is parsed, every node corresponds to a node type, and each SQL engine maintains a type-mapping code table; the trino engine maintains mapping relations for hundreds of keywords. SQL features therefore need to be extracted based on the abstract syntax tree and the type mapping, after which the SQL features that represent variables are replaced and modified. For example, in the SQL query select age from user where name = 'xxxx', the value 'xxxx' is the SQL feature that represents a variable.
Preferably, the step 24 includes: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data. For example, consider the two SQL queries select username, age from userInfo where age > 20 and level > 5; and select username, age from userInfo where age > 30 and level > 10;. The content queried is the same, but the query conditions differ. For SQL clustering purposes the features of the two statements are identical, so after feature extraction both SQL queries are converted to the same 32-character feature string: 18aa7350edf82111e6d03abfd2fa1fba.
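The Java sketch below shows one possible realization of steps 241 to 243. The class name SqlFeatureExtractor, the set of literal node types treated as variables, and the use of Node.getChildren() for the depth-first traversal are illustrative assumptions, not the patent's exact implementation.

```java
import io.trino.sql.tree.DecimalLiteral;
import io.trino.sql.tree.DoubleLiteral;
import io.trino.sql.tree.Identifier;
import io.trino.sql.tree.LongLiteral;
import io.trino.sql.tree.Node;
import io.trino.sql.tree.StringLiteral;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Sketch of steps 241-243: walk the AST depth-first, replace literal values with "?",
// and hash the concatenated feature set into a 32-character MD5 fingerprint.
public final class SqlFeatureExtractor {

    public static String fingerprint(Node root) throws Exception {
        List<String> features = new ArrayList<>();    // step 241: feature set
        collect(root, features);                      // step 242: depth-first traversal
        String joined = String.join(" ", features);   // step 243: concatenate the features...
        return md5Hex(joined);                        // ...and hash them to 32 hex characters
    }

    private static void collect(Node node, List<String> features) {
        if (node instanceof StringLiteral
                || node instanceof LongLiteral
                || node instanceof DoubleLiteral
                || node instanceof DecimalLiteral) {
            features.add("?");                                   // literal value -> placeholder
        } else if (node instanceof Identifier) {
            features.add(((Identifier) node).getValue());        // keep table/column names
        } else {
            features.add(node.getClass().getSimpleName());       // keep the structural node type
        }
        for (Node child : node.getChildren()) {
            collect(child, features);
        }
    }

    private static String md5Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder(32);
        for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

With this normalization the two example queries above differ only in their literal values, so they collapse to the same feature string and are counted as one SQL class.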
Preferably, in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine. Preferably, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
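One way to wire the kafka-connector and the Hive sink is through Flink's Table API, as in the hedged sketch below. It assumes a registered HiveCatalog, an intermediate Kafka topic (here trino-query-features) carrying the second JSON data, and that the target table hive.perf_db.trino_sql_history already exists in Hive; the topic, database and column names are placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

// Sketch of the kafka-connector of step 21 and the Hive sink of step 25, using Flink SQL:
// a Kafka source table for the JSON data and an INSERT INTO an existing Hive table.
public class HiveSinkJob {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumed Hive catalog; the conf directory must contain hive-site.xml.
        HiveCatalog hive = new HiveCatalog("hive", "perf_db", "/etc/hive/conf");
        tEnv.registerCatalog("hive", hive);

        // Kafka source table for the second JSON data (assumed intermediate topic).
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE trino_query_features ("
                        + "  queryId STRING, `user` STRING, sqlText STRING,"
                        + "  sqlFeature STRING, createTime STRING, endTime STRING"
                        + ") WITH ("
                        + "  'connector' = 'kafka',"
                        + "  'topic' = 'trino-query-features',"
                        + "  'properties.bootstrap.servers' = 'kafka:9092',"
                        + "  'properties.group.id' = 'sql-clustering-sink',"
                        + "  'scan.startup.mode' = 'earliest-offset',"
                        + "  'format' = 'json')");

        // Hive sink of step 25: continuously write into the assumed, pre-existing Hive table.
        tEnv.executeSql(
                "INSERT INTO hive.perf_db.trino_sql_history "
                        + "SELECT queryId, `user`, sqlText, sqlFeature, createTime, endTime "
                        + "FROM trino_query_features");
    }
}
```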
After the collected SQL query metadata is sent to kafka and clustered, the second JSON data is automatically written into the Hive database. The data can then be analyzed, for example by grouping the records on the feature string to compute the historical execution speed, execution count and similar statistics for each class of SQL, which provides data references for performance optimization to operation-and-maintenance engineers and research-and-development engineers.
Finally, it should be noted that the above embodiments are merely preferred embodiments used to illustrate the technical solution of the invention, not to limit its scope of protection. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit of the corresponding technical solutions. Even where the main design concept and spirit of the invention are modified in an insubstantial manner, as long as the technical problem solved remains consistent with that of the invention, the result falls within the scope of protection of the invention; likewise, applying the technical solution of the invention directly or indirectly to other related technical fields also falls within the scope of protection of the invention.

Claims (5)

1. A method for tracking and analyzing execution performance of a trino engine, comprising the steps of:
step 1, based on the event-listening mechanism of a trino engine, collecting SQL query metadata and parsing the SQL query metadata into first JSON data;
step 2, clustering the first JSON data based on SQL features to generate second JSON data, and automatically writing the second JSON data into a Hive database;
the step 1 comprises: step 11, initializing a kafka client; step 12, capturing and listening for a query-created event and a query-completed event based on the event listener of the trino engine coordinator component, and collecting the SQL query metadata corresponding to the query-completed event in the trino engine; step 13, parsing the collected SQL query metadata into first JSON data; step 14, sending the first JSON data into the kafka cluster by using the kafka client;
the step 2 comprises: step 21, ingesting the first JSON data in the kafka cluster into a Flink stream computation engine; step 22, integrating the SQL Parser component of the trino engine in a Map operator and initializing an SQL Parser instance; step 23, extracting the query field from the first JSON data in the Map operator, and parsing the query field into an abstract syntax tree with the SQL Parser component; step 24, extracting SQL features based on the abstract syntax tree, and converting the SQL features that represent variables to obtain second JSON data; step 25, automatically writing the second JSON data into the Hive database;
the step 24 comprises: step 241, collecting the extracted SQL features to obtain a feature set; step 242, recursively traversing the abstract syntax tree in depth-first order and judging the type of each node one by one: if the type of the current node is a numeric type or a character-string type, replacing the text value of the current node with a question-mark character and putting it into the feature set as an SQL feature, otherwise putting the text value of the current node into the feature set as an SQL feature; and step 243, concatenating the SQL features of the feature set into a character string, converting the character string into a 32-character feature string with the MD5 algorithm, and adding the feature string to the first JSON data to obtain the second JSON data.
2. The method of claim 1, wherein in the step 12, the SQL query metadata corresponding to the query-completed event includes the start time of the query, the end time of the query, the SQL text of the query, the data tables involved in the query, and the user who initiated the query.
3. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 13, the collected metadata is parsed into the first JSON data by the query-completed interface function of the event listener.
4. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 25, the second JSON data is automatically written into the Hive database through the Hive sink of the Flink stream computation engine.
5. The method for tracking and analyzing the execution performance of a trino engine according to claim 1, wherein in the step 21, kafka is connected by using the kafka-connector component of the Flink stream computation engine, and the first JSON data in the kafka cluster is ingested into the Flink stream computation engine.
CN202311072998.1A 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine Active CN116795663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311072998.1A CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311072998.1A CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Publications (2)

Publication Number Publication Date
CN116795663A CN116795663A (en) 2023-09-22
CN116795663B true CN116795663B (en) 2023-12-08

Family

ID=88046206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311072998.1A Active CN116795663B (en) 2023-08-24 2023-08-24 Method for tracking and analyzing execution performance of trino engine

Country Status (1)

Country Link
CN (1) CN116795663B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579627B2 (en) * 2016-01-08 2020-03-03 Microsoft Technology Licensing, Llc Database operation using metadata of data sources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544314A (en) * 2022-10-09 2022-12-30 武汉众邦银行股份有限公司 Unstructured database system
CN115934789A (en) * 2022-11-24 2023-04-07 深圳航天智慧城市系统技术研究院有限公司 AST-based distributed SQL query method, AST-based distributed SQL query device and AST-based distributed SQL query storage medium
CN115659314A (en) * 2022-12-13 2023-01-31 合肥喆塔科技有限公司 Data service method based on mixed data
CN115905298A (en) * 2023-01-04 2023-04-04 上海通联金融服务有限公司 Presto query engine based on http data source and netty

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Scheme and practice of IPTV user viewing-behavior analysis based on MapR; Fang Ai; Zhang Yuzhong; Xu Xiong; Jin Duo; Telecommunications Science (Issue 02); 143-148 *

Also Published As

Publication number Publication date
CN116795663A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN111526060B (en) Method and system for processing service log
US11120344B2 (en) Suggesting follow-up queries based on a follow-up recommendation machine learning model
US11914588B1 (en) Determining a user-specific approach for disambiguation based on an interaction recommendation machine learning model
US11941034B2 (en) Conversational database analysis
CN102609451B (en) SQL (structured query language) query plan generation method oriented to streaming data processing
CN109614432B (en) System and method for acquiring data blood relationship based on syntactic analysis
US11170016B2 (en) Navigating hierarchical components based on an expansion recommendation machine learning model
JP2017515180A (en) Processing data sets in big data repositories
CN104899314A (en) Pedigree analysis method and device of data warehouse
US20110307472A1 (en) System and method for managing metadata in a relational database
US11494395B2 (en) Creating dashboards for viewing data in a data storage system based on natural language requests
US11416477B2 (en) Systems and methods for database analysis
CN109710767B (en) Multilingual big data service platform
US20190034247A1 (en) Creating alerts associated with a data storage system based on natural language requests
CN114218472A (en) Intelligent search system based on knowledge graph
WO2022134363A1 (en) Service report integration method and apparatus, device, and storage medium
CN113326261B (en) Data blood relationship extraction method and device and electronic equipment
KR20150023973A (en) The system for collecting and analyzing of information of social network
CN110580170B (en) Method and device for identifying software performance risk
CN112182025A (en) Log analysis method, device, equipment and computer readable storage medium
CN116795663B (en) Method for tracking and analyzing execution performance of trino engine
US20220100742A1 (en) Lossless Switching Between Search Grammars
CN116126950A (en) Real-time materialized view system and method
CN113010483A (en) Mass log management method and system
Spanos et al. SensorStream: A semantic real–time stream management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant