CN109739882B

CN109739882B - Big data query optimization method based on Presto and Elasticissearch

Info

Publication number: CN109739882B
Application number: CN201910010085.4A
Authority: CN
Inventors: 洪灿榕
Original assignee: Linewell Software Co Ltd
Current assignee: Linewell Software Co Ltd
Priority date: 2019-01-04
Filing date: 2019-01-04
Publication date: 2022-06-10
Anticipated expiration: 2039-01-04
Also published as: CN109739882A

Abstract

The invention relates to a big data query optimization method based on Presto and Elasticissearch, which stores all queried data in an Elasticissearch cluster, receives and analyzes an SQL request through the Presto cluster to generate a corresponding abstract syntax tree and an execution plan, converts a query condition into an Elasticissearch query request and issues the Elasticissearch query request to the Elasticissearch for pre-query. The invention utilizes the advantages of Presto SQL analysis and Elasticissearch quick query to pre-filter partial data as much as possible before retrieving the data, thereby greatly reducing the reading of redundant data and improving the performance of conditional query.

Description

Big data query optimization method based on Presto and Elasticissearch

Technical Field

The invention relates to a quick query method of big data, and particularly provides a big data query optimization method based on Presto and Elasticissearch.

Background

The Elasticissearch is a search engine built in the full text, Apache Lucene^TMThe distributed search engine based on real-time analysis uses Lucene as a core to realize all the functions of indexing and searching, so that the content of each document can be indexed, searched, sorted and filtered. However, the elastic search lacks the traditional SQL syntax support, so that developers are difficult to use, and data migration and docking work of a system based on a relational database is difficult to perform. While Presto can provide basic SQL syntax support for the Elasticsearch, but the memory-based query mechanism also needs to pre-read almost the whole amount of target data into the cluster memory, and the data required by the end user is only dozens or even several of massive data, so that a large amount of server resources and time are consumed in the process for reading and filtering redundant data, and the query efficiency is greatly reduced.

Disclosure of Invention

The invention aims to provide a big data query optimization method based on Presto and Elasticissearch, which uses a Presto cluster to receive and analyze SQL, and utilizes the characteristics of Elasticissearch quick query and filtering to pre-filter partial data in the Elasticissearch cluster, thereby reducing reading of redundant data and realizing quick query of data.

In order to achieve the purpose, the invention adopts the technical scheme that:

a big data query optimization method based on Presto and Elasticissearch stores the queried data in the index of Elasticissearch cluster, when the user submits SQL query request to Presto cluster, the following steps are executed:

step 1, a Presto cluster receives and analyzes an SQL query request and generates a corresponding abstract syntax tree, and a query execution plan tree is generated according to the syntax tree, wherein the query execution plan tree is of a structure without branches from top to bottom;

the SQL query request comprises one or more of a where clause, an order by clause and a limit clause;

step 2, traversing, analyzing and executing each node in the plan tree, specifically as follows:

firstly, finding out a node supported by an Elasticsearch in an execution plan tree, wherein the name and the corresponding SQL operation of the node are respectively as follows:

Node name	Function(s)	Corresponding SQL operations
			FilterNode	Screening data	where clause
TopNNode	Sorting takes the first N rows	order by clause + limit
			LimitNode	Summary top N lines	limit clause
TableScanNode	Reading all data in the table	Single select query

Then, adding a Query parameter in the tablescan node for issuing a Query condition;

and finally, traversing all nodes in the query plan tree, and reading corresponding conditions according to the node types in the traversing process to issue:

when finding the FilterNode, indicating that the SQL Query request comprises a where clause, acquiring the content of the where clause through the filtering condition of the FilterNode, and transmitting the content to a Query in a head node TableSecanNode;

when TopNNode is found, the SQL Query request is shown to contain an order by clause and a limit clause at the same time, and then the contents of the order by clause and the limit clause are obtained through the count, the offset and the order by in the TopNNode and are transmitted to the Query in the first node TableScan node;

when a limit node is found, the SQL is shown to be used independently, and page-turning parameters are obtained through the count and the offset in the limit node and are transmitted to the Query in the first node TableSecanNode;

when finding a tablescan node, it indicates that all nodes of the execution plan have been traversed, and Query in the tablescan node already contains all Query conditions;

Step 3, the Presto cluster acquires Query parameters from the TableSecanNode and generates corresponding Query request parameters according to the parameter contents, wherein the where clause is converted into a filtering condition in an elastic search Query builder, and the order by and limit clauses are converted into a sequencing and data limiting condition in the Query builder;

step 4, sending the query request generated in the step 3 to an Elasticissearch cluster, performing query and pre-filtering on the Elasticissearch cluster, and returning an obtained result to the Presto cluster;

and 5, summarizing the data returned by each node by the Presto cluster, and generating a final result to the user.

After the scheme is adopted, all the queried data are stored in an Elasticissearch cluster, then a corresponding abstract syntax tree and a query plan are generated by receiving and analyzing an SQL request through a Presto cluster, the query request corresponding to the Elasticissearch is generated by analyzing the abstract syntax tree, on the premise of being supported by the Elasticissearch, the query condition is converted into the query request which can be identified by the Elasticissearch as much as possible, the execution plan is modified, the query condition is issued to the Elasticissearch, most data are filtered in advance by the Elasticissearch, the returned data basically meet the query condition requirements, and then the data are collected uniformly by the Presto finally obtain the result data. The invention utilizes the characteristics of fast query and filtration of the Elasticissearch to pre-filter partial data in the Elasticissearch cluster, reduces the reading of redundant data and realizes the fast query of the data.

Drawings

FIG. 1 is an overall system architecture of the present invention;

FIG. 2 is a query execution plan tree and process flow generated by SQL queries in Presto cluster.

Detailed Description

The invention discloses a big data query optimization method based on Presto and Elasticissearch, as shown in FIG. 1, the required system comprises a Presto cluster and an Elasticissearch cluster, a user can submit an SQL query request to the Presto cluster through various modes such as a command line tool, a jdbc client or a graphical interface development tool, the SQL request is received and analyzed by the Presto cluster, and then the query request is issued to the Elasticissearch cluster and data is read.

According to the big data query optimization method based on Presto and Elasticissearch, queried data are stored in the index of the Elasticissearch cluster, as shown in FIG. 2, when a user submits an SQL query request to the Presto cluster, the following steps are executed:

step 1, the Presto cluster receives and analyzes the SQL query request and generates a corresponding abstract syntax tree, and generates a query execution plan tree according to the syntax tree, wherein the query execution plan tree is of a structure without branches from top to bottom.

Wherein the SQL query request contains one or more of a where clause, an order by clause and a limit clause, for example

SELECT ID,NAME FROM EMPLOYEE WHERE ID>3 ORDER BY ID LIMIT 100,10。

When a tablescan node is found, it means that all nodes executing the plan have been traversed, and Query in the tablescan node already contains all Query conditions.

Step 3, the Presto cluster acquires Query parameters from the tableScanNode and generates corresponding Query request parameters according to the parameter contents, wherein the where clause is converted into a filtering condition in an elastic search Query builder, and the order by and limit clauses are converted into a sorting and data limiting condition in the Query builder;

and 4, sending the query request query builder generated in the step 3 to the Elasticissearch cluster, converting the query and sequencing conditions into an Elasticissearch request, then querying and pre-filtering the Elasticissearch cluster, and returning the obtained result to the Presto cluster.

The invention stores all the inquired data in an Elasticissearch cluster, then receives and analyzes an SQL request through a Presto cluster to generate a corresponding abstract syntax tree and an inquiry plan, generates the corresponding inquiry request of the Elasticissearch by analyzing the abstract syntax tree, converts the inquiry condition into the inquiry request which can be identified by the Elasticissearch as far as possible under the premise of being supported by the Elasticissearch, modifies an execution plan, sends the inquiry condition to the Elasticissearch, filters most data in advance by the Elasticissearch, returns the data which basically meets the requirement of the inquiry condition, and then is collected uniformly by Presto finally obtain the result data. When Presto queries one index in the elastic search, all data in the index needs to be read out (such as sorting, aggregation, and the like), while one index in the elastic search usually stores a large amount of data, reading all data out completely takes a long time, server resources consume a very large amount of resources, and only hundreds of data or even tens of data actually used may be used, which results in a very large waste of time and resources. The invention utilizes the characteristics of fast query and filtration of the Elasticissearch to pre-filter partial data in the Elasticissearch cluster, reduces the reading of redundant data and realizes the fast query of the data.

The above description is only exemplary of the present invention and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above exemplary embodiments according to the technical spirit of the present invention are within the technical scope of the present invention.

Claims

1. A big data query optimization method based on Presto and Elasticissearch is characterized by comprising the following steps: the optimization method stores the queried data in the index of the Elasticissearch cluster, and executes the following steps when a user submits an SQL query request to the Presto cluster:

and finally, traversing all nodes in the query plan tree, reading corresponding conditions according to the node types in the traversing process, and issuing the conditions:

step 4, sending the query request generated in the step 3 to an Elasticisearch cluster, performing query and pre-filtering on the Elasticisearch cluster, and returning an obtained result to the Presto cluster;