CN107748766A

CN107748766A - A fast big data query method based on Presto and Elasticsearch

Info

Publication number: CN107748766A
Application number: CN201710900970.0A
Authority: CN
Inventors: 洪灿榕; 吴晓梅; 李明溪; 蔡炜榕
Original assignee: Linewell Software Co Ltd
Current assignee: Linewell Software Co Ltd
Priority date: 2017-09-28
Filing date: 2017-09-28
Publication date: 2018-03-02
Anticipated expiration: 2037-09-28
Also published as: CN107748766B

Abstract

The present invention relates to a fast big data query method based on Presto and Elasticsearch. All data to be queried contains a time field and is stored in an Elasticsearch cluster in one index per day. A Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page of data are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The invention quickly locates the position of the target page data and greatly reduces the reading of redundant data, thereby improving the performance of random page-jump queries.

Description

A fast big data query method based on Presto and Elasticsearch

Technical field

The present invention relates to fast query methods for big data, and specifically provides a fast big data query method based on Presto and Elasticsearch.

Background art

Elasticsearch is a real-time distributed search and analytics engine built on top of the full-text search engine library Apache Lucene™. It uses Lucene as its core to implement all indexing and search functions, so that the content of every document can be indexed, searched, sorted and filtered. It also provides rich aggregation functions that allow multi-dimensional analysis of data. However, Elasticsearch lacks support for traditional SQL syntax, which makes it harder for developers to use and makes data migration and integration difficult for systems built on relational databases. In particular, when users run queries and page through results, complex queries keep the Elasticsearch servers under high load for long periods; moreover, a random, long-range page jump has to skip over a large number of pages, reading a great deal of data that is never used. Presto can provide basic SQL syntax support for Elasticsearch, but its memory-based query mechanism also has to read a large amount of target data into cluster memory in advance, consuming substantial server resources and time, so it does not fundamentally solve the problem either.

Summary of the invention

The object of the invention is to provide a fast big data query method based on Presto and Elasticsearch. It uses a Presto cluster to receive and parse SQL and, in combination with the Date Histogram Aggregation statistics function of Elasticsearch, locates the target page data step by step, achieving fast data queries.

To achieve the above object, the technical solution adopted by the present invention is:

A fast big data query method based on Presto and Elasticsearch, in which all data to be queried contains a time field and is stored in an Elasticsearch cluster in one index per day. A Presto cluster receives and parses the SQL request and generates the corresponding query plan, and the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page of data are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster.
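As an illustration of the per-day index layout, the following is a minimal Python sketch and not part of the patented method itself. It assumes that index names follow the yyyy-MM-dd pattern of the example index 2017-01-01 given in the embodiment; the function names index_name_for and index_range are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def index_name_for(ts: datetime) -> str:
    """Name of the daily index that would hold a record with timestamp ts.

    Assumption: one index per day, named yyyy-MM-dd.
    """
    return ts.strftime("%Y-%m-%d")


def index_range(start: datetime, end: datetime) -> List[str]:
    """Daily index names covering the inclusive time range [start, end]."""
    day, last = start.date(), end.date()
    names = []
    while day <= last:
        names.append(day.strftime("%Y-%m-%d"))
        day += timedelta(days=1)
    return names
```

A time condition in the WHERE clause can thus be mapped to the list of daily indexes that need to be counted, without touching any other index.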

The fast big data query method specifically comprises the following steps:

Step 1: the Presto cluster receives and parses the SQL request and generates the corresponding query plan, obtaining the index range in the Elasticsearch cluster of the data satisfying the query conditions, together with the total number of records satisfying the query conditions;

Step 2: from the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indexes contain the target data page; the record counts of the skipped indexes are deducted directly from OFFSET, giving a new data offset OFFSET_1;

Step 3: through the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target data page obtained in step 2, obtaining the number of records satisfying the conditions in each time period, as follows:

Step 3.1: count the records of the index containing the target data page obtained in step 2 in units of hours; using the data offset OFFSET_1 obtained in step 2, calculate which hour period contains the data to be queried; the record counts of the skipped hour periods are deducted directly from OFFSET_1, giving a new data offset OFFSET_2; compare OFFSET_2 with the system preset threshold M;

If OFFSET_2 is less than or equal to M, add the time range of that hour period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;

Step 3.2: count the records of the hour period obtained in step 3.1 in units of minutes; using the data offset OFFSET_2, calculate which minute period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_2, giving a new data offset OFFSET_3; compare OFFSET_3 with the system preset threshold M;

If OFFSET_3 is less than or equal to M, add the time range of that minute period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;

Step 3.3: count the records of the minute period obtained in step 3.2 in units of seconds; using the data offset OFFSET_3, calculate which second period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_3, giving a new data offset OFFSET_4; add the time range of that second period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result. COUNT is the number of records per page.
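The stepwise deduction of steps 2 to 3.3 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: an in-memory list of timestamps stands in for the Elasticsearch index of one day, and the names skip_buckets, histogram, locate and LEVELS are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

# Granularities tried in order: hour, minute, second (steps 3.1 to 3.3).
LEVELS = [timedelta(hours=1), timedelta(minutes=1), timedelta(seconds=1)]


def skip_buckets(counts: Sequence[int], offset: int) -> Tuple[int, int]:
    """Deduct the counts of fully skipped buckets from offset.

    Returns (position of the bucket holding the first target record,
    remaining offset inside that bucket). The same walk applies to
    per-index counts in step 2 and to per-period counts in step 3.
    """
    for pos, n in enumerate(counts):
        if offset < n:
            return pos, offset
        offset -= n
    raise ValueError("offset is beyond the last matching record")


def histogram(timestamps, start, end, step) -> List[int]:
    """In-memory stand-in (assumption) for a date histogram over [start, end)."""
    counts = [0] * ((end - start) // step)
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // step] += 1
    return counts


def locate(timestamps, day_start: datetime, offset_1: int, threshold_m: int):
    """Narrow the time window until the remaining offset is at most M,
    or until the one-second level has been processed."""
    start, end, offset = day_start, day_start + timedelta(days=1), offset_1
    for step in LEVELS:
        counts = histogram(timestamps, start, end, step)
        pos, offset = skip_buckets(counts, offset)
        start = start + pos * step
        end = start + step
        if offset <= threshold_m:
            break
    return start, end, offset
```

For example, with 100 records in every hour of the day and OFFSET_1 = 250, the hour-level pass skips the first two hours and leaves a remaining offset of 50; with M = 60 the search stops at the period 02:00 to 03:00, so only 50 + COUNT records have to be read instead of 250 + COUNT.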

In the present invention, a Presto cluster receives and parses the SQL request and generates the corresponding query plan, and the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page of data are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The position of the target page data is thereby located quickly and the reading of redundant data is greatly reduced, improving the performance of random page-jump queries.

Brief description of the drawings

Fig. 1 is a framework diagram of the system of the present invention;

Fig. 2 is a flow chart of target page data location according to the present invention.

Embodiment

As shown in Fig. 1 and Fig. 2, the present invention discloses a fast big data query method based on Presto and Elasticsearch, which comprises the following steps:

Step 1: the data to be queried contains at least one field of timestamp type and is stored across Elasticsearch indexes partitioned by day; for example, a record with timestamp 2017-01-01 08:00:00 is stored in the index named 2017-01-01;

Step 2: the user submits an SQL query to the Presto server; the SQL must have the following form:

SELECT COLUMN1, COLUMN2... FROM TABLE [WHERE COLLECTTIME >, <, = 'yyyy-MM-dd HH:mm:ss' AND COLUMN1 = 'XXX'] ORDER BY COLLECTTIME [, COLUMN3, COLUMN4...] [LIMIT OFFSET, LIMIT]

The server parses the SQL and generates the corresponding query plan, thereby determining which Elasticsearch indexes hold the data the user needs and obtaining the index range of the target data;

Step 3: for the index range obtained in step 2, using the query conditions submitted in the SQL, query each index separately for the total number of records satisfying the conditions;

Step 4: from the per-index record counts obtained in step 3, together with the data offset OFFSET in the SQL submitted by the user and the number of records per page COUNT, determine which index or indexes contain the target data page; the record counts of the skipped indexes are deducted directly from OFFSET, giving a new offset OFFSET_1;

Step 5: through the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target data page obtained in step 4, obtaining the number of records satisfying the conditions in each time period, as follows:

Step 5.1: count in units of hours; since an index contains only one day of data, this gives the record counts of the 24 hourly periods of that day. Using OFFSET_1 obtained in step 4, calculate which period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_1, giving a new OFFSET_2; compare OFFSET_2 with the system preset threshold M;

If OFFSET_2 is less than or equal to M, add the time range of that period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;

Step 5.2: using the hour period obtained in the previous step as a query condition, count the data in that period in units of minutes; since an hour has 60 minutes, this gives the number of matching records for each minute of that hour. Using OFFSET_2 obtained in step 5.1, calculate which minute period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_2, giving a new OFFSET_3; compare OFFSET_3 with the system preset threshold M;

If OFFSET_3 is less than or equal to M, add the time range of that period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;

Step 5.3: using the minute period obtained in the previous step as a query condition, count the data in that period in units of seconds; since a minute has 60 seconds, this gives the number of matching records for each second of that minute. Using OFFSET_3 obtained in step 5.2, calculate which second period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_3, giving a new OFFSET_4; add the time range of that period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result.
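The counting requests of steps 5.1 to 5.3 might be built as in the following Python sketch, which only constructs the request body and reads the counts from a response. It rests on several assumptions that the patent does not state: the time field is named COLLECTTIME as in the SQL form above, the aggregation name per_bucket is hypothetical, and the date_histogram parameter is called interval, as in Elasticsearch releases before 8.0 (later releases use calendar_interval or fixed_interval instead).

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def histogram_request(field: str, start: datetime, end: datetime,
                      interval: str,
                      filters: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Search body that counts matching records per time bucket in [start, end).

    filters carries the non-time conditions of the original WHERE clause,
    already expressed as Elasticsearch filter clauses (assumption).
    """
    time_range = {"range": {field: {"gte": start.isoformat(),
                                    "lt": end.isoformat()}}}
    return {
        "size": 0,  # only the bucket counts are needed, not the hits
        "query": {"bool": {"filter": [time_range] + list(filters or [])}},
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


def bucket_counts(response: Dict[str, Any]) -> List[int]:
    """Extract the per-bucket record counts from a search response."""
    buckets = response["aggregations"]["per_bucket"]["buckets"]
    return [b["doc_count"] for b in buckets]
```

The same body can be reused at each level by shrinking the range to the period found in the previous step and changing interval from "hour" to "minute" and then "second".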

In the present invention, all data to be queried contains a time field and is stored in an Elasticsearch cluster in one index per day. A Presto cluster receives and parses the SQL request and generates the corresponding query plan, and the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page of data are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The invention quickly locates the position of the target page data and greatly reduces the reading of redundant data, thereby improving the performance of random page-jump queries.
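The final read, in which the located time interval is added to the original query and the tail of the result is kept, can be sketched as follows. This is a hypothetical Python illustration: the helper names read_window_sql and page_from_window are assumptions, the SQL text follows the form given in step 2 of the embodiment rather than any particular Presto dialect detail, and the row list stands in for the rows returned by the Elasticsearch cluster.

```python
from typing import List, Sequence


def read_window_sql(columns: Sequence[str], table: str, time_field: str,
                    start: str, end: str, offset_k: int, count: int,
                    extra_where: str = "") -> str:
    """Original query restricted to [start, end), reading offset_k + count rows."""
    where = f"{time_field} >= '{start}' AND {time_field} < '{end}'"
    if extra_where:
        where += f" AND {extra_where}"
    return (f"SELECT {', '.join(columns)} FROM {table} "
            f"WHERE {where} ORDER BY {time_field} "
            f"LIMIT {offset_k + count}")


def page_from_window(rows: Sequence, offset_k: int, count: int) -> List:
    """Drop the first offset_k rows of the window and keep the next count rows.

    When exactly offset_k + count rows were read, this equals keeping
    the last count rows, as described in the method.
    """
    return list(rows[offset_k:offset_k + count])
```

Because the remaining offset is bounded by the threshold M (or by the record count of a single second), the number of rows read is on the order of M + COUNT rather than OFFSET + COUNT.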

The above is merely an embodiment of the present invention and does not limit its technical scope; any minor modification, equivalent change or refinement made to the above embodiment in accordance with the technical essence of the present invention therefore still falls within the scope of the technical solution of the present invention.

Claims

1. A fast big data query method based on Presto and Elasticsearch, characterized in that: all data to be queried contains a time field and is stored in an Elasticsearch cluster in one index per day; a Presto cluster then receives and parses the SQL request and generates the corresponding query plan, and the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained; through stepwise counting and calculation, the index and time interval containing the target page of data are located; the time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster.
2. The fast big data query method based on Presto and Elasticsearch according to claim 1, characterized in that the method specifically comprises the following steps:

Step 1: the Presto cluster receives and parses the SQL request and generates the corresponding query plan, obtaining the index range in the Elasticsearch cluster of the data satisfying the query conditions, together with the total number of records satisfying the query conditions;

Step 2: from the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indexes contain the target data page; the record counts of the skipped indexes are deducted directly from OFFSET, giving a new data offset OFFSET_1;

Step 3: through the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target data page obtained in step 2, obtaining the number of records satisfying the conditions in each time period, as follows:

Step 3.1: count the records of the index containing the target data page obtained in step 2 in units of hours; using the data offset OFFSET_1, calculate which hour period contains the data to be queried; the record counts of the skipped hour periods are deducted directly from OFFSET_1, giving a new data offset OFFSET_2; compare OFFSET_2 with the system preset threshold M;

If OFFSET_2 is less than or equal to M, add the time range of that hour period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;

Step 3.2: count the records of the hour period obtained in step 3.1 in units of minutes; using the data offset OFFSET_2, calculate which minute period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_2, giving a new data offset OFFSET_3; compare OFFSET_3 with the system preset threshold M;

If OFFSET_3 is less than or equal to M, add the time range of that minute period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;

Step 3.3: count the records of the minute period obtained in step 3.2 in units of seconds; using the data offset OFFSET_3, calculate which second period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_3, giving a new data offset OFFSET_4; add the time range of that second period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result;

COUNT is the number of records per page.