CN107748766A - A fast big data query method based on Presto and Elasticsearch - Google Patents
A fast big data query method based on Presto and Elasticsearch
- Publication number
- CN107748766A (application number CN201710900970.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- offset
- elasticsearch
- clusters
- presto
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a fast big data query method based on Presto and Elasticsearch. All data to be queried contains a time field and is stored in an Elasticsearch cluster in per-day indexes. A Presto cluster receives and parses the SQL request and generates a corresponding query plan, from which the range of indexes in the Elasticsearch cluster that hold data satisfying the query plan is obtained. Through step-by-step counting and calculation, the index and time interval containing the target page are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The invention quickly locates the position of the target page and greatly reduces the reading of redundant data, improving the performance of random page-jump queries.
Description
Technical field
The present invention relates to fast query methods for big data, and specifically provides a fast big data query method based on Presto and Elasticsearch.
Background
Elasticsearch is a real-time distributed search and analytics engine built on top of the full-text search engine Apache Lucene™. It uses Lucene at its core for all indexing and search functions, so that the content of every document can be indexed, searched, sorted, and filtered. It also provides rich aggregation functions for multi-dimensional analysis of data. However, Elasticsearch lacks support for traditional SQL syntax, which makes it harder for developers to use and makes data migration and integration difficult for systems built on relational databases. In particular, when users query and page through results, complex queries keep the Elasticsearch servers under high load for long periods, and a random, long-range page jump must skip a large number of pages, so a great deal of data is read that is never used. Presto can provide basic SQL syntax support for Elasticsearch, but its memory-based query mechanism also requires a large amount of target data to be read into cluster memory in advance, which consumes substantial server resources and time, so it does not fundamentally solve the problem either.
Summary of the invention
The object of the invention is to provide a fast big data query method based on Presto and Elasticsearch. It uses a Presto cluster to receive and parse SQL and, combined with the Date Histogram Aggregation statistics function of Elasticsearch, progressively locates the target page so that data can be queried quickly.
To achieve the above object, the technical solution adopted by the invention is:
A fast big data query method based on Presto and Elasticsearch, in which all data to be queried contains a time field and is stored in an Elasticsearch cluster in per-day indexes. A Presto cluster receives and parses the SQL request and generates a corresponding query plan, from which the range of indexes in the Elasticsearch cluster that hold data satisfying the query plan is obtained. Through step-by-step counting and calculation, the index and time interval containing the target page are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster.
The fast big data query method specifically comprises the following steps:
Step 1: The Presto cluster receives and parses the SQL request and generates a corresponding query plan, obtaining the range of indexes in the Elasticsearch cluster that hold data satisfying the query conditions, together with the total number of records satisfying the query conditions.
Step 2: From the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indexes contain the target page. The record counts of the skipped indexes are deducted directly from OFFSET, giving a new data offset OFFSET_1.
Step 3: Using the Date Histogram Aggregation interface of Elasticsearch, count the records in the index containing the target page (obtained in step 2) by time segment, obtaining the number of records that satisfy the conditions in each period, as follows:
Step 3.1: Count the records in the index containing the target page (obtained in step 2) by hour. Using the data offset OFFSET_1 from step 2, calculate which hour period contains the data to be queried. The record counts of the skipped hour periods are deducted directly from OFFSET_1, giving a new data offset OFFSET_2, which is compared with a system-preset threshold M.
If OFFSET_2 is less than or equal to M, the time range of that hour period is added to the query conditions of the original SQL statement, OFFSET_2+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result. If OFFSET_2 is greater than M, proceed to the next step.
Step 3.2: Count the records in the hour period obtained in step 3.1 by minute. Using the data offset OFFSET_2, calculate which minute period contains the data to be queried. The record counts of the skipped periods are deducted directly from OFFSET_2, giving a new data offset OFFSET_3, which is compared with the system-preset threshold M.
If OFFSET_3 is less than or equal to M, the time range of that minute period is added to the query conditions of the original SQL statement, OFFSET_3+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result. If OFFSET_3 is greater than M, proceed to the next step.
Step 3.3: Count the records in the minute period obtained in step 3.2 by second. Using the data offset OFFSET_3, calculate which second period contains the data to be queried. The record counts of the skipped periods are deducted directly from OFFSET_3, giving a new data offset OFFSET_4. The time range of that second period is added to the query conditions of the original SQL statement, OFFSET_4+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result. COUNT is the number of records per page.
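The offset deduction used in steps 2 and 3 can be illustrated with a short Python sketch. It is not taken from the patent: the function name `locate_bucket` and the sample counts are hypothetical, and the sketch only locates the bucket that holds the first row of the target page.

```python
from typing import Sequence, Tuple


def locate_bucket(counts: Sequence[int], offset: int) -> Tuple[int, int]:
    """Find the bucket containing the row at position `offset` (0-based).

    `counts` holds the number of matching records per bucket, in sort order
    (one bucket per daily index in step 2, per hour/minute/second in step 3).
    Returns (bucket position, offset remaining inside that bucket).
    """
    for position, count in enumerate(counts):
        if offset < count:
            return position, offset
        offset -= count  # this bucket is skipped entirely, so deduct it
    raise ValueError("offset is beyond the last matching record")


# Hypothetical record counts for three consecutive daily indexes.
index_counts = [100, 250, 80]
bucket, offset_1 = locate_bucket(index_counts, 320)
print(bucket, offset_1)  # prints: 1 220
```

The same function can be reused at every level: the remaining offset returned for the daily index plays the role of OFFSET_1, the one returned for the hour buckets plays the role of OFFSET_2, and so on.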
In the invention, the Presto cluster receives and parses the SQL request and generates a corresponding query plan, from which the range of indexes in the Elasticsearch cluster that hold data satisfying the query plan is obtained. Through step-by-step counting and calculation, the index and time interval containing the target page are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The position of the target page is thus located quickly and the reading of redundant data is greatly reduced, improving the performance of random page-jump queries.
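The per-period counts described above come from the Date Histogram Aggregation of Elasticsearch. The following Python sketch builds a request body of that kind as a plain dictionary. It is an illustration under assumptions, not the patent's implementation: the helper name `period_count_request` and the aggregation name `per_period` are hypothetical, and the `fixed_interval` parameter applies to Elasticsearch 7.x and later (older versions used `interval`).

```python
import json


def period_count_request(time_field: str, start: str, end: str,
                         interval: str = "1h") -> dict:
    """Build a search body that counts matching records per time period.

    `start`/`end` bound the period being subdivided; `interval` is the
    bucket width ("1h", "1m" or "1s" for the three levels of step 3).
    """
    return {
        "size": 0,  # only the bucket counts are needed, not the documents
        "query": {
            "bool": {
                "filter": [
                    {"range": {time_field: {
                        "gte": start,
                        "lt": end,
                        "format": "yyyy-MM-dd HH:mm:ss",
                    }}}
                    # the user's other WHERE conditions would be added here
                ]
            }
        },
        "aggs": {
            "per_period": {
                "date_histogram": {
                    "field": time_field,
                    "fixed_interval": interval,
                }
            }
        },
    }


body = period_count_request("COLLECTTIME",
                            "2017-01-01 00:00:00", "2017-01-02 00:00:00")
print(json.dumps(body, indent=2))
```

Each bucket in the response carries a document count, which is the figure deducted from the offset at that level.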
Brief description of the drawings
Fig. 1 is a system framework diagram of the invention;
Fig. 2 is a flow chart of target page location in the invention.
Detailed description of the embodiments
As shown in Fig. 1 and Fig. 2, the invention discloses a fast big data query method based on Presto and Elasticsearch, which comprises the following steps:
Step 1: The data to be queried contains at least one field of timestamp type and is stored in Elasticsearch indexes partitioned by day. For example, a record with timestamp 2017-01-01 08:00:00 is stored in the index named 2017-01-01.
Step 2: The user submits an SQL query to the Presto server. The SQL must have the following form:
SELECT COLUMN1, COLUMN2... FROM TABLE [WHERE COLLECTTIME >, <, = 'yyyy-MM-dd HH:mm:ss' AND COLUMN1 = 'XXX'] ORDER BY COLLECTTIME [, COLUMN3, COLUMN4...] [LIMIT OFFSET, LIMIT]
The server parses the SQL and generates a corresponding query plan, thereby learning which Elasticsearch indexes hold the data the user wants to query, and obtains the index range of the target data.
Step 3: Using the index range obtained in step 2 and the query conditions submitted in the SQL, query each index separately for the total number of records satisfying the conditions.
Step 4: From the per-index record counts obtained in step 3, together with the data offset OFFSET in the SQL submitted by the user and the number of records per page COUNT, determine which index or indexes contain the target page. The record counts of the skipped indexes are deducted directly from OFFSET, giving a new offset OFFSET_1.
Step 5: Using the Date Histogram Aggregation interface of Elasticsearch, count the records in the index containing the target page (obtained in step 4) by time segment, obtaining the number of records that satisfy the conditions in each period, as follows:
Step 5.1: Count by hour. Because one index contains only the data of one day, this yields the record counts of the 24 hour periods of that day. Using OFFSET_1 from step 4, calculate which period contains the data to be queried. The record counts of the skipped periods are deducted directly from OFFSET_1, giving a new OFFSET_2, which is compared with the system-preset threshold M.
If OFFSET_2 is less than or equal to M, the time range of that period is added to the query conditions of the original SQL statement, OFFSET_2+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result. If OFFSET_2 is greater than M, proceed to the next step.
Step 5.2: Taking the hour period obtained in the previous step as a query condition, count the data in that period by minute. Because one hour has 60 minutes, this yields the number of matching records for each minute of the hour. Using OFFSET_2 from step 5.1, calculate which minute period contains the data to be queried. The record counts of the skipped periods are deducted directly from OFFSET_2, giving a new OFFSET_3, which is compared with the system-preset threshold M.
If OFFSET_3 is less than or equal to M, the time range of that period is added to the query conditions of the original SQL statement, OFFSET_3+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result. If OFFSET_3 is greater than M, proceed to the next step.
Step 5.3: Taking the minute period obtained in the previous step as a query condition, count the data in that period by second. Because one minute has 60 seconds, this yields the number of matching records for each second of the minute. Using OFFSET_3 from step 5.2, calculate which second period contains the data to be queried. The record counts of the skipped periods are deducted directly from OFFSET_3, giving a new OFFSET_4. The time range of that period is added to the query conditions of the original SQL statement, OFFSET_4+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result.
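The hour, minute, and second drill-down of step 5 can be simulated end to end in Python. This is a sketch under stated assumptions rather than the patented implementation: an in-memory list of timestamps stands in for one daily Elasticsearch index, a `Counter` stands in for the Date Histogram Aggregation, and the names `drill_down` and `read_page` are hypothetical. The final read here uses only the lower bound of the located period, an assumption made so that a page crossing a period boundary is still returned in full.

```python
from collections import Counter
from datetime import datetime, timedelta

LEVELS = [timedelta(hours=1), timedelta(minutes=1), timedelta(seconds=1)]


def drill_down(timestamps, day_start, offset_1, threshold_m):
    """Narrow one day's data to the period holding the row at offset_1.

    Returns (period_start, remaining_offset).
    """
    start, end, offset = day_start, day_start + timedelta(days=1), offset_1
    for unit in LEVELS:
        # Stand-in for a date_histogram aggregation over [start, end).
        counts = Counter(
            start + ((t - start) // unit) * unit
            for t in timestamps
            if start <= t < end
        )
        for bucket in sorted(counts):
            if offset < counts[bucket]:
                start, end = bucket, bucket + unit
                break
            offset -= counts[bucket]  # deduct the skipped period
        else:
            raise ValueError("offset is beyond the data of this day")
        if offset <= threshold_m:
            break  # few enough rows remain to skip them by reading
    return start, offset


def read_page(timestamps, period_start, offset, count):
    """Read offset+count rows from period_start onward, keep the tail."""
    rows = sorted(t for t in timestamps if t >= period_start)[: offset + count]
    return rows[offset:]


day = datetime(2017, 1, 1)
data = [day + timedelta(seconds=7 * i) for i in range(10000)]
start, remaining = drill_down(data, day, offset_1=5432, threshold_m=100)
page = read_page(data, start, remaining, count=10)
# The result matches a naive read that skips all 5432 preceding rows.
assert page == sorted(data)[5432:5442]
```

In this sample the naive read touches 5442 rows, while the drill-down reads only `remaining + 10` rows after two small count queries, which is the saving the method aims for.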
In the invention, all data to be queried contains a time field and is stored in an Elasticsearch cluster in per-day indexes. A Presto cluster receives and parses the SQL request and generates a corresponding query plan, from which the range of indexes in the Elasticsearch cluster that hold data satisfying the query plan is obtained. Through step-by-step counting and calculation, the index and time interval containing the target page are located. The time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster. The invention quickly locates the position of the target page and greatly reduces the reading of redundant data, improving the performance of random page-jump queries.
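Because the data is stored in one index per day, the index range for a query follows directly from its time bounds. A minimal Python sketch, assuming the index naming shown in step 1 (for example 2017-01-01) and a hypothetical helper name `daily_indexes`:

```python
from datetime import datetime, timedelta


def daily_indexes(start: datetime, end: datetime) -> list:
    """List the per-day index names covering the range [start, end]."""
    names = []
    day = start.date()
    while day <= end.date():
        names.append(day.isoformat())  # e.g. "2017-01-01"
        day += timedelta(days=1)
    return names


print(daily_indexes(datetime(2017, 1, 1, 8, 0, 0),
                    datetime(2017, 1, 3, 12, 30, 0)))
# prints: ['2017-01-01', '2017-01-02', '2017-01-03']
```

Only these indexes need to be counted in step 3, so days outside the query's time bounds are never touched.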
The above is only an embodiment of the invention and does not limit its technical scope. Any minor modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the invention still falls within the scope of the technical solution of the invention.
Claims (2)
- 1. A fast big data query method based on Presto and Elasticsearch, characterized in that: all data to be queried contains a time field and is stored in an Elasticsearch cluster in per-day indexes; a Presto cluster then receives and parses the SQL request and generates a corresponding query plan, and the range of indexes in the Elasticsearch cluster that hold data satisfying the query plan is obtained; through step-by-step counting and calculation, the index and time interval containing the target page are located; the time interval is added to the query conditions of the original SQL statement, and the data of the target page is read from the Elasticsearch cluster.
- 2. The fast big data query method based on Presto and Elasticsearch according to claim 1, characterized in that the method specifically comprises the following steps:
  Step 1: The Presto cluster receives and parses the SQL request and generates a corresponding query plan, obtaining the range of indexes in the Elasticsearch cluster that hold data satisfying the query conditions, together with the total number of records satisfying the query conditions;
  Step 2: From the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indexes contain the target page; the record counts of the skipped indexes are deducted directly from OFFSET, giving a new data offset OFFSET_1;
  Step 3: Using the Date Histogram Aggregation interface of Elasticsearch, count the records in the index containing the target page (obtained in step 2) by time segment, obtaining the number of records that satisfy the conditions in each period, as follows:
  Step 3.1: Count the records in the index containing the target page (obtained in step 2) by hour; using the data offset OFFSET_1, calculate which hour period contains the data to be queried; the record counts of the skipped hour periods are deducted directly from OFFSET_1, giving a new data offset OFFSET_2, which is compared with a system-preset threshold M; if OFFSET_2 is less than or equal to M, the time range of that hour period is added to the query conditions of the original SQL statement, OFFSET_2+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result; if OFFSET_2 is greater than M, proceed to the next step;
  Step 3.2: Count the records in the hour period obtained in step 3.1 by minute; using the data offset OFFSET_2, calculate which minute period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_2, giving a new data offset OFFSET_3, which is compared with the system-preset threshold M; if OFFSET_3 is less than or equal to M, the time range of that minute period is added to the query conditions of the original SQL statement, OFFSET_3+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result; if OFFSET_3 is greater than M, proceed to the next step;
  Step 3.3: Count the records in the minute period obtained in step 3.2 by second; using the data offset OFFSET_3, calculate which second period contains the data to be queried; the record counts of the skipped periods are deducted directly from OFFSET_3, giving a new data offset OFFSET_4; the time range of that second period is added to the query conditions of the original SQL statement, OFFSET_4+COUNT records are read from the Elasticsearch cluster, and the last COUNT records are kept in the Presto cluster as the final query result; COUNT is the number of records per page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710900970.0A CN107748766B (en) | 2017-09-28 | 2017-09-28 | Big data fast query method based on Presto and elastic search |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710900970.0A CN107748766B (en) | 2017-09-28 | 2017-09-28 | Big data fast query method based on Presto and elastic search |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107748766A (en) | 2018-03-02 |
CN107748766B (en) | 2021-08-24 |
Family
ID=61255198
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710900970.0A | Active | CN107748766B (en) | 2017-09-28 | 2017-09-28 | Big data fast query method based on Presto and elastic search |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107748766B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109739882A (en) * | 2019-01-04 | 2019-05-10 | 南威软件股份有限公司 | A kind of big data enquiring and optimizing method based on Presto and Elasticsearch |
CN110321388A (en) * | 2019-02-26 | 2019-10-11 | 南威软件股份有限公司 | A kind of quicksort querying method and system based on Greenplum |
CN110688357A (en) * | 2018-06-20 | 2020-01-14 | 华为技术有限公司 | Method and device for reading log type data |
CN111125178A (en) * | 2018-10-30 | 2020-05-08 | 亿度慧达教育科技(北京)有限公司 | Data query method, device, terminal, presto query engine and storage medium |
CN112612827A (en) * | 2020-12-25 | 2021-04-06 | 平安国际智慧城市科技股份有限公司 | Database paging query method and device, computer equipment and storage medium |
CN112650779A (en) * | 2021-01-12 | 2021-04-13 | 浪潮云信息技术股份公司 | Cloud auditing method based on ElasticSearch supporting deep page jump query |
CN113961573A (en) * | 2021-12-23 | 2022-01-21 | 北京力控元通科技有限公司 | Time sequence database query method and query system |
CN114072788A (en) * | 2019-07-02 | 2022-02-18 | 国际商业机器公司 | Random sampling from search engine |
CN114138773A (en) * | 2021-10-13 | 2022-03-04 | 浙江中控技术股份有限公司 | Rapid page turning method for database |
-
2017
- 2017-09-28 CN CN201710900970.0A patent/CN107748766B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007079303A2 (en) * | 2005-12-29 | 2007-07-12 | Amazon Technologies, Inc. | Method and apparatus for a distributed file storage and indexing service |
CN102880685A (en) * | 2012-09-13 | 2013-01-16 | 北京航空航天大学 | Method for interval and paging query of time-intensive B/S (Browser/Server) with large data size |
CN104965873A (en) * | 2015-06-10 | 2015-10-07 | 努比亚技术有限公司 | Paging inquiring method and apparatus |
CN107133267A (en) * | 2017-04-01 | 2017-09-05 | 北京京东尚科信息技术有限公司 | Inquire about method, device, electronic equipment and the readable storage medium storing program for executing of elasticsearch clusters |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110688357A (en) * | 2018-06-20 | 2020-01-14 | 华为技术有限公司 | Method and device for reading log type data |
CN110688357B (en) * | 2018-06-20 | 2021-08-20 | 华为技术有限公司 | Method and device for reading log type data |
CN111125178B (en) * | 2018-10-30 | 2021-05-28 | 亿度慧达教育科技(北京)有限公司 | Data query method, device, terminal, presto query engine and storage medium |
CN111125178A (en) * | 2018-10-30 | 2020-05-08 | 亿度慧达教育科技(北京)有限公司 | Data query method, device, terminal, presto query engine and storage medium |
CN109739882A (en) * | 2019-01-04 | 2019-05-10 | 南威软件股份有限公司 | A kind of big data enquiring and optimizing method based on Presto and Elasticsearch |
CN110321388B (en) * | 2019-02-26 | 2021-07-02 | 南威软件股份有限公司 | Quick sequencing query method and system based on Greenplus |
CN110321388A (en) * | 2019-02-26 | 2019-10-11 | 南威软件股份有限公司 | A kind of quicksort querying method and system based on Greenplum |
CN114072788A (en) * | 2019-07-02 | 2022-02-18 | 国际商业机器公司 | Random sampling from search engine |
US11797615B2 (en) | 2019-07-02 | 2023-10-24 | International Business Machines Corporation | Random sampling from a search engine |
CN112612827A (en) * | 2020-12-25 | 2021-04-06 | 平安国际智慧城市科技股份有限公司 | Database paging query method and device, computer equipment and storage medium |
CN112650779A (en) * | 2021-01-12 | 2021-04-13 | 浪潮云信息技术股份公司 | Cloud auditing method based on ElasticSearch supporting deep page jump query |
CN114138773A (en) * | 2021-10-13 | 2022-03-04 | 浙江中控技术股份有限公司 | Rapid page turning method for database |
CN113961573A (en) * | 2021-12-23 | 2022-01-21 | 北京力控元通科技有限公司 | Time sequence database query method and query system |
Also Published As
Publication number | Publication date |
---|---|
CN107748766B (en) | 2021-08-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107748766A (en) | A kind of big data method for quickly querying based on Presto and Elasticsearch | |
Agarwal et al. | BlinkDB: queries with bounded errors and bounded response times on very large data | |
CN104424258B (en) | Multidimensional data query method, query server, column storage server and system | |
US9128983B2 (en) | Systems and methods for query optimization | |
CN106649670A (en) | Streaming computing-based data monitoring method and apparatus | |
CN102521406A (en) | Distributed query method and system for complex task of querying massive structured data | |
CN102521405A (en) | Massive structured data storage and query methods and systems supporting high-speed loading | |
CN105045932A (en) | Data paging inquiry method based on descending order storage | |
CN104252536A (en) | Hbase-based internet log data inquiring method and device | |
CN103235796B (en) | Search method and system based on user click behavior | |
CN106021357B (en) | Based on distributed big data paging query method and system | |
CN107766413B (en) | Method for realizing real-time data stream aggregation query | |
CN111552885B (en) | System and method for realizing automatic real-time message pushing operation | |
CN113609374A (en) | Data processing method, device and equipment based on content push and storage medium | |
KR20160053933A (en) | Smart search refinement | |
CN105426449A (en) | Method and device for massive data query and server | |
US8909619B1 (en) | Providing search results tools | |
CA2901685C (en) | Crowdsourcing user-provided identifiers and associating them with brand identities | |
CN109739882A (en) | A kind of big data enquiring and optimizing method based on Presto and Elasticsearch | |
Amagata et al. | Sliding window top-k dominating query processing over distributed data streams | |
CN104123329A (en) | Search method and device | |
CN110781210A (en) | Data processing platform for multi-dimensional aggregation real-time query of large-scale data | |
CN111680072B (en) | System and method for dividing social information data | |
US9405846B2 (en) | Publish-subscribe based methods and apparatuses for associating data files | |
CN109063201B (en) | Impala online interactive query method based on mixed storage scheme |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||