CN107748766A - A fast big data query method based on Presto and Elasticsearch

Info

Publication number
CN107748766A
CN107748766A CN201710900970.0A CN201710900970A CN107748766A CN 107748766 A CN107748766 A CN 107748766A CN 201710900970 A CN201710900970 A CN 201710900970A CN 107748766 A CN107748766 A CN 107748766A
Authority
CN
China
Prior art keywords
data
offset
elasticsearch
clusters
presto
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710900970.0A
Other languages
Chinese (zh)
Other versions
CN107748766B (en)
Inventor
洪灿榕
吴晓梅
李明溪
蔡炜榕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Linewell Software Co Ltd
Original Assignee
Linewell Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Linewell Software Co Ltd
Priority to CN201710900970.0A
Publication of CN107748766A
Application granted
Publication of CN107748766B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a fast big data query method based on Presto and Elasticsearch. All queried data contain a time field and are stored in an Elasticsearch cluster with one index per day. A Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page are read from the Elasticsearch cluster. The invention quickly locates the position of the target page data and greatly reduces the reading of redundant data, thereby improving the performance of random page-jump queries.

Description

A fast big data query method based on Presto and Elasticsearch
Technical field
The invention relates to fast query methods for big data, and specifically provides a fast big data query method based on Presto and Elasticsearch.
Background
Elasticsearch is a real-time distributed search and analytics engine built on the full-text search engine Apache Lucene™. It uses Lucene at its core to implement all indexing and search functions, so that the content of every document can be indexed, searched, sorted, and filtered. It also provides rich aggregation functions for multidimensional analysis of data. However, Elasticsearch lacks support for traditional SQL syntax, which makes it harder for developers to use and makes data migration and integration difficult for systems built on relational databases. In particular, when users query and page through results, complex queries keep the Elasticsearch servers under high load for long periods; moreover, a random, long-range page jump has to skip over a large number of pages, reading a great deal of data that is never used. Presto can provide basic SQL syntax support for Elasticsearch, but its memory-based query mechanism likewise needs to pre-read a large amount of target data into cluster memory, consuming substantial server resources and time, so it does not fundamentally solve the problem either.
Summary of the invention
The object of the invention is to provide a fast big data query method based on Presto and Elasticsearch. It uses a Presto cluster to receive and parse SQL and, combined with the Date Histogram Aggregation statistics function of Elasticsearch, locates the target page data step by step, thereby achieving fast data queries.
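To make the role of the Date Histogram Aggregation concrete, the following is a minimal sketch of a request body that counts matching records per time bucket without returning any documents. The function name `build_histogram_request`, the aggregation name `per_bucket`, and the field name `COLLECTTIME` are illustrative assumptions, not taken from the patent. The sketch uses the `fixed_interval` parameter of recent Elasticsearch versions; releases contemporary with the patent used a single `interval` parameter instead.

```python
def build_histogram_request(time_field, start, end, unit):
    """Build a search body that counts records per time bucket in [start, end).

    `unit` is one of "hour", "minute", "second". Names are illustrative only.
    """
    intervals = {"hour": "1h", "minute": "1m", "second": "1s"}
    return {
        "size": 0,  # only bucket counts are needed, not the documents themselves
        "query": {
            "bool": {
                "filter": [
                    {"range": {time_field: {"gte": start, "lt": end}}}
                ]
            }
        },
        "aggs": {
            "per_bucket": {
                "date_histogram": {
                    "field": time_field,
                    "fixed_interval": intervals[unit],
                }
            }
        },
    }


body = build_histogram_request(
    "COLLECTTIME", "2017-01-01T08:00:00", "2017-01-01T09:00:00", "minute"
)
```

Any other conditions from the user's WHERE clause would have to be added to the `filter` list so that the bucket counts reflect only matching records.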
To achieve the above object, the invention adopts the following technical solution:
A fast big data query method based on Presto and Elasticsearch, in which all queried data contain a time field and are stored in an Elasticsearch cluster with one index per day. A Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page are read from the Elasticsearch cluster.
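The one-index-per-day layout can be illustrated with a small sketch. It assumes that indices are named after the calendar date in `yyyy-MM-dd` form, as in the description's example where a record stamped 2017-01-01 08:00:00 lands in the index named 2017-01-01; the helper names `daily_index_name` and `index_range` are hypothetical.

```python
from datetime import datetime, timedelta


def daily_index_name(ts: datetime) -> str:
    """Name of the daily index that stores a record with timestamp `ts`."""
    return ts.strftime("%Y-%m-%d")


def index_range(start: datetime, end: datetime) -> list[str]:
    """Names of every daily index touched by the closed interval [start, end]."""
    days = (end.date() - start.date()).days
    return [
        (start.date() + timedelta(days=i)).strftime("%Y-%m-%d")
        for i in range(days + 1)
    ]


print(daily_index_name(datetime(2017, 1, 1, 8, 0, 0)))
print(index_range(datetime(2016, 12, 31, 22), datetime(2017, 1, 2, 1)))
```

With such a layout, the time predicates in the WHERE clause are enough to decide which indices can contain matching records, so the remaining indices never need to be touched.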
The fast big data query method specifically comprises the following steps:
Step 1: The Presto cluster receives and parses the SQL request and generates the corresponding query plan, obtaining the index range in the Elasticsearch cluster of the data that satisfy the query conditions, together with the total number of records that satisfy them;
Step 2: From the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indices contain the target page; the record counts of the skipped indices are subtracted directly from OFFSET, giving a new data offset OFFSET_1;
Step 3: Using the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target page obtained in step 2, obtaining the number of records that satisfy the conditions in each time period, as follows:
Step 3.1: Count the records of the index containing the target page (from step 2) in units of hours. Using the data offset OFFSET_1 from step 2, determine which hour period contains the data to be queried; the record counts of the skipped hour periods are subtracted directly from OFFSET_1, giving a new data offset OFFSET_2. Compare OFFSET_2 with the system-preset threshold M;
If OFFSET_2 is less than or equal to M, add the time range of that hour period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;
Step 3.2: Count the records of the hour period obtained in step 3.1 in units of minutes. Using the data offset OFFSET_2, determine which minute period contains the data to be queried; the record counts of the skipped minute periods are subtracted directly from OFFSET_2, giving a new data offset OFFSET_3. Compare OFFSET_3 with the system-preset threshold M;
If OFFSET_3 is less than or equal to M, add the time range of that minute period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;
Step 3.3: Count the records of the minute period obtained in step 3.2 in units of seconds. Using the data offset OFFSET_3, determine which second period contains the data to be queried; the record counts of the skipped second periods are subtracted directly from OFFSET_3, giving a new data offset OFFSET_4. Add the time range of that second period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result. COUNT is the number of records per page.
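Every level of the steps above (indices, hours, minutes, seconds) repeats the same arithmetic: walk the per-bucket counts in time order, subtract each skipped bucket from the offset, and stop at the first bucket that contains the offset. A minimal sketch of that primitive follows; the name `locate_bucket` is hypothetical, and the sketch only finds the bucket holding the first record of the page, leaving aside the case where a page straddles two buckets.

```python
def locate_bucket(counts, offset):
    """Return (bucket_position, remaining_offset) for a 0-based record offset.

    `counts` lists the number of matching records per bucket, in time order.
    """
    remaining = offset
    for position, count in enumerate(counts):
        if remaining < count:
            # The target record lies inside this bucket.
            return position, remaining
        # Skip the whole bucket and deduct its records from the offset.
        remaining -= count
    raise ValueError("offset is beyond the last matching record")


# Three daily indices with 100, 250 and 80 matching records, OFFSET = 320:
# the first index is skipped (320 - 100 = 220), and 220 < 250 selects the second.
print(locate_bucket([100, 250, 80], 320))
```

Here OFFSET_1 would be 220, and the same function can be applied again to the hourly counts of the selected index to obtain OFFSET_2, and so on.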
In the invention, the Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page are read from the Elasticsearch cluster. The position of the target page data is thus located quickly and the reading of redundant data is greatly reduced, improving the performance of random page-jump queries.
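The final read can be sketched as follows, assuming the time interval has already been located. The helpers `narrowed_query` and `page_from_rows` are hypothetical, and the string-based WHERE handling is a simplification for illustration rather than how Presto rewrites a query plan. Slicing at the remaining offset is equivalent to keeping the last COUNT rows whenever the narrowed read returns exactly OFFSET_k+COUNT rows.

```python
from datetime import datetime


def narrowed_query(base_where, time_column, start, end, remaining_offset, count):
    """Return (where_clause, row_limit) for the read restricted to [start, end)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    time_filter = (
        f"{time_column} >= '{start.strftime(fmt)}' "
        f"AND {time_column} < '{end.strftime(fmt)}'"
    )
    where = f"({base_where}) AND {time_filter}" if base_where else time_filter
    # Only the rows skipped inside the window plus one page must be read.
    return where, remaining_offset + count


def page_from_rows(rows, remaining_offset, count):
    """Drop the rows still skipped inside the window and keep one page."""
    return rows[remaining_offset:remaining_offset + count]


where, limit = narrowed_query(
    "COLUMN1 = 'XXX'", "COLLECTTIME",
    datetime(2017, 1, 1, 8), datetime(2017, 1, 1, 9),
    remaining_offset=20, count=10,
)
print(where, limit)
```

In this example only 30 rows are read from the narrowed window, regardless of how large the original OFFSET was.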
Brief description of the drawings
Fig. 1 is a framework diagram of the system of the invention;
Fig. 2 is a flowchart of target page data location according to the invention.
Detailed description
As shown in Figs. 1 and 2, the invention discloses a fast big data query method based on Presto and Elasticsearch, comprising the following steps:
Step 1: The queried data contain at least one field of timestamp type and are stored in Elasticsearch with one index per day. For example, a record with timestamp 2017-01-01 08:00:00 is stored in the index named 2017-01-01;
Step 2: The user submits a SQL query to the Presto server; the SQL has the following form:
SELECT COLUMN1, COLUMN2... FROM TABLE [WHERE COLLECTTIME >, <, = 'yyyy-MM-dd HH:mm:ss' AND COLUMN1 = 'XXX'] ORDER BY COLLECTTIME [, COLUMN3, COLUMN4...] [LIMIT OFFSET, LIMIT]
The server parses the SQL and generates the corresponding query plan, thereby determining which Elasticsearch indices hold the data the user needs to query, and obtaining the index range of the target data;
Step 3: For each index in the index range obtained in step 2, query the total number of records that satisfy the query conditions submitted in the SQL;
Step 4: From the per-index record counts obtained in step 3, together with the data offset OFFSET in the SQL submitted by the user and the number of records per page COUNT, determine which index or indices contain the target page; the record counts of the skipped indices are subtracted directly from OFFSET, giving a new offset OFFSET_1;
Step 5: Using the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target page obtained in step 4, obtaining the number of records that satisfy the conditions in each time period, as follows:
Step 5.1: Count in units of hours. Because one index contains the data of only one day, this yields the record counts of the 24 hour periods of that day. Using the OFFSET_1 obtained in step 4, determine which period contains the data to be queried; the record counts of the skipped periods are subtracted directly from OFFSET_1, giving a new OFFSET_2. Compare OFFSET_2 with the system-preset threshold M;
If OFFSET_2 is less than or equal to M, add the time range of that period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;
Step 5.2: Take the hour period obtained in the previous step as the query condition and count the data in that period in units of minutes. Because one hour has 60 minutes, this yields the number of qualifying records in each minute of that hour. Using the OFFSET_2 obtained in step 5.1, determine which minute period contains the data to be queried; the record counts of the skipped periods are subtracted directly from OFFSET_2, giving a new OFFSET_3. Compare OFFSET_3 with the system-preset threshold M;
If OFFSET_3 is less than or equal to M, add the time range of that period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;
Step 5.3: Take the minute period obtained in the previous step as the query condition and count the data in that period in units of seconds. Because one minute has 60 seconds, this yields the number of qualifying records in each second of that minute. Using the OFFSET_3 obtained in step 5.2, determine which second period contains the data to be queried; the record counts of the skipped periods are subtracted directly from OFFSET_3, giving a new OFFSET_4. Add the time range of that period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result.
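The whole locating procedure of steps 4 to 5.3 can be simulated in memory as a sanity check of the arithmetic. This is only a sketch under stated assumptions: a list of timestamps stands in for the Elasticsearch cluster, a `Counter` stands in for each Date Histogram Aggregation call, and the names `LEVELS` and `locate_page` are hypothetical. It also assumes the page starts inside a single bucket at every level.

```python
from collections import Counter
from datetime import datetime, timedelta

# (level name, bucket width, function that truncates a timestamp to its bucket start)
LEVELS = [
    ("day", timedelta(days=1), lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0)),
    ("hour", timedelta(hours=1), lambda t: t.replace(minute=0, second=0, microsecond=0)),
    ("minute", timedelta(minutes=1), lambda t: t.replace(second=0, microsecond=0)),
    ("second", timedelta(seconds=1), lambda t: t.replace(microsecond=0)),
]


def locate_page(timestamps, offset, threshold):
    """Return (start, end, remaining_offset) of the window holding record `offset`."""
    window = sorted(timestamps)
    if not 0 <= offset < len(window):
        raise ValueError("offset is outside the matching records")
    for name, width, truncate in LEVELS:
        counts = Counter(truncate(t) for t in window)  # stand-in for one histogram call
        for bucket in sorted(counts):
            if offset < counts[bucket]:
                break  # the target record lies in this bucket
            offset -= counts[bucket]  # skip the bucket and deduct its records
        start, end = bucket, bucket + width
        window = [t for t in window if start <= t < end]
        # The day (index) level always drills into hours; below that,
        # stop as soon as the remaining offset is at most the threshold M.
        if name != "day" and offset <= threshold:
            break
    return start, end, offset


# 10 records late on 2016-12-31, then one record per second for two hours on 2017-01-01.
data = [datetime(2016, 12, 31, 23) + timedelta(seconds=i) for i in range(10)]
data += [datetime(2017, 1, 1, 8) + timedelta(seconds=i) for i in range(7200)]
print(locate_page(data, offset=3735, threshold=50))
```

In this example the offset 3735 is reduced to 3725 after skipping the first day, to 125 after skipping the 08:00 hour, and to 5 after skipping two minutes, at which point it falls below the threshold of 50 and the window 09:02:00 to 09:03:00 is returned.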
In the invention, all queried data contain a time field and are stored in an Elasticsearch cluster with one index per day. The Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained. Through stepwise counting and calculation, the index and time interval containing the target page are located. The time interval is then added to the query conditions of the original SQL statement, and the data of the target page are read from the Elasticsearch cluster. The invention quickly locates the position of the target page data and greatly reduces the reading of redundant data, thereby improving the performance of random page-jump queries.
The above is only an embodiment of the invention and does not limit its technical scope. Any minor modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the invention therefore still falls within the scope of the technical solution of the invention.

Claims (2)

  1. A fast big data query method based on Presto and Elasticsearch, characterized in that: all queried data contain a time field and are stored in an Elasticsearch cluster with one index per day; a Presto cluster receives and parses the SQL request and generates the corresponding query plan, from which the index range in the Elasticsearch cluster of the data satisfying the query plan is obtained; through stepwise counting and calculation, the index and time interval containing the target page are located; and the time interval is added to the query conditions of the original SQL statement and the data of the target page are read from the Elasticsearch cluster.
  2. The fast big data query method based on Presto and Elasticsearch according to claim 1, characterized in that the method specifically comprises the following steps:
    Step 1: The Presto cluster receives and parses the SQL request and generates the corresponding query plan, obtaining the index range in the Elasticsearch cluster of the data that satisfy the query conditions, together with the total number of records that satisfy them;
    Step 2: From the data offset OFFSET in the SQL submitted by the user and the record count of each index in the index range, determine which index or indices contain the target page; the record counts of the skipped indices are subtracted directly from OFFSET, giving a new data offset OFFSET_1;
    Step 3: Using the Date Histogram Aggregation interface of Elasticsearch, compute time-segmented statistics over the index containing the target page obtained in step 2, obtaining the number of records that satisfy the conditions in each time period, as follows:
    Step 3.1: Count the records of the index containing the target page (from step 2) in units of hours. Using the data offset OFFSET_1, determine which hour period contains the data to be queried; the record counts of the skipped hour periods are subtracted directly from OFFSET_1, giving a new data offset OFFSET_2. Compare OFFSET_2 with the system-preset threshold M;
    If OFFSET_2 is less than or equal to M, add the time range of that hour period to the query conditions of the original SQL statement, read OFFSET_2+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_2 is greater than M, proceed to the next step;
    Step 3.2: Count the records of the hour period obtained in step 3.1 in units of minutes. Using the data offset OFFSET_2, determine which minute period contains the data to be queried; the record counts of the skipped minute periods are subtracted directly from OFFSET_2, giving a new data offset OFFSET_3. Compare OFFSET_3 with the system-preset threshold M;
    If OFFSET_3 is less than or equal to M, add the time range of that minute period to the query conditions of the original SQL statement, read OFFSET_3+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result; if OFFSET_3 is greater than M, proceed to the next step;
    Step 3.3: Count the records of the minute period obtained in step 3.2 in units of seconds. Using the data offset OFFSET_3, determine which second period contains the data to be queried; the record counts of the skipped second periods are subtracted directly from OFFSET_3, giving a new data offset OFFSET_4. Add the time range of that second period to the query conditions of the original SQL statement, read OFFSET_4+COUNT records from the Elasticsearch cluster, and keep the last COUNT records in the Presto cluster to obtain the final query result;
    COUNT is the number of records per page.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710900970.0A CN107748766B (en) 2017-09-28 2017-09-28 Big data fast query method based on Presto and elastic search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710900970.0A CN107748766B (en) 2017-09-28 2017-09-28 Big data fast query method based on Presto and elastic search

Publications (2)

Publication Number Publication Date
CN107748766A true CN107748766A (en) 2018-03-02
CN107748766B CN107748766B (en) 2021-08-24

Family

ID=61255198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710900970.0A Active CN107748766B (en) 2017-09-28 2017-09-28 Big data fast query method based on Presto and elastic search

Country Status (1)

Country Link
CN (1) CN107748766B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007079303A2 (en) * 2005-12-29 2007-07-12 Amazon Technologies, Inc. Method and apparatus for a distributed file storage and indexing service
CN102880685A (en) * 2012-09-13 2013-01-16 北京航空航天大学 Method for interval and paging query of time-intensive B/S (Browser/Server) with large data size
CN104965873A (en) * 2015-06-10 2015-10-07 努比亚技术有限公司 Paging inquiring method and apparatus
CN107133267A (en) * 2017-04-01 2017-09-05 北京京东尚科信息技术有限公司 Inquire about method, device, electronic equipment and the readable storage medium storing program for executing of elasticsearch clusters

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688357A (en) * 2018-06-20 2020-01-14 华为技术有限公司 Method and device for reading log type data
CN110688357B (en) * 2018-06-20 2021-08-20 华为技术有限公司 Method and device for reading log type data
CN111125178A (en) * 2018-10-30 2020-05-08 亿度慧达教育科技(北京)有限公司 Data query method, device, terminal, presto query engine and storage medium
CN111125178B (en) * 2018-10-30 2021-05-28 亿度慧达教育科技(北京)有限公司 Data query method, device, terminal, presto query engine and storage medium
CN109739882A (en) * 2019-01-04 2019-05-10 南威软件股份有限公司 A kind of big data enquiring and optimizing method based on Presto and Elasticsearch
CN110321388A (en) * 2019-02-26 2019-10-11 南威软件股份有限公司 A kind of quicksort querying method and system based on Greenplum
CN110321388B (en) * 2019-02-26 2021-07-02 南威软件股份有限公司 Quick sequencing query method and system based on Greenplus
CN114072788A (en) * 2019-07-02 2022-02-18 国际商业机器公司 Random sampling from search engine
US11797615B2 (en) 2019-07-02 2023-10-24 International Business Machines Corporation Random sampling from a search engine
CN112612827A (en) * 2020-12-25 2021-04-06 平安国际智慧城市科技股份有限公司 Database paging query method and device, computer equipment and storage medium
CN112650779A (en) * 2021-01-12 2021-04-13 浪潮云信息技术股份公司 Cloud auditing method based on ElasticSearch supporting deep page jump query
CN113961573A (en) * 2021-12-23 2022-01-21 北京力控元通科技有限公司 Time sequence database query method and query system

Also Published As

Publication number Publication date
CN107748766B (en) 2021-08-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant