CN112445776B - Presto-based dynamic bucketing method, system, device and readable storage medium - Google Patents

Presto-based dynamic bucketing method, system, device and readable storage medium

Info

Publication number
CN112445776B (application CN202011310738.XA)
Authority
CN
China
Prior art keywords
bucket
memory
node
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011310738.XA
Other languages
Chinese (zh)
Other versions
CN112445776A (en)
Inventor
于扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Analysys Digital Intelligence Technology Co ltd
Original Assignee
Beijing Analysys Think Tank Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Analysys Think Tank Network Technology Co ltd filed Critical Beijing Analysys Think Tank Network Technology Co ltd
Priority to CN202011310738.XA priority Critical patent/CN112445776B/en
Publication of CN112445776A publication Critical patent/CN112445776A/en
Application granted granted Critical
Publication of CN112445776B publication Critical patent/CN112445776B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files

Abstract

The embodiments of this application disclose a Presto-based dynamic bucketing method, system, device, and readable storage medium. The method comprises: acquiring a data set, sorting it by a logical primary key, pre-splitting it into buckets according to the value range of the logical primary key and the number of compute nodes, and storing the bucket files, each sorted internally by the logical primary key; determining an estimated query peak memory from the data volume and historical query records; calculating a target bucket count and a per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system; adjusting the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency to obtain the value range of each bucket; and splitting the files according to the stored bucket-file information and the value range of each bucket. The method saves disk and network IO and achieves the best computing performance under limited CPU and memory.

Description

Presto-based dynamic bucketing method, system, device and readable storage medium
Technical Field
The embodiments of this application relate to the technical field of big data, and in particular to a Presto-based dynamic bucketing method, system, device, and readable storage medium.
Background
Presto is an MPP-architecture distributed SQL query engine open-sourced by Facebook. It is suited to interactive analytical queries and supports data volumes from gigabytes to petabytes.
During computation, Presto splits the data and loads it into the memory of the worker nodes. In some complex computing scenarios, such as aggregation, window functions, and table joins, the intermediate result is often large and may exceed the maximum memory the system allows a single compute task (i.e., one query) on each compute node, configured by the system parameter query.max-memory-per-node. If the spill-to-disk feature is enabled (via a system parameter such as experimental.spill-enabled=true), a disk spill is triggered; otherwise the query fails directly with "Query exceeded user memory limit of XXX" (the value of query.max-memory-per-node).
Although spill-to-disk can to some extent avoid query failures caused by insufficient memory, two pain points remain in practice. First, spilling greatly reduces Presto query performance. Second, the disk space available for spilling is limited, and once the intermediate data grows too large the query still fails. For these scenarios the Presto community designed an optimization: bucketing the data, i.e., splitting it by a primary key under some rule (usually a hash; some data warehouses such as Hive have no primary key, but in practice a logical primary key can be defined from business logic). With the configuration parameter grouped-execution-enabled=true, computation is executed bucket by bucket: aggregations, windows, or joins over the logical primary key are processed one bucket at a time, turning one large compute task into several serial small tasks whose results are finally merged.
This solution reduces peak memory usage to some extent, but has problems in practice: 1. The original data is not stored as bucketed tables, and re-bucketing it requires re-running the ETL, which is time-consuming and laborious. 2. As a project evolves, incremental data per unit time may grow tens or even hundreds of times, and even with grouped execution a spill or out-of-memory error may still be triggered. 3. In that case, increasing the bucket count again requires re-running the ETL. 4. If the table is created with many buckets at the start of the project, there will be many small files, which puts heavy pressure on the file system and on Presto scheduling.
Disclosure of Invention
Therefore, embodiments of this application provide a Presto-based dynamic bucketing method, system, device, and readable storage medium. Files are pre-split by the value range of the logical primary key, the interior of each pre-split file is sorted by the logical primary key, and bucketing is done over dynamic primary-key ranges, so no shuffle is needed during bucketing: files only require a simple range-based split, which greatly saves disk and network IO. The bucket count and parallelism are estimated from available resources, concurrency, and historical query records, preserving query flexibility and achieving the best computing performance under limited CPU and memory.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions:
According to a first aspect of the embodiments of this application, a Presto-based dynamic bucketing method is provided, the method comprising:
acquiring a data set, sorting it by a logical primary key, pre-splitting it into buckets according to the value range of the logical primary key and the number of compute nodes, and storing the bucket files, each sorted internally by the logical primary key;
determining an estimated query peak memory from the data volume and historical query records;
calculating a target bucket count and a per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system;
adjusting the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency to obtain the value range of each bucket;
and splitting the files according to the stored bucket-file information and the value range of each bucket.
Optionally, calculating the target bucket count and per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system includes:
estimating the query peak memory M1 for single-node execution;
estimating, from the pre-split bucket count B, the peak memory M2 = M1/B needed to run one bucket at a time;
calculating the maximum usable memory of a single compute node M5 = MIN(M3, M4) from the remaining capacity M3 of the current Presto general memory pool and the per-query single-node memory limit M4;
if M5 > M2, the optimal bucket count is the pre-split count B, and the bucket concurrency K is chosen such that M5 > M2 × K and M5 < M2 × (K + 1);
if M5 < M2, an integer bucket-splitting coefficient b is chosen such that M2/b < M5 and M2/(b-1) > M5; the optimal bucket count is then B × b and the bucket concurrency is 1.
Optionally, adjusting the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency includes:
configuring the system's built-in session parameter concurrent_lifespans_per_task according to the per-node bucket concurrency;
and configuring the custom session parameter dynamic_bucket_num according to the target bucket count.
Optionally, storing the bucket files includes:
storing the bucket files in a columnar storage format with a sparse index.
According to a second aspect of the embodiments of this application, a Presto-based dynamic bucketing system is provided, the system comprising:
a pre-bucketing module, configured to acquire a data set, sort it by a logical primary key, pre-split it into buckets according to the value range of the logical primary key and the number of compute nodes, and store the bucket files, each sorted internally by the logical primary key;
a query peak memory estimation module, configured to determine an estimated query peak memory from the data volume and historical query records;
a calculation module, configured to calculate a target bucket count and a per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system;
an adjustment module, configured to adjust the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency, obtaining the value range of each bucket;
and a splitting module, configured to split the files according to the stored bucket-file information and the value range of each bucket.
Optionally, the calculation module is specifically configured to:
estimate the query peak memory M1 for single-node execution;
estimate, from the pre-split bucket count B, the peak memory M2 = M1/B needed to run one bucket at a time;
calculate the maximum usable memory of a single compute node M5 = MIN(M3, M4) from the remaining capacity M3 of the current Presto general memory pool and the per-query single-node memory limit M4;
if M5 > M2, take the pre-split count B as the optimal bucket count, and choose the bucket concurrency K such that M5 > M2 × K and M5 < M2 × (K + 1);
if M5 < M2, choose an integer bucket-splitting coefficient b such that M2/b < M5 and M2/(b-1) > M5; the optimal bucket count is then B × b and the bucket concurrency is 1.
Optionally, the adjustment module is specifically configured to:
configure the system's built-in session parameter concurrent_lifespans_per_task according to the per-node bucket concurrency;
and configure the custom session parameter dynamic_bucket_num according to the target bucket count.
Optionally, the pre-bucketing module is specifically configured to:
store the bucket files in a columnar storage format with a sparse index.
According to a third aspect of the embodiments, a device is provided, comprising a data acquisition component, a processor, and a memory. The data acquisition component is configured to acquire data; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method of any implementation of the first aspect.
According to a fourth aspect of the embodiments, a computer-readable storage medium is provided, containing one or more program instructions for performing the method of any implementation of the first aspect.
In summary, the embodiments of this application provide a Presto-based dynamic bucketing method, system, device, and readable storage medium: a data set is acquired, sorted by a logical primary key, pre-split into buckets according to the value range of the logical primary key and the number of compute nodes, and stored as bucket files sorted internally by the logical primary key; an estimated query peak memory is determined from the data volume and historical query records; a target bucket count and per-node bucket concurrency are calculated from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system; the bucket granularity of the executing query is adjusted accordingly to obtain the value range of each bucket; and the files are split according to the stored bucket-file information and each bucket's value range. Because files are pre-split by the value range of the logical primary key, sorted internally by that key, and bucketed over dynamic primary-key ranges, no shuffle is needed when re-bucketing: files only require a simple range-based split, which greatly saves disk and network IO. The bucket count and parallelism are estimated from available resources, concurrency, and historical query records, preserving query flexibility and achieving the best computing performance under limited CPU and memory.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing them are briefly introduced below. The drawings in the following description are merely exemplary, and other implementation drawings can be derived from them by those of ordinary skill in the art without inventive effort.
The structures, proportions, and sizes shown in this specification are used only to accompany the disclosed content for the understanding of those skilled in the art; they do not limit the conditions under which the invention can be implemented. Any modification of structure, change of proportion, or adjustment of size that does not affect the efficacy or attainable purpose of the invention still falls within the scope covered by the disclosed technical content.
Fig. 1 is a schematic flow chart of a Presto-based dynamic bucketing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of the file pre-splitting effect in ETL according to an embodiment of the present application;
fig. 3 is a schematic diagram of the bucketing effect according to an embodiment of the present application;
fig. 4 is a schematic diagram of the data re-import required when the bucket count is adjusted by rebuilding the table, according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a Presto-based dynamic bucketing system according to an embodiment of the present application.
Detailed Description
The present invention is described below in terms of particular embodiments; other advantages and effects of the invention will be readily apparent to those skilled in the art from this disclosure. The described embodiments are merely a part of the embodiments of the invention, not all of them; all other embodiments obtained by a person of ordinary skill in the art without creative effort based on these embodiments fall within the protection scope of the invention.
In view of the shortcomings of existing bucketed-computation techniques, fig. 1 shows a schematic flow chart of the Presto-based dynamic bucketing method provided by an embodiment of this application. As shown in fig. 1, the method includes the following steps:
step 101: acquiring a data set, sequencing the data set according to a logic main key, pre-partitioning a barrel according to the value range of the logic main key and the number of calculation nodes, and storing a barrel partitioning file; and the sub-bucket files are sorted according to the logic main key.
Step 102: and determining the memory of the estimated query peak value according to the data volume and the historical query record.
Step 103: and calculating the target barrel number and the concurrent number of the single-node barrels according to the pre-estimated query peak value memory, the number of queries currently executed by the current system and the number of calculated nodes of the current distributed calculation system.
Step 104: and adjusting the bucket granularity in the executing query according to the target bucket number and the single-node bucket concurrency number to obtain the value range of each bucket.
Step 105: and segmenting the file according to the stored sub-bucket file information and the value range of each sub-bucket.
In a possible implementation of step 101, storing the bucket files includes storing them in a columnar storage format with a sparse index, such as Apache ORC or Apache Parquet. NoSQL storage engines that guarantee ordering, such as Kudu and HBase, may also be used.
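As a toy illustration of why a sparse index helps here (this is not the patent's storage code; real deployments would rely on the row-group or stripe min/max statistics that ORC and Parquet maintain), the sketch below records the first key of every block of a key-sorted bucket file, so a later range-based split can jump to the right block without scanning:

```python
import bisect

def build_sparse_index(sorted_keys, stride=3):
    """Sparse index: (row offset, first key) for every block of `stride` rows."""
    return [(i, sorted_keys[i]) for i in range(0, len(sorted_keys), stride)]

def first_block_for(index, key):
    """Row offset of the first indexed block that may contain `key`."""
    firsts = [k for _, k in index]
    # rightmost block whose first key is <= key
    pos = bisect.bisect_right(firsts, key) - 1
    return index[max(pos, 0)][0]
```

For example, a file with sorted keys [1, 4, 9, 16, 25, 36, 49] and a stride of 3 is indexed by just three entries, yet a split at key 20 can start reading at row offset 3 directly.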
In a possible implementation of step 103, calculating the target bucket count and per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system includes:
estimating the query peak memory M1 for single-node execution; estimating, from the pre-split bucket count B, the peak memory M2 = M1/B needed to run one bucket at a time; calculating the maximum usable memory of a single compute node M5 = MIN(M3, M4) from the remaining capacity M3 of the current Presto general memory pool (by default 70% of the JVM heap size) and the per-query single-node memory limit M4 (configured by the system parameter query.max-total-memory-per-node); if M5 > M2, the optimal bucket count is the pre-split count B, and the bucket concurrency K is chosen such that M5 > M2 × K and M5 < M2 × (K + 1); if M5 < M2, an integer bucket-splitting coefficient b is chosen such that M2/b < M5 and M2/(b-1) > M5, in which case the optimal bucket count is B × b and the bucket concurrency is 1.
In a possible implementation of step 104, adjusting the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency includes: configuring the system's built-in session parameter concurrent_lifespans_per_task according to the per-node bucket concurrency; and configuring the custom session parameter dynamic_bucket_num according to the target bucket count.
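The two session parameters can be applied with ordinary `SET SESSION` statements; the sketch below simply builds them as strings (`concurrent_lifespans_per_task` is a standard Presto session property, while `dynamic_bucket_num` is the custom property introduced by this application; the helper itself is illustrative, not the patent's code):

```python
def session_statements(target_bucket_num, per_node_bucket_concurrency):
    """Build the SET SESSION statements the adjustment step would issue."""
    return [
        # built-in property: how many buckets (lifespans) run at once per task
        f"SET SESSION concurrent_lifespans_per_task = {per_node_bucket_concurrency}",
        # custom property from this application: the total dynamic bucket count
        f"SET SESSION dynamic_bucket_num = {target_bucket_num}",
    ]
```

A client would issue these statements on its connection before submitting the query, so the coordinator schedules the query with the adjusted bucket granularity.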
The dynamic bucketing technique of the embodiments of this application is divided into two parts:
Part one, data storage: sort the complete data set by the logical primary key, pre-split it into buckets according to the primary key's value range and the number of compute nodes, and sort the interior of each bucket according to the business logic.
A usage example: suppose the logical primary key is a hashed long integer, whose value range is [-9223372036854775808, 9223372036854775807]. If the compute resources comprise 4 compute nodes, the data can be segmented into 4 × n segments by the value range of the logical primary key; taking 4 segments as an example, the value range of each segment is:
[-9223372036854775808,-4611686018427387905],
[-4611686018427387904,-1],
[0,4611686018427387903],
[4611686018427387904,9223372036854775807]
the file splitting effect of the process is shown in fig. 2. The inside of the file is stored according to the mode that the main keys are orderly, and the value range of the main keys is represented in the file name so as to facilitate the subsequent splitting and matching of the barrel. Files are stored in a sparse-indexed columnar storage format, such as Apache ORC, apache queue.
The data splitting and sorting processes all need to use ETL components, such as MapReduce, spark, and flight.
The data storage may use Apache ORC, apache queue, or no sql storage engines Kudu, HBase for ensuring the order.
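The pre-split ranges above can be reproduced with a short sketch (illustrative only; `pre_split` and the even division of the signed 64-bit range are assumptions of this example, not the patent's code):

```python
# Evenly pre-split the signed 64-bit value range of a hashed long logical
# primary key into contiguous segments (4 * n segments for 4 compute nodes).
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

def pre_split(num_segments):
    """Return (lo, hi) bounds for each contiguous segment of the int64 range."""
    width = (INT64_MAX - INT64_MIN + 1) // num_segments
    segments = []
    lo = INT64_MIN
    for i in range(num_segments):
        # the last segment absorbs any remainder of the division
        hi = INT64_MAX if i == num_segments - 1 else lo + width - 1
        segments.append((lo, hi))
        lo = hi + 1
    return segments
```

`pre_split(4)` yields exactly the four segment ranges listed above, and `pre_split(8)` yields the eight bucket ranges used later in the adjustment example.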
Part two, data computation: Presto performs the following optimizations on top of the data stored in part one, including:
1. Estimate the peak memory required by the current query from the data volume and historical query records.
2. Calculate the optimal bucket count and per-node bucket concurrency from the estimated peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system.
Calculation rules:
(1) Estimate the query peak memory M1 for single-node execution.
(2) From the pre-split bucket count B, estimate the peak memory M2 = M1/B needed to run one bucket at a time.
(3) From the remaining capacity M3 of the current Presto general memory pool (by default 70% of the JVM heap size) and the per-query single-node memory limit M4 (configured by the system parameter query.max-total-memory-per-node), calculate the maximum usable memory of a single compute node M5 = MIN(M3, M4).
(4) If M5 > M2, the optimal bucket count is the pre-split count B, and the bucket concurrency K is chosen such that M5 > M2 × K and M5 < M2 × (K + 1).
(5) If M5 < M2, an integer bucket-splitting coefficient b is chosen such that M2/b < M5 and M2/(b-1) > M5; the optimal bucket count is then B × b and the bucket concurrency is 1.
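Rules (1) through (5) can be sketched as follows (a minimal illustration assuming all memory figures are in the same unit, e.g. bytes; `plan_buckets` and its argument names are this sketch's own, not from the patent):

```python
def plan_buckets(m1, pre_bucket_count, m3, m4):
    """Return (optimal bucket count, per-node bucket concurrency)."""
    m2 = m1 / pre_bucket_count  # rule (2): peak memory to run one bucket at a time
    m5 = min(m3, m4)            # rule (3): max usable memory on one compute node
    if m5 > m2:
        # rule (4): keep the pre-split count, raise concurrency K while
        # M2 * (K + 1) still fits under M5
        k = 1
        while m2 * (k + 1) < m5:
            k += 1
        return pre_bucket_count, k
    # rule (5): one bucket does not fit; split each pre-bucket by the
    # smallest integer b such that M2 / b fits under M5
    b = 1
    while m2 / b >= m5:
        b += 1
    return pre_bucket_count * b, 1
```

For instance, with M1 = 8, B = 4, M3 = 5, M4 = 6 each bucket needs M2 = 2 and M5 = 5, so the pre-split count is kept and two buckets run concurrently per node; with M3 = 1 the same query instead splits each pre-bucket threefold and runs them serially.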
3. Adjust the bucket granularity of the current query. For example, continuing with the pre-split into 4 files: if the optimal bucket count obtained in step 2 is 8, the dynamic bucketing logic evenly divides the long value range into 8 segments, each segment being 1 bucket, with the following value ranges:
BUCKET 0:[-9223372036854775808,-6917529027641081857]
BUCKET 1:[-6917529027641081856,-4611686018427387905]
BUCKET 2:[-4611686018427387904,-2305843009213693953]
BUCKET 3:[-2305843009213693952,-1]
BUCKET 4:[0,2305843009213693951]
BUCKET 5:[2305843009213693952,4611686018427387903]
BUCKET 6:[4611686018427387904,6917529027641081855]
BUCKET 7:[6917529027641081856,9223372036854775807]
the specific regulation rule is as follows:
and according to the calculated bucket concurrency number in step2, configuring a self-contained session parameter current _ lifespans _ per _ task of the system.
And configuring a self-defined session parameter dynamic _ bucket _ num according to the optimal bucket dividing number calculated in step 2.
4. Split the files according to the file names, the sparse index inside each file, and the bucket value ranges obtained in step 3. In the example above, each range file is split into the range sub-buckets it covers; the effect is shown in fig. 3.
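This splitting step amounts to intersecting the key range encoded in each pre-split file's name with the bucket value ranges from step 3; a minimal sketch (the function and the small ranges used below are illustrative):

```python
def split_file(file_range, bucket_ranges):
    """Split one pre-split range file into the sub-bucket ranges it overlaps."""
    lo, hi = file_range
    pieces = []
    for bucket_id, (blo, bhi) in enumerate(bucket_ranges):
        s, e = max(lo, blo), min(hi, bhi)
        if s <= e:  # non-empty intersection: this file feeds this bucket
            pieces.append((bucket_id, (s, e)))
    return pieces
```

In the example above, the first pre-split file, covering the first quarter of the long range, would feed exactly buckets 0 and 1 of the 8 target buckets.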
5. The rest of the bucket scheduling can follow Presto's native scheduling scheme.
It can be seen that in the embodiments of this application, files are pre-split by the value range of the logical primary key, the pre-split files are sorted internally by the logical primary key, and bucketing is done over dynamic primary-key ranges. These three points ensure that no shuffle is needed; by contrast, fig. 4 shows the data re-import required when, without dynamic bucketing, the bucket count is adjusted by rebuilding the table. Files only require a simple range-based split (as shown in fig. 2), which greatly saves disk and network IO. The bucket count and parallelism are estimated from available resources, concurrency, and historical query records, preserving query flexibility and achieving the best computing performance under limited CPU and memory.
In summary, the embodiments of this application provide a Presto-based dynamic bucketing method: a data set is acquired, sorted by a logical primary key, pre-split into buckets according to the value range of the logical primary key and the number of compute nodes, and stored as bucket files sorted internally by the logical primary key; an estimated query peak memory is determined from the data volume and historical query records; a target bucket count and per-node bucket concurrency are calculated from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system; the bucket granularity of the executing query is adjusted accordingly to obtain the value range of each bucket; and the files are split according to the stored bucket-file information and each bucket's value range. Because files are pre-split by the value range of the logical primary key, sorted internally by that key, and bucketed over dynamic primary-key ranges, no shuffle is needed when re-bucketing: files only require a simple range-based split, which greatly saves disk and network IO. The bucket count and parallelism are estimated from available resources, concurrency, and historical query records, preserving query flexibility and achieving the best computing performance under limited CPU and memory.
Based on the same technical concept, an embodiment of this application further provides a Presto-based dynamic bucketing system, as shown in fig. 5. The system includes:
a pre-bucketing module 501, configured to acquire a data set, sort it by a logical primary key, pre-split it into buckets according to the value range of the logical primary key and the number of compute nodes, and store the bucket files, each sorted internally by the logical primary key;
a query peak memory estimation module 502, configured to determine an estimated query peak memory from the data volume and historical query records;
a calculation module 503, configured to calculate a target bucket count and a per-node bucket concurrency from the estimated query peak memory, the number of queries the system is currently executing, and the number of compute nodes in the distributed computing system;
an adjustment module 504, configured to adjust the bucket granularity of the executing query according to the target bucket count and per-node bucket concurrency, obtaining the value range of each bucket;
and a splitting module 505, configured to split the files according to the stored bucket-file information and the value range of each bucket.
In a possible implementation, the calculation module 503 is specifically configured to: estimate the query peak memory M1 for single-node execution; estimate, from the pre-split bucket count B, the peak memory M2 = M1/B needed to run one bucket at a time; calculate the maximum usable memory of a single compute node M5 = MIN(M3, M4) from the remaining capacity M3 of the current Presto general memory pool and the per-query single-node memory limit M4; if M5 > M2, take the pre-split count B as the optimal bucket count and choose the bucket concurrency K such that M5 > M2 × K and M5 < M2 × (K + 1); if M5 < M2, choose an integer bucket-splitting coefficient b such that M2/b < M5 and M2/(b-1) > M5, in which case the optimal bucket count is B × b and the bucket concurrency is 1.
In a possible implementation, the adjustment module 504 is specifically configured to: configure the system built-in session parameter concurrent_lifespans_per_task according to the target bucket number; and configure the custom session parameter dynamic_bucket_num according to the single-node bucket concurrency number.
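A minimal sketch of how the adjustment module might emit these settings for a Presto session. The mapping (built-in property from the target bucket number, custom property from the per-node concurrency) follows the text; `concurrent_lifespans_per_task` is a standard Presto session property, while `dynamic_bucket_num` is the patent's custom one and is assumed here.

```python
def session_setup_statements(target_bucket_num: int, bucket_concurrency: int):
    """Build the SET SESSION statements the adjustment step would issue."""
    return [
        # Built-in property, configured from the target bucket number per the text.
        f"SET SESSION concurrent_lifespans_per_task = {target_bucket_num}",
        # Custom property, configured from the single-node bucket concurrency.
        f"SET SESSION dynamic_bucket_num = {bucket_concurrency}",
    ]
```

These strings could then be executed on the coordinator before the query, e.g. through a Presto client connection.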
In a possible implementation, the pre-bucketing module 501 is specifically configured to: store the bucket files in a columnar storage format with sparse indexes.
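A sparse index over a sorted columnar bucket file can be as simple as recording every N-th key together with its row offset (an ORC-style sketch; the stride value and helper names are illustrative, not from the patent):

```python
from bisect import bisect_right

def build_sparse_index(sorted_keys, stride=4096):
    """Record every `stride`-th (row_offset, key) pair of a sorted bucket
    file, so a range scan can seek near its lower bound instead of
    reading from row 0."""
    return [(i, sorted_keys[i]) for i in range(0, len(sorted_keys), stride)]

def seek_start(index, lo):
    """Return the indexed row offset at which a scan for keys >= lo may begin."""
    keys = [k for _, k in index]
    pos = bisect_right(keys, lo) - 1
    return index[max(pos, 0)][0]
```

The index stays tiny (one entry per stride) while still letting the splitting step skip whole stripes whose key range falls outside a bucket's value range.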
Based on the same technical concept, an embodiment of the present application further provides an apparatus, comprising: a data acquisition device, a processor, and a memory; the data acquisition device is configured to acquire data; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method described above.
Based on the same technical concept, an embodiment of the present application further provides a computer-readable storage medium containing one or more program instructions for performing the method described above.
The method embodiments in this specification are described in a progressive manner: identical or similar parts among the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. For the system and apparatus embodiments, reference may be made to the description of the corresponding method embodiments.
It should be noted that although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps.
Although the present application provides the method steps described in the embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive effort. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of performance. In practice, an apparatus or client product may execute the steps sequentially or in parallel according to the embodiments or the methods shown in the figures (for example, in a parallel-processor or multithreaded environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded.
The units, devices, modules, etc. set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, in implementing the present application, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of a plurality of sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. The means for performing the functions may even be regarded as both software modules implementing the method and structures within the hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-mentioned embodiments are further described in detail for the purpose of illustrating the invention, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A Presto-based dynamic bucketing method, characterized by comprising the following steps:
acquiring a data set, sorting the data set by a logical primary key, pre-bucketing according to the value range of the logical primary key and the number of compute nodes, and storing bucket files; the bucket files are sorted by the logical primary key;
determining an estimated query peak memory according to the data volume and historical query records;
calculating a target bucket number and a single-node bucket concurrency number according to the estimated query peak memory, the number of queries currently being executed by the system, and the number of compute nodes of the current distributed computing system;
adjusting the bucket granularity of the executing query according to the target bucket number and the single-node bucket concurrency number to obtain the value range of each bucket;
splitting files according to the stored bucket file information and the value range of each bucket;
wherein calculating the target bucket number and the single-node bucket concurrency number according to the estimated query peak memory, the number of queries currently being executed by the system, and the number of compute nodes of the current distributed computing system comprises the following steps:
estimating the query peak memory M1 for single-node execution;
estimating, from the pre-bucketed bucket count B, the peak memory M2 = M1/B required to run one bucket at a time;
calculating the maximum available memory of a single compute node, M5 = MIN(M3, M4), from the remaining capacity M3 of the current Presto general memory pool and the maximum per-node memory usage M4 of a single query;
if M5 > M2, taking the pre-bucketed count B as the optimal bucket number, and calculating a bucket concurrency K such that M5 > M2 × K and M5 < M2 × (K + 1);
if M5 < M2, calculating an integer bucket-splitting coefficient b such that M4/b > M2 and M5/(b-1) < M2, in which case the optimal bucket number is B × b and the bucket concurrency is 1.
2. The method of claim 1, wherein adjusting the bucket granularity of the executing query according to the target bucket number and the single-node bucket concurrency number comprises:
configuring the system built-in session parameter concurrent_lifespans_per_task according to the target bucket number; and
configuring the custom session parameter dynamic_bucket_num according to the single-node bucket concurrency number.
3. The method of claim 1, wherein storing the bucket files comprises:
storing the bucket files in a columnar storage format with sparse indexes.
4. A Presto-based dynamic bucketing system, characterized in that the system comprises:
a pre-bucketing module, configured to acquire a data set, sort it by a logical primary key, pre-bucket the data according to the value range of the logical primary key and the number of compute nodes, and store bucket files; the bucket files are sorted by the logical primary key;
a query peak memory estimation module, configured to determine an estimated query peak memory according to the data volume and historical query records;
a calculation module, configured to calculate a target bucket number and a single-node bucket concurrency number according to the estimated query peak memory, the number of queries currently being executed by the system, and the number of compute nodes of the current distributed computing system;
an adjustment module, configured to adjust the bucket granularity of the executing query according to the target bucket number and the single-node bucket concurrency number to obtain the value range of each bucket;
a splitting module, configured to split files according to the stored bucket file information and the value range of each bucket;
wherein the calculation module is specifically configured to:
estimate the query peak memory M1 for single-node execution;
estimate, from the pre-bucketed bucket count B, the peak memory M2 = M1/B required to run one bucket at a time;
calculate the maximum available memory of a single compute node, M5 = MIN(M3, M4), from the remaining capacity M3 of the current Presto general memory pool and the maximum per-node memory usage M4 of a single query;
if M5 > M2, take the pre-bucketed count B as the optimal bucket number, and calculate a bucket concurrency K such that M5 > M2 × K and M5 < M2 × (K + 1);
if M5 < M2, calculate an integer bucket-splitting coefficient b such that M4/b > M2 and M5/(b-1) < M2, in which case the optimal bucket number is B × b and the bucket concurrency is 1.
5. The system of claim 4, wherein the adjustment module is specifically configured to:
configure the system built-in session parameter concurrent_lifespans_per_task according to the target bucket number; and
configure the custom session parameter dynamic_bucket_num according to the single-node bucket concurrency number.
6. The system of claim 4, wherein the pre-bucketing module is specifically configured to:
store the bucket files in a columnar storage format with sparse indexes.
7. An apparatus, characterized in that the apparatus comprises: a data acquisition device, a processor, and a memory;
wherein the data acquisition device is configured to acquire data; the memory is configured to store one or more program instructions; and the processor is configured to execute the one or more program instructions to perform the method of any one of claims 1-3.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium contains one or more program instructions for performing the method of any one of claims 1-3.
CN202011310738.XA 2020-11-20 2020-11-20 Presto-based dynamic barrel dividing method, system, equipment and readable storage medium Active CN112445776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310738.XA CN112445776B (en) 2020-11-20 2020-11-20 Presto-based dynamic barrel dividing method, system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310738.XA CN112445776B (en) 2020-11-20 2020-11-20 Presto-based dynamic barrel dividing method, system, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112445776A CN112445776A (en) 2021-03-05
CN112445776B 2022-12-20

Family

ID=74737110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310738.XA Active CN112445776B (en) 2020-11-20 2020-11-20 Presto-based dynamic barrel dividing method, system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112445776B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111033A (en) * 2021-04-07 2021-07-13 山东英信计算机技术有限公司 Method and system for dynamically redistributing bucket indexes in distributed object storage system
CN117390106B (en) * 2023-12-11 2024-03-12 杭州网易云音乐科技有限公司 Data processing method, device, storage medium and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810247A (en) * 2014-01-10 2014-05-21 国网信通亿力科技有限责任公司 Disaster recovery data comparing method based on bucket algorithm
CN103914462A (en) * 2012-12-31 2014-07-09 中国移动通信集团公司 Data storage and query method and device
CN108241539A (en) * 2018-01-03 2018-07-03 百度在线网络技术(北京)有限公司 Interactive big data querying method, device, storage medium and terminal device based on distributed system
CN110196858A (en) * 2019-06-05 2019-09-03 浪潮软件集团有限公司 A method of data update is carried out based on Hive Mutation API
CN111723089A (en) * 2019-03-21 2020-09-29 北京沃东天骏信息技术有限公司 Method and device for processing data based on columnar storage format

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11228489B2 (en) * 2018-01-23 2022-01-18 Qubole, Inc. System and methods for auto-tuning big data workloads on cloud platforms

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914462A (en) * 2012-12-31 2014-07-09 中国移动通信集团公司 Data storage and query method and device
CN103810247A (en) * 2014-01-10 2014-05-21 国网信通亿力科技有限责任公司 Disaster recovery data comparing method based on bucket algorithm
CN108241539A (en) * 2018-01-03 2018-07-03 百度在线网络技术(北京)有限公司 Interactive big data querying method, device, storage medium and terminal device based on distributed system
CN111723089A (en) * 2019-03-21 2020-09-29 北京沃东天骏信息技术有限公司 Method and device for processing data based on columnar storage format
CN110196858A (en) * 2019-06-05 2019-09-03 浪潮软件集团有限公司 A method of data update is carried out based on Hive Mutation API

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HashMap Optimization and Its Application in Column-Store Database Queries; Mu Hongfen et al.; Journal of Frontiers of Computer Science and Technology; 2015-10-29 (No. 09); full text *
Presto: SQL on Everything; Raghav Sethi et al.; 2019 IEEE 35th International Conference on Data Engineering (ICDE); 2019-06-06; full text *
Design and Implementation of a Big Data Masking System Based on T-Closeness; Shao Huaxi; China Master's Theses Full-text Database; 2019-08-15; pp. 38-44 *
Key Technologies for Structured Big Data Storage and Query Optimization; Xu Tao; China Doctoral Dissertations Full-text Database; 2018-05-15; pp. 33-64 *

Also Published As

Publication number Publication date
CN112445776A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
US10771538B2 (en) Automated ETL resource provisioner
Konstantinou et al. On the elasticity of NoSQL databases over cloud management platforms
US9721007B2 (en) Parallel data sorting
CN102968498A (en) Method and device for processing data
CN105279276A (en) Database index optimization system
CN107391502B (en) Time interval data query method and device and index construction method and device
US20140351239A1 (en) Hardware acceleration for query operators
US20190065546A1 (en) Multi stage aggregation using digest order after a first stage of aggregation
CN112445776B (en) Presto-based dynamic barrel dividing method, system, equipment and readable storage medium
Elsayed et al. Mapreduce: State-of-the-art and research directions
Wickremesinghe et al. Distributed computing with load-managed active storage
US20150286748A1 (en) Data Transformation System and Method
CN104111936A (en) Method and system for querying data
CN112148693A (en) Data processing method, device and storage medium
Tang et al. An intermediate data partition algorithm for skew mitigation in spark computing environment
Silberstein et al. Efficient bulk insertion into a distributed ordered table
CN103036697B (en) Multi-dimensional data duplicate removal method and system
CN108628898A (en) The method, apparatus and equipment of data loading
Zhi et al. Research of Hadoop-based data flow management system
CN114327857A (en) Operation data processing method and device, computer equipment and storage medium
CN111949681A (en) Data aggregation processing device and method and storage medium
Harsh et al. Histogram sort with sampling
Vu et al. R*-grove: Balanced spatial partitioning for large-scale datasets
CN107679133B (en) Mining method applicable to massive real-time PMU data
Liroz-Gistau et al. Dynamic workload-based partitioning algorithms for continuously growing databases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 305, 3rd Floor, Building 25, No. 10 Jiuxianqiao Road, Chaoyang District, Beijing, 100016

Patentee after: Beijing Analysys Digital Intelligence Technology Co.,Ltd.

Address before: 100020 Room 305, 3rd floor, building 25, 10 Jiuxianqiao Road, Chaoyang District, Beijing

Patentee before: BEIJING ANALYSYS THINK TANK NETWORK TECHNOLOGY Co.,Ltd.