CN117648341A - Method and system for quickly assembling data based on disk memory in limited resources - Google Patents

Method and system for quickly assembling data based on disk memory in limited resources Download PDF

Info

Publication number
CN117648341A
CN117648341A CN202311560295.3A CN202311560295A CN117648341A CN 117648341 A CN117648341 A CN 117648341A CN 202311560295 A CN202311560295 A CN 202311560295A CN 117648341 A CN117648341 A CN 117648341A
Authority
CN
China
Prior art keywords
sql
data
memory
data source
disk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311560295.3A
Other languages
Chinese (zh)
Other versions
CN117648341B (en
Inventor
赵蒙海
陈海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinshida Weining Software Technology Co ltd
Original Assignee
Shanghai Jinshida Weining Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinshida Weining Software Technology Co ltd filed Critical Shanghai Jinshida Weining Software Technology Co ltd
Priority to CN202311560295.3A priority Critical patent/CN117648341B/en
Priority claimed from CN202311560295.3A external-priority patent/CN117648341B/en
Publication of CN117648341A publication Critical patent/CN117648341A/en
Application granted granted Critical
Publication of CN117648341B publication Critical patent/CN117648341B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for quickly assembling data in limited resources based on disk memory, wherein the method comprises the following steps: step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan; step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source; step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized. According to the invention, by means of the memory and disk hash join and disk overflow writing mode, a large amount of data is more stable in single-row aggregation, the limit of the single-row size on different heavy running is avoided, and time and resources are wasted.

Description

Method and system for quickly assembling data based on disk memory in limited resources
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a system for quickly assembling data based on disk memory in limited resources.
Background
Some clients have limited resources and cannot provide large data-related environments and resources, but have a need to analyze the volume of data in the billions. In order to meet the requirements of storing a large amount of data and stably assembling and analyzing the large amount of data on the premise of limited resources (only one or two servers of 4c8g and 8c16 g), the invention provides a method and a system for rapidly assembling the data on the basis of disk memory in limited resources.
In the prior art, for example, the analytical databases, maxcomputer, hive, require a lot of resources and are not very capital intensive. While also having respective drawbacks for analyzing large amounts of data. When the Alicloud Maxcomputer performs group by operation, when one column is aggregated, the default 8MB is exceeded, tasks are wrongly reported, and only data exceeding 8MB and data not exceeding 8MB can be processed separately. Hive uses concat_ws for aggregation, and the resulting size is limited to the maximum length of the varchar type, which may lead to performance degradation or other problems for Hive clusters.
Patent document CN112860730a (application number 202110336670.0) discloses a method and apparatus for processing an SQL statement, an electronic device, and a readable storage medium. The method comprises the following steps: when receiving an SQL sentence through a preset SQL channel, generating a query execution plan based on the SQL sentence; creating a plurality of query tasks executable in parallel based on a query execution plan; and processing each inquiry task respectively.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for quickly assembling data based on disk memory in limited resources.
The method for quickly assembling data based on disk memory in limited resources provided by the invention comprises the following steps:
step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
Preferably, the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine which transmits dynamic parameters during execution.
Preferably, the step S1 employs: an sql engine generated based on Antlr 4; and performing lexical analysis, grammar analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan.
Preferably, the nested queries are rewritten as join queries based on the parsing.
Preferably, the step S2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; and converting the read data source into a table structure object and registering the table structure object as a temporary memory table.
Preferably, the step S3 employs: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
The invention provides a system for quickly assembling data on the basis of disk memory in limited resources, which comprises the following components:
module M1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
module M2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
module M3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
Preferably, the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine which transmits dynamic parameters during execution.
Preferably, the module M1 employs: an sql engine generated based on Antlr 4; performing lexical analysis, grammatical analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan;
the nested queries are rewritten as join queries based on the parsing.
Preferably, the module M2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; converting the read data source into a table structure object and registering the table structure object as a memory temporary table;
the module M3 employs: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention uses SQL to prepare and analyze data, which can realize stable and stable analysis of a large amount of data under the premise of limited resources (only one two servers of 4c8g and 8c16 g);
2. the invention realizes the technical effect of rapidly processing a large amount of data by adding the disk into the memory for stream processing;
3. the invention realizes the technical effect of sql management data processing through the grammar parser generated by Antlr 4;
4. according to the invention, the local disk of a single machine is used as data exchange and temporary storage, so that the cost is greatly reduced;
5. according to the invention, by means of adding a disk hash join into the memory and simultaneously overflowing the disk, a large amount of data is more stably aggregated in a single row, the limit of the size of the single row on different heavy running is avoided, and time and resources are wasted;
6. the invention describes the data analysis service through the sql statement, thereby reducing the learning cost.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a flow chart for managing SQL statements using a template engine.
FIG. 2 is a flow chart of converting an parse and optimize sql statement into a physical execution plan.
FIG. 3 is a flow chart for partitioning a preprocessed data source, and concurrently executing multiple partitions.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1
The method for quickly assembling data based on disk memory in limited resources provided by the invention comprises the following steps:
step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
Specifically, as shown in fig. 1, the fixed service manages the service sql through the template engine, and when the fixed service is executed, the template engine transmits dynamic parameters to obtain executable sql, and the executable sql is transmitted to the sql engine to be executed.
Specifically, as shown in fig. 2, the step S1 employs: an sql engine generated based on antlr4 analyzes and optimizes the sql;
more specifically, an sql engine generated based on Antlr4 performs lexical analysis, syntax analysis, sql optimization on executed sql statements, and generates a physical execution plan; wherein complex nested queries are rewritten as more efficient join queries based on parsing.
Specifically, the step S2 employs: and according to the data sources which are transmitted in when the sql is executed, adopting readers adapted to each data source (database, text file, csv and the like) to read the data, converting the read data into an internal table structure object and registering the internal table structure object as a temporary memory table, and executing a physical plan, wherein the data source of each sql operator is derived from the table.
Specifically, as shown in fig. 3, the step S3 employs: aiming at the optimization of the problem of execution failure caused by a single column being more than 8MB or a certain size when the common cloud platform big data component, the open source hive and the spark are used for processing packet aggregation:
partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
The table data line is partitioned according to the grouping field, and each partition is independently processed. The single partition file contains only one full amount of data of a group, and no ordering is required. And writing the result file according to the sequence of the output columns, wherein the fixed cache case temporary aggregation data are preset when the aggregation columns are aggregated, and the result file is immediately written when overflow occurs.
Calculating the number of partitions: select count (1) from (select 1 from [ table name ] group by [ partition field ]) tb;
partition logic: hash (partition field)% partition number;
polymerization: and acquiring the result according to the field sequence after selection and writing the result into a result file.
And setting a fixed cache size S in the aggregation column, if the same grouping aggregation exceeds S without finishing, immediately writing a result file, and continuing the same operation until the overflow or the aggregation is finished again.
Example 2
The invention provides a system for quickly assembling data on the basis of disk memory in limited resources, which comprises the following components:
module M1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
module M2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
module M3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
Specifically, as shown in fig. 1, the fixed service manages the service sql through the template engine, and when the fixed service is executed, the template engine transmits dynamic parameters to obtain executable sql, and the executable sql is transmitted to the sql engine to be executed.
Specifically, as shown in fig. 2, the module M1 employs: an sql engine generated based on antlr4 analyzes and optimizes the sql;
more specifically, an sql engine generated based on Antlr4 performs lexical analysis, syntax analysis, sql optimization on executed sql statements, and generates a physical execution plan; wherein complex nested queries are rewritten as more efficient join queries based on parsing.
Specifically, the module M2 employs: and according to the data sources which are transmitted in when the sql is executed, adopting readers adapted to each data source (database, text file, csv and the like) to read the data, converting the read data into an internal table structure object and registering the internal table structure object as a temporary memory table, and executing a physical plan, wherein the data source of each sql operator is derived from the table.
Specifically, as shown in fig. 3, the module M3 employs: aiming at the optimization of the problem of execution failure caused by a single column being more than 8MB or a certain size when the common cloud platform big data component, the open source hive and the spark are used for processing packet aggregation:
partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
The table data line is partitioned according to the grouping field, and each partition is independently processed. The single partition file contains only one full amount of data of a group, and no ordering is required. And writing the result file according to the sequence of the output columns, wherein the fixed cache case temporary aggregation data are preset when the aggregation columns are aggregated, and the result file is immediately written when overflow occurs.
Calculating the number of partitions: select count (1) from (select 1 from [ table name ] group by [ partition field ]) tb;
partition logic: hash (partition field)% partition number;
polymerization: and acquiring the result according to the field sequence after selection and writing the result into a result file.
And setting a fixed cache size S in the aggregation column, if the same grouping aggregation exceeds S without finishing, immediately writing a result file, and continuing the same operation until the overflow or the aggregation is finished again.
Example 3
Example 3 is a preferred example of example 1/example 2
Managing SQL statements using a template engine
Acquiring an sql file proxy through JDK reflection, an sql xml file and an interface corresponding to the sql xml file
Proxy, newproxyInstance (classLoader, new Class [ ] { interface Class object corresponding to sql xml file },
new InvocationHandler (sql xml file resource path));
the invoke method of the InvotionHandler is implemented: the class loader acquires the input stream of the sql xml, reads the input stream to construct the org.w3c.dom.document object, takes out the corresponding sql statement from the Document object through the interface name, and completes the dynamic parameter replacement by placeholder replacement.
1. Achieving sql resolution using Antlr4
And writing a g4 grammar file of the Hive grammar for lexical analysis and grammar analysis of the sql statement to generate AST.
g4 grammar file take simple implementation of select query as an example:
generation of sql parser base code using Antlr4 tools or Maven plug-ins
Antrl4 tool:
antlr4-package org.example.calc-listener-visitor.\Sql.g4
maven plug-in:
to optimize the SQL statement, the parse tree needs to be traversed and processed. Implementation can be achieved using the ParseTreeWalker class provided by ANTLR and a custom ParseTreeListener implementation: in ParseTreeListener, various SQL optimization techniques, such as predicate pushdown, projection column clipping, connection order optimization, etc., can be applied
Packet aggregation
The grouping algorithm adopts a conventional shuffle algorithm, writes the same group of data hash into the same local file, and traverses each group of data for aggregation.
When grouping is aggregated, such as a concat_ws function, the same group of data volume is too large, which can cause that a column of data size is uncontrollable, and a column of maximum size is controlled by a general cloud platform through global parameters, such as an alicloud maxcompter column of aggregate size default 8MB.
Here each aggregate column is isolated by a file, and the aggregate size is limited during aggregation using a temporary buffer, and immediately overflows the disk by means of an application beyond the size. After the same partition is processed, a row of data is generated according to the group by select field sequence.
Those skilled in the art will appreciate that the invention provides a system and its individual devices, modules, units, etc. that can be implemented entirely by logic programming of method steps, in addition to being implemented as pure computer readable program code, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units for realizing various functions included in the system can also be regarded as structures in the hardware component; means, modules, and units for implementing the various functions may also be considered as either software modules for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for rapidly assembling data based on disk memory in a limited resource, comprising:
step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
2. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine by inputting dynamic parameters during execution.
3. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S1 uses: an sql engine generated based on Antlr 4; and performing lexical analysis, grammar analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan.
4. The method for quickly assembling data on a limited resource based on disk memory according to claim 1, wherein nested queries are rewritten as join queries based on parsing.
5. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S2 uses: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; and converting the read data source into a table structure object and registering the table structure object as a temporary memory table.
6. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S3 uses: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
7. A system for rapidly assembling data based on disk memory in a limited resource, comprising:
module M1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;
module M2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;
module M3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.
8. The system for quickly assembling data on a limited resource based on disk memory according to claim 7, wherein the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine by inputting dynamic parameters during execution.
9. The system for fast assembly of data in a limited resource based on disk memory of claim 7, wherein said module M1 employs: an sql engine generated based on Antlr 4; performing lexical analysis, grammatical analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan;
the nested queries are rewritten as join queries based on the parsing.
10. The system for fast assembly of data in a limited resource based on disk memory of claim 7, wherein said module M2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; converting the read data source into a table structure object and registering the table structure object as a memory temporary table;
the module M3 employs: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.
CN202311560295.3A 2023-11-21 Method and system for quickly assembling data based on disk memory in limited resources Active CN117648341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311560295.3A CN117648341B (en) 2023-11-21 Method and system for quickly assembling data based on disk memory in limited resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311560295.3A CN117648341B (en) 2023-11-21 Method and system for quickly assembling data based on disk memory in limited resources

Publications (2)

Publication Number Publication Date
CN117648341A true CN117648341A (en) 2024-03-05
CN117648341B CN117648341B (en) 2024-07-09

Family

ID=

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239612A1 (en) * 2011-01-25 2012-09-20 Muthian George User defined functions for data loading
CN107818181A (en) * 2017-11-27 2018-03-20 深圳市华成峰科技有限公司 Indexing means and its system based on Plcient interactive mode engines
US20180218044A1 (en) * 2017-01-31 2018-08-02 Salesforce.Com, Inc. Systems, methods, and apparatuses for implementing a by partition command term within a multi-tenant aware structured query language
CN110704479A (en) * 2019-09-12 2020-01-17 新华三大数据技术有限公司 Task processing method and device, electronic equipment and storage medium
CN112860730A (en) * 2021-03-29 2021-05-28 中信银行股份有限公司 SQL statement processing method and device, electronic equipment and readable storage medium
CN114756629A (en) * 2022-06-16 2022-07-15 之江实验室 Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN116756150A (en) * 2023-08-16 2023-09-15 浩鲸云计算科技股份有限公司 Mpp database large table association acceleration method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120239612A1 (en) * 2011-01-25 2012-09-20 Muthian George User defined functions for data loading
US20180218044A1 (en) * 2017-01-31 2018-08-02 Salesforce.Com, Inc. Systems, methods, and apparatuses for implementing a by partition command term within a multi-tenant aware structured query language
CN107818181A (en) * 2017-11-27 2018-03-20 深圳市华成峰科技有限公司 Indexing means and its system based on Plcient interactive mode engines
CN110704479A (en) * 2019-09-12 2020-01-17 新华三大数据技术有限公司 Task processing method and device, electronic equipment and storage medium
CN112860730A (en) * 2021-03-29 2021-05-28 中信银行股份有限公司 SQL statement processing method and device, electronic equipment and readable storage medium
CN114756629A (en) * 2022-06-16 2022-07-15 之江实验室 Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN116756150A (en) * 2023-08-16 2023-09-15 浩鲸云计算科技股份有限公司 Mpp database large table association acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
000X000: ""HiveSQL原理和优化详解"", pages 1 - 18, Retrieved from the Internet <URL:https://blog.csdn.net/ytp552200ytp/article/details/124151930> *

Similar Documents

Publication Publication Date Title
US11481253B2 (en) Managing the processing of streamed data in a data streaming application using query information from a relational database
US20170083573A1 (en) Multi-query optimization
Lee et al. Parallel data processing with MapReduce: a survey
JP5298117B2 (en) Data merging in distributed computing
CN108681569B (en) Automatic data analysis system and method thereof
Bruno et al. Advanced join strategies for large-scale distributed computation
EP2876562A1 (en) Data statistics in data management systems
US11580147B2 (en) Conversational database analysis
US20210124739A1 (en) Query Processing with Machine Learning
JP2016509294A (en) System and method for a distributed database query engine
CN110209668B (en) Dimension table association method, device and equipment based on stream calculation and readable storage medium
Elsayed et al. Mapreduce: State-of-the-art and research directions
Zhang et al. Hardware-conscious stream processing: A survey
CN112860730A (en) SQL statement processing method and device, electronic equipment and readable storage medium
Sinthong et al. Aframe: Extending dataframes for large-scale modern data analysis
Bidoit et al. Processing XML queries and updates on map/reduce clusters
Dann et al. Non-relational databases on FPGAs: Survey, design decisions, challenges
Hassan et al. S3QLRDF: distributed SPARQL query processing using Apache Spark—a comparative performance study
Prammer et al. Introducing a Query Acceleration Path for Analytics in SQLite3
CN117648341B (en) Method and system for quickly assembling data based on disk memory in limited resources
CN106843822B (en) Execution code generation method and equipment
CN117349368A (en) Cross-database data real-time synchronous task management system and method based on Flink
Zhang et al. Design and implementation of a real-time interactive analytics system for large spatio-temporal data
Sinthong et al. AFrame: Extending DataFrames for large-scale modern data analysis (Extended Version)
CN117648341A (en) Method and system for quickly assembling data based on disk memory in limited resources

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant