CN117648341A

CN117648341A - Method and system for quickly assembling data based on disk memory in limited resources

Info

Publication number: CN117648341A
Application number: CN202311560295.3A
Authority: CN
Inventors: 赵蒙海; 陈海涛
Original assignee: Shanghai Jinshida Weining Software Technology Co ltd
Current assignee: Shanghai Jinshida Weining Software Technology Co ltd
Priority date: 2023-11-21
Filing date: 2023-11-21
Publication date: 2024-03-05
Anticipated expiration: 2043-11-21

Abstract

The invention provides a method and a system for quickly assembling data in limited resources based on disk memory, wherein the method comprises the following steps: step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan; step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source; step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized. According to the invention, by means of the memory and disk hash join and disk overflow writing mode, a large amount of data is more stable in single-row aggregation, the limit of the single-row size on different heavy running is avoided, and time and resources are wasted.

Description

Method and system for quickly assembling data based on disk memory in limited resources

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a system for quickly assembling data based on disk memory in limited resources.

Background

Some clients have limited resources and cannot provide large data-related environments and resources, but have a need to analyze the volume of data in the billions. In order to meet the requirements of storing a large amount of data and stably assembling and analyzing the large amount of data on the premise of limited resources (only one or two servers of 4c8g and 8c16 g), the invention provides a method and a system for rapidly assembling the data on the basis of disk memory in limited resources.

In the prior art, for example, the analytical databases, maxcomputer, hive, require a lot of resources and are not very capital intensive. While also having respective drawbacks for analyzing large amounts of data. When the Alicloud Maxcomputer performs group by operation, when one column is aggregated, the default 8MB is exceeded, tasks are wrongly reported, and only data exceeding 8MB and data not exceeding 8MB can be processed separately. Hive uses concat_ws for aggregation, and the resulting size is limited to the maximum length of the varchar type, which may lead to performance degradation or other problems for Hive clusters.

Patent document CN112860730a (application number 202110336670.0) discloses a method and apparatus for processing an SQL statement, an electronic device, and a readable storage medium. The method comprises the following steps: when receiving an SQL sentence through a preset SQL channel, generating a query execution plan based on the SQL sentence; creating a plurality of query tasks executable in parallel based on a query execution plan; and processing each inquiry task respectively.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a system for quickly assembling data based on disk memory in limited resources.

The method for quickly assembling data based on disk memory in limited resources provided by the invention comprises the following steps:

step S1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;

step S2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;

step S3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.

Preferably, the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine which transmits dynamic parameters during execution.

Preferably, the step S1 employs: an sql engine generated based on Antlr 4; and performing lexical analysis, grammar analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan.

Preferably, the nested queries are rewritten as join queries based on the parsing.

Preferably, the step S2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; and converting the read data source into a table structure object and registering the table structure object as a temporary memory table.

Preferably, the step S3 employs: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.

The invention provides a system for quickly assembling data on the basis of disk memory in limited resources, which comprises the following components:

module M1: the sql engine analyzes and optimizes the executed sql statement to generate a physical execution plan;

module M2: reading an incoming data source when the sql is executed, and preprocessing the incoming data source;

module M3: the physical execution plan partitions the preprocessed data source, so that multi-partition concurrent execution is realized.

Preferably, the module M1 employs: an sql engine generated based on Antlr 4; performing lexical analysis, grammatical analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan;

the nested queries are rewritten as join queries based on the parsing.

Preferably, the module M2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; converting the read data source into a table structure object and registering the table structure object as a memory temporary table;

the module M3 employs: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention uses SQL to prepare and analyze data, which can realize stable and stable analysis of a large amount of data under the premise of limited resources (only one two servers of 4c8g and 8c16 g);

2. the invention realizes the technical effect of rapidly processing a large amount of data by adding the disk into the memory for stream processing;

3. the invention realizes the technical effect of sql management data processing through the grammar parser generated by Antlr 4;

4. according to the invention, the local disk of a single machine is used as data exchange and temporary storage, so that the cost is greatly reduced;

5. according to the invention, by means of adding a disk hash join into the memory and simultaneously overflowing the disk, a large amount of data is more stably aggregated in a single row, the limit of the size of the single row on different heavy running is avoided, and time and resources are wasted;

6. the invention describes the data analysis service through the sql statement, thereby reducing the learning cost.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is a flow chart for managing SQL statements using a template engine.

FIG. 2 is a flow chart of converting an parse and optimize sql statement into a physical execution plan.

FIG. 3 is a flow chart for partitioning a preprocessed data source, and concurrently executing multiple partitions.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.

Example 1

Specifically, as shown in fig. 1, the fixed service manages the service sql through the template engine, and when the fixed service is executed, the template engine transmits dynamic parameters to obtain executable sql, and the executable sql is transmitted to the sql engine to be executed.

Specifically, as shown in fig. 2, the step S1 employs: an sql engine generated based on antlr4 analyzes and optimizes the sql;

more specifically, an sql engine generated based on Antlr4 performs lexical analysis, syntax analysis, sql optimization on executed sql statements, and generates a physical execution plan; wherein complex nested queries are rewritten as more efficient join queries based on parsing.

Specifically, the step S2 employs: and according to the data sources which are transmitted in when the sql is executed, adopting readers adapted to each data source (database, text file, csv and the like) to read the data, converting the read data into an internal table structure object and registering the internal table structure object as a temporary memory table, and executing a physical plan, wherein the data source of each sql operator is derived from the table.

Specifically, as shown in fig. 3, the step S3 employs: aiming at the optimization of the problem of execution failure caused by a single column being more than 8MB or a certain size when the common cloud platform big data component, the open source hive and the spark are used for processing packet aggregation:

partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.

The table data line is partitioned according to the grouping field, and each partition is independently processed. The single partition file contains only one full amount of data of a group, and no ordering is required. And writing the result file according to the sequence of the output columns, wherein the fixed cache case temporary aggregation data are preset when the aggregation columns are aggregated, and the result file is immediately written when overflow occurs.

Calculating the number of partitions: select count (1) from (select 1 from [ table name ] group by [ partition field ]) tb;

partition logic: hash (partition field)% partition number;

polymerization: and acquiring the result according to the field sequence after selection and writing the result into a result file.

And setting a fixed cache size S in the aggregation column, if the same grouping aggregation exceeds S without finishing, immediately writing a result file, and continuing the same operation until the overflow or the aggregation is finished again.

Example 2

Specifically, as shown in fig. 2, the module M1 employs: an sql engine generated based on antlr4 analyzes and optimizes the sql;

Specifically, the module M2 employs: and according to the data sources which are transmitted in when the sql is executed, adopting readers adapted to each data source (database, text file, csv and the like) to read the data, converting the read data into an internal table structure object and registering the internal table structure object as a temporary memory table, and executing a physical plan, wherein the data source of each sql operator is derived from the table.

Specifically, as shown in fig. 3, the module M3 employs: aiming at the optimization of the problem of execution failure caused by a single column being more than 8MB or a certain size when the common cloud platform big data component, the open source hive and the spark are used for processing packet aggregation:

partition logic: hash (partition field)% partition number;

Example 3

Example 3 is a preferred example of example 1/example 2

Managing SQL statements using a template engine

Acquiring an sql file proxy through JDK reflection, an sql xml file and an interface corresponding to the sql xml file

Proxy, newproxyInstance (classLoader, new Class [ ] { interface Class object corresponding to sql xml file },

new InvocationHandler (sql xml file resource path));

the invoke method of the InvotionHandler is implemented: the class loader acquires the input stream of the sql xml, reads the input stream to construct the org.w3c.dom.document object, takes out the corresponding sql statement from the Document object through the interface name, and completes the dynamic parameter replacement by placeholder replacement.

1. Achieving sql resolution using Antlr4

And writing a g4 grammar file of the Hive grammar for lexical analysis and grammar analysis of the sql statement to generate AST.

g4 grammar file take simple implementation of select query as an example:

generation of sql parser base code using Antlr4 tools or Maven plug-ins

Antrl4 tool:

antlr4-package org.example.calc-listener-visitor.\Sql.g4

maven plug-in:

to optimize the SQL statement, the parse tree needs to be traversed and processed. Implementation can be achieved using the ParseTreeWalker class provided by ANTLR and a custom ParseTreeListener implementation: in ParseTreeListener, various SQL optimization techniques, such as predicate pushdown, projection column clipping, connection order optimization, etc., can be applied

Packet aggregation

The grouping algorithm adopts a conventional shuffle algorithm, writes the same group of data hash into the same local file, and traverses each group of data for aggregation.

When grouping is aggregated, such as a concat_ws function, the same group of data volume is too large, which can cause that a column of data size is uncontrollable, and a column of maximum size is controlled by a general cloud platform through global parameters, such as an alicloud maxcompter column of aggregate size default 8MB.

Here each aggregate column is isolated by a file, and the aggregate size is limited during aggregation using a temporary buffer, and immediately overflows the disk by means of an application beyond the size. After the same partition is processed, a row of data is generated according to the group by select field sequence.

Those skilled in the art will appreciate that the invention provides a system and its individual devices, modules, units, etc. that can be implemented entirely by logic programming of method steps, in addition to being implemented as pure computer readable program code, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units for realizing various functions included in the system can also be regarded as structures in the hardware component; means, modules, and units for implementing the various functions may also be considered as either software modules for implementing the methods or structures within hardware components.

The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims

1. A method for rapidly assembling data based on disk memory in a limited resource, comprising:

2. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine by inputting dynamic parameters during execution.

3. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S1 uses: an sql engine generated based on Antlr 4; and performing lexical analysis, grammar analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan.

4. The method for quickly assembling data on a limited resource based on disk memory according to claim 1, wherein nested queries are rewritten as join queries based on parsing.

5. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S2 uses: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; and converting the read data source into a table structure object and registering the table structure object as a temporary memory table.

6. The method for quickly assembling data on a limited resource based on a disk memory according to claim 1, wherein the step S3 uses: partitioning the table data line of each sql operator participating in calculation according to the grouping field, partitioning into a plurality of small sets which can be completely loaded into the memory for parallel calculation, caching the calculation result in the memory, and overflowing and writing the calculation result into the disk when the memory is insufficient.

7. A system for rapidly assembling data based on disk memory in a limited resource, comprising:

8. The system for quickly assembling data on a limited resource based on disk memory according to claim 7, wherein the business logic is abstracted into sql statements, the sql statements are managed by using a template engine, and the executable sql is obtained by the template engine by inputting dynamic parameters during execution.

9. The system for fast assembly of data in a limited resource based on disk memory of claim 7, wherein said module M1 employs: an sql engine generated based on Antlr 4; performing lexical analysis, grammatical analysis and sql optimization on the executed sql statement by using the generated sql engine to generate a physical execution plan;

the nested queries are rewritten as join queries based on the parsing.

10. The system for fast assembly of data in a limited resource based on disk memory of claim 7, wherein said module M2 employs: according to the data sources transmitted during execution of the sql, adopting readers adapted to each data source to read data; converting the read data source into a table structure object and registering the table structure object as a memory temporary table;