CN117668050A - Cross-data-source hybrid engine query method, system, equipment and medium - Google Patents
Cross-data-source hybrid engine query method, system, equipment and medium
- Publication number: CN117668050A (application CN202311698474.3A)
- Authority: CN (China)
- Prior art keywords: engine, data source, meta, database, sql
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention discloses a cross-data-source hybrid engine query method, system, device and medium in the technical field of big data analysis. The technical problem to be solved is how to provide unified query SQL to the outside, shield the details of the underlying data sources, and flexibly support multiple computing engines. The adopted technical scheme comprises the following steps: S1, establishing a complete database metadata management layer; S2, organizing the computing engines to complete engine initialization mapping; S3, parsing the SQL and selecting a computing engine to execute it. The system comprises a data source layer, a metadata management layer, a hybrid engine layer, a parsing layer and a user interface layer: a complete metadata management layer is established on top of the underlying data source layer, multiple computing engines are supported through the hybrid engine layer, the parsing layer parses the unified SQL input by the user and selects an engine to execute it, and the user interface layer provides SQL input and query result return.
Description
Technical Field
The invention relates to the technical field of big data analysis, and in particular to a cross-data-source hybrid engine query method, system, device and medium.
Background
With the development of the big data industry, data storage engines and computing engines have proliferated and new technologies emerge endlessly: relational databases, non-relational databases, databases based on the Hadoop ecosystem, and databases in the NewSQL direction. Among computing engines, the mainstream options are Spark and Flink; each has its own strengths and both are developing rapidly. Faced with so many storage engine and computing engine schemes, developers spend a great deal of effort learning these technologies and constantly trying various schemes to cope with market demands, and often find the choice agonizing.
Therefore, how to provide unified query SQL to the outside, shield the details of the underlying data sources, and flexibly support multiple computing engines is the technical problem to be solved at present.
Disclosure of Invention
The technical task of the invention is to provide a cross-data-source hybrid engine query method, system, device and medium, solving the problem of how to provide unified query SQL to the outside, shield the details of the underlying data sources, and flexibly support multiple computing engines.
The technical task of the invention is realized as follows: a cross-data-source hybrid engine query method based on Apache Calcite comprises the following steps:
S1, establishing a complete database metadata management layer: the database metadata management covers data source connection information, database keywords, field types and functions, and supports data source expansion;
S2, organizing the computing engines to complete engine initialization mapping: the computing engines include Spark and Flink, and the original engines are not modified;
S3, parsing the SQL and selecting a computing engine to execute it: Apache Calcite is used to parse the SQL statement, and the selected computing engine executes the statement and returns the result.
Preferably, establishing a complete database metadata management layer in step S1 proceeds as follows:
S101, establishing unified data source information, and setting different data source connection parameters according to the data source classification;
S102, establishing unified database metadata;
S103, reading all table information from each data source according to its type, and loading the table information into memory.
More preferably, establishing unified data source information and setting different data source connection parameters according to the data source classification in step S101 proceeds as follows:
S10101, organizing and converging the data sources, which comprises the following steps:
S10102, organizing the connection information of the mainstream databases, specifically: the relational databases include Oracle, MySQL and SQL Server; the non-relational databases include Redis, ES and MongoDB; the Hadoop-based databases include Hive and HBase; the NewSQL-direction databases include ClickHouse and GP;
S10103, organizing the connection parameter information of the different types of data sources, denoted SET_DS_ORGIN, and converging it into a unified data source connection pool, denoted SET_DS_META, where SET_DS_META supports flexible expansion;
S10104, completing the mapping between the original data sources and the unified data sources: the mapping from the original data source set SET_DS_ORGIN to the unified data source set SET_DS_META is established and denoted MAP_DS_ORGIN_META;
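As a minimal illustration of steps S10101 through S10104, the per-type connection information in SET_DS_ORGIN can be converged into one uniform pool SET_DS_META together with the origin-to-unified mapping MAP_DS_ORGIN_META. The sketch below is hypothetical: the connection parameters, key names and the unified schema shape are assumptions for illustration, not part of the patent.

```python
# Hypothetical origin data sources (SET_DS_ORGIN); hosts, ports and database
# names are invented for illustration.
SET_DS_ORGIN = {
    "mysql_sales": {"type": "MySQL", "host": "10.0.0.1", "port": 3306, "db": "sales"},
    "hive_dw":     {"type": "Hive",  "host": "10.0.0.2", "port": 10000, "db": "dw"},
}

def build_unified_pool(origin):
    """Converge heterogeneous connection info into one uniform schema
    (SET_DS_META) and record the origin-to-unified mapping (MAP_DS_ORGIN_META)."""
    meta, mapping = {}, {}
    for name, ds in origin.items():
        meta[name] = {"engine": ds["type"].lower(),
                      "url": f'{ds["host"]}:{ds["port"]}/{ds["db"]}'}
        mapping[name] = name  # here the unified key simply equals the origin key
    return meta, mapping

SET_DS_META, MAP_DS_ORGIN_META = build_unified_pool(SET_DS_ORGIN)
```

The point of the sketch is the separation of concerns: per-type details stay in SET_DS_ORGIN, while every consumer downstream sees only the uniform SET_DS_META shape.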
Establishing unified database metadata in step S102 proceeds as follows:
S10201, organizing the database metadata, specifically: the database keywords in the different types of data sources are organized and denoted SET_KEYWORD_ORGIN, then converged into unified database keywords denoted SET_KEYWORD_META; the field types in the different types of data sources are organized and denoted SET_FIELDTYPE_ORGIN, then converged into unified field types denoted SET_FIELDTYPE_META; the functions in the different types of data sources are organized and denoted SET_FUN_ORGIN, then converged into unified database functions denoted SET_FUN_META;
S10202, completing the mapping between the unified database metadata and the original databases: during the organizing process, the keywords, field types and functions in the original data source set SET_DS_ORGIN are each mapped to the corresponding information in SET_DS_META, and the mapping as a whole is denoted MAP_METADATA_ORGIN_META;
Reading all table information from the data sources and loading it into memory in step S103 proceeds as follows:
The database connection information in the data source set SET_DS_ORGIN is read, the original databases are connected, and all table information in the original databases, including table names, field types and length information, is read and denoted SET_TABLE_ORGIN; table set information SET_TABLE_META in the unified data source is generated according to the mapping relation MAP_METADATA_ORGIN_META; to prevent duplicate table names across different databases, a data source identifier, denoted FLAG_DS, is added to each data source in the data source set SET_DS; when SET_TABLE_ORGIN is mapped to SET_TABLE_META according to MAP_DS_ORGIN_META, each data table is renamed by adding its data source identifier FLAG_DS as a prefix; the mapped data table information SET_TABLE_META is loaded into memory for fast retrieval.
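The renaming rule of step S103 — prefixing every table with the identifier of the data source it came from, so that names stay unique in SET_TABLE_META — can be sketched as follows. The input shape and the underscore separator are illustrative assumptions:

```python
def load_unified_tables(tables_by_ds):
    """Map SET_TABLE_ORGIN to SET_TABLE_META: prefix every table name with its
    data source identifier so same-named tables from different databases
    cannot collide in the unified catalog."""
    set_table_meta = {}
    for flag_ds, tables in tables_by_ds.items():
        for table_name, fields in tables.items():
            set_table_meta[f"{flag_ds}_{table_name}"] = fields
    return set_table_meta

# Two sources each expose an "orders" table; the prefix keeps them distinct.
SET_TABLE_META = load_unified_tables({
    "ds1": {"orders": {"id": "INT"}},
    "ds2": {"orders": {"id": "BIGINT"}},
})
```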
Preferably, organizing the computing engines and completing the engine initialization mapping in step S2 proceeds as follows:
S201, the data source set used in temporary table declarations is declared in each mainstream computing engine and denoted SET_DS_ENGINE;
S202, the keywords, field types and functions in the mainstream computing engines are organized and denoted SET_KEYWORD_ENGINE, SET_FIELDTYPE_ENGINE and SET_FUN_ENGINE respectively;
S203, during the organizing process, the mapping between the original data source set SET_DS_ORGIN and the data source set SET_DS_ENGINE in the computing engine is established and denoted MAP_DS_ORGIN_ENGINE; at the same time, the keywords, field types and functions in SET_DS_ORGIN are mapped to the corresponding information in SET_DS_ENGINE, and the mapping as a whole is denoted MAP_METADATA_ORGIN_ENGINE;
S204, MAP_DS_ORGIN_META and MAP_DS_ORGIN_ENGINE are cross-compared to obtain MAP_DS_META_ENGINE, the mapping between the unified metadata and the data source information in the computing engine; likewise, MAP_METADATA_ORGIN_META and MAP_METADATA_ORGIN_ENGINE are cross-compared to obtain MAP_METADATA_META_ENGINE, the mapping to the metadata information in the computing engine.
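The cross-comparison of step S204 is essentially a join of two maps on their shared origin-side keys. A minimal sketch, assuming the maps are plain dictionaries keyed by origin identifiers:

```python
def cross_compare(map_orgin_meta, map_orgin_engine):
    """Join MAP_DS_ORGIN_META (origin -> unified) with MAP_DS_ORGIN_ENGINE
    (origin -> engine) on their origin keys, yielding MAP_DS_META_ENGINE
    (unified -> engine); the metadata mapping is derived the same way."""
    return {meta_key: map_orgin_engine[orgin_key]
            for orgin_key, meta_key in map_orgin_meta.items()
            if orgin_key in map_orgin_engine}

# Illustrative keys only; real entries would hold connection/type details.
MAP_DS_META_ENGINE = cross_compare(
    {"mysql.orders": "meta.orders", "hive.logs": "meta.logs"},
    {"mysql.orders": "engine.orders_tmp", "hive.logs": "engine.logs_tmp"},
)
```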
Preferably, parsing the SQL and selecting a computing engine to execute it in step S3 proceeds as follows:
S301, obtaining the various computing engine selection strategies;
S302, selecting a computing engine to execute: the query SQL input by the user, denoted USER_SQL_META, is parsed into an abstract syntax tree; the abstract syntax tree is checked against the metadata management layer and, once the check passes, converted into a relational algebra tree; the SQL intention is obtained from the relational algebra tree and an engine is selected for execution; the metadata check covers the tables, fields and functions in the metadata and verifies their validity.
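The metadata check in step S302 can be sketched as a lookup of every referenced table and field in SET_TABLE_META before the syntax tree is converted to relational algebra. The function name, argument shapes and error messages below are illustrative assumptions:

```python
def validate_against_metadata(tables, fields, set_table_meta):
    """Verify that every table and every (table, field) pair referenced by the
    parsed USER_SQL_META exists in SET_TABLE_META; raise on the first miss."""
    for tbl in tables:
        if tbl not in set_table_meta:
            raise ValueError(f"unknown table: {tbl}")
    for tbl, field in fields:
        if field not in set_table_meta.get(tbl, {}):
            raise ValueError(f"unknown field: {tbl}.{field}")
    return True

# Illustrative catalog and a query referencing one table and one field.
CATALOG = {"ds1_orders": {"id": "INT", "amount": "DECIMAL"}}
OK = validate_against_metadata(["ds1_orders"], [("ds1_orders", "amount")], CATALOG)
```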
More preferably, the computing engine selection strategy in step S301 specifically includes the following cases:
(1) random selection mode: each time an SQL statement is executed, a computing engine is selected at random to execute it and return the result;
(2) polling mode: each time an SQL statement is executed, the next engine in the order of the computing engines configured by the user is selected to execute it and return the result;
(3) weight mode: according to the computing engine weights configured by the user, the larger the weight value, the higher the probability that the engine is selected;
(4) intelligent selection mode: the currently relatively idle computing engine is selected for execution according to the running load of each Spark and Flink cluster;
(5) user-defined extension: the user sets engine selection rules that are parsed automatically; specifically, the user designates a computing engine, the computation runs on the designated engine, and the result is returned; alternatively, the Flink computing engine is used when the user configures highly real-time streaming data related to Kafka.
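The first three strategies above (random, polling, weight) can be sketched with only the Python standard library; the class name and interface are invented for illustration, and the intelligent and user-defined modes are omitted because they would require live cluster metrics and rule parsing:

```python
import itertools
import random

class EngineSelector:
    """Pick a computing engine per query: random choice, round-robin polling
    over the user-configured order, or weighted random selection."""
    def __init__(self, engines, weights=None, seed=None):
        self.engines = list(engines)
        self.weights = weights
        self._cycle = itertools.cycle(self.engines)   # user-configured polling order
        self._rand = random.Random(seed)

    def pick(self, mode="polling"):
        if mode == "random":
            return self._rand.choice(self.engines)
        if mode == "weight":                          # larger weight, higher odds
            return self._rand.choices(self.engines, weights=self.weights, k=1)[0]
        return next(self._cycle)                      # default: polling

selector = EngineSelector(["spark", "flink"], weights=[3, 1], seed=0)
```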
More preferably, selecting a computing engine to execute in step S302 proceeds as follows:
S30201, extracting the data source information from the relational algebra tree and judging whether it involves a single data source:
if the query involves a single data source, the original data source is connected directly via JDBC to execute the query and return the result;
S30202, executing the SQL using Apache Calcite: the original relational algebra tree is optimized based on relational algebra optimization theory; after optimization, the logical plan is converted into a physical execution plan through rules, the physical plan is converted into a Linq expression for execution, and the result is returned;
S30203, converting to Flink SQL for execution: the metadata management layer is used to automatically convert the user input SQL into the format required by Flink SQL and to automatically complete the temporary table declarations Flink SQL requires, eliminating DDL differences so that the user does not need to define the data to be accessed as a temporary table via keywords such as WITH; specifically:
S3020301, the Flink SQL declaration information is generated as follows:
(1) the database table information in the relational algebra tree is extracted, one piece of it is taken and denoted FLINK_TABLE_META;
(2) the data source information corresponding to the table is found in SET_TABLE_META and denoted FLINK_DS_META;
(3) FLINK_DS_META is converted into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and denoted FLINK_DS_ENGINE;
(4) the database table field information corresponding to the table is found in SET_TABLE_META and denoted FLINK_FIELD_META;
(5) FLINK_FIELD_META is converted into the data source format of the corresponding engine according to the mapping relation in MAP_METADATA_META_ENGINE, and denoted FLINK_METADATA_ENGINE;
(6) the temporary table declaration is completed by combining FLINK_DS_ENGINE and FLINK_METADATA_ENGINE, and the declaration information is denoted FLINK_TABLE_STATEMENT;
S3020302, an SQL statement recognizable by Flink SQL is generated from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and denoted FLINK_USER_SQL;
S3020303, FLINK_TABLE_STATEMENT and FLINK_USER_SQL are passed to the Flink SQL engine for execution;
S3020304, the execution result is packaged and returned;
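The temporary table declaration assembled in S3020301 amounts to a Flink SQL CREATE TEMPORARY TABLE statement whose column list comes from the converted field information and whose WITH clause comes from the converted data source information. A sketch of that string assembly, with connector properties invented for illustration:

```python
def flink_table_statement(table, fields, connector_props):
    """Combine field info (the converted field side) and data source info (the
    converted connector side) into one temporary table declaration string."""
    cols = ",\n  ".join(f"{name} {ftype}" for name, ftype in fields)
    props = ",\n  ".join(f"'{k}' = '{v}'" for k, v in connector_props.items())
    return f"CREATE TEMPORARY TABLE {table} (\n  {cols}\n) WITH (\n  {props}\n)"

stmt = flink_table_statement(
    "ds1_orders",
    [("id", "BIGINT"), ("amount", "DECIMAL(10, 2)")],
    {"connector": "jdbc", "table-name": "orders"},  # illustrative properties
)
```

Because the declaration is generated from the metadata management layer, the user never writes the WITH clause by hand, which is exactly the DDL difference the method sets out to eliminate.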
S30204, converting to Spark SQL for execution: the metadata management layer is used to automatically convert the user input SQL into the format required by Spark SQL and to automatically complete the temporary table declarations Spark SQL requires, eliminating DDL differences so that the user does not need to define the data to be accessed as a temporary table via keywords such as USING; specifically:
S3020401, the Spark SQL declaration information is generated as follows:
(1) the database table information in the relational algebra tree is extracted, one piece of it is taken and denoted SPARK_TABLE_META;
(2) the data source information corresponding to the table is found in SET_TABLE_META and denoted SPARK_DS_META;
(3) SPARK_DS_META is converted into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and denoted SPARK_DS_ENGINE;
(4) the database table field information corresponding to the table is found in SET_TABLE_META and denoted SPARK_FIELD_META;
(5) SPARK_FIELD_META is converted into the data source format of the corresponding engine according to the mapping relation in MAP_METADATA_META_ENGINE, and denoted SPARK_METADATA_ENGINE;
(6) the temporary table declaration is completed by combining SPARK_DS_ENGINE and SPARK_METADATA_ENGINE, and the declaration information is denoted SPARK_TABLE_STATEMENT;
S3020402, an SQL statement recognizable by Spark SQL is generated from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and denoted SPARK_USER_SQL;
S3020403, SPARK_TABLE_STATEMENT and SPARK_USER_SQL are passed to the Spark SQL engine for execution;
S3020404, the execution result is packaged and returned.
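Analogously, the Spark temporary table declaration from S3020401 maps onto Spark SQL's CREATE TEMPORARY VIEW ... USING ... OPTIONS form. Again a sketch with invented option values (the JDBC URL and data source name are illustrative assumptions):

```python
def spark_table_statement(view, source, options):
    """Declare a Spark temporary view over an external data source via
    USING ... OPTIONS, so the user never writes per-source DDL."""
    opts = ",\n  ".join(f'{k} "{v}"' for k, v in options.items())
    return f"CREATE TEMPORARY VIEW {view}\nUSING {source}\nOPTIONS (\n  {opts}\n)"

stmt = spark_table_statement(
    "ds1_orders",
    "org.apache.spark.sql.jdbc",
    {"url": "jdbc:mysql://10.0.0.1:3306/sales", "dbtable": "orders"},
)
```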
An Apache Calcite-based cross-data-source hybrid engine query system for implementing the Apache Calcite-based cross-data-source hybrid engine query method above; the system comprises a data source layer, a metadata management layer, a hybrid engine layer, a parsing layer and a user interface layer, where a complete metadata management layer is established on top of the underlying data source layer, multiple computing engines are supported through the hybrid engine layer, the parsing layer parses the unified SQL input by the user and selects an engine to execute it, and the user interface layer provides SQL input and query result return;
the data source layer divides the databases into relational databases, non-relational databases, Hadoop-based databases and NewSQL-direction databases; the relational databases include Oracle, MySQL and SQL Server; the non-relational databases include Redis, ES and MongoDB; the Hadoop-based databases include Hive and HBase; the NewSQL-direction databases include ClickHouse and GP;
the metadata management layer unifies the data sources, keywords, field types and functions;
the hybrid engine layer organizes the computing engines and completes the computing engine initialization mapping; the computing engines include JDBC, Apache Calcite, Spark SQL and Flink SQL;
the parsing layer parses and optimizes the SQL and selects a computing engine for execution according to the computing engine selection strategy;
the user interface layer allows users to input query SQL and returns the mapped query results.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has a computer program stored thereon;
the at least one processor executes the computer program stored in the memory, causing the at least one processor to perform the Apache Calcite-based cross-data source hybrid engine query method described above.
A computer readable storage medium having stored therein a computer program executable by a processor to implement an Apache Calcite-based cross-data source hybrid engine query method as described above.
Apache Calcite is a foundational framework that provides standard SQL, multiple query optimizations and connections to various data sources, enabling users to easily access various data through SQL queries. In addition, Calcite provides query engines for OLAP and stream processing. Frameworks such as Hive, Flink, Kylin and Druid use Apache Calcite internally.
The cross-data-source hybrid engine query method, system, device and medium of the invention have the following advantages:
first, the invention establishes a complete metadata management layer, provides unified query SQL to the outside, shields the details of the underlying data sources, flexibly supports multiple computing engines, and automatically selects a computing engine for execution according to the parsing and optimization of the SQL;
second, the invention is compatible with multiple computing engines, such as Spark and Flink, without modifying the original engines;
third, the invention applies the eighty-twenty principle through the unified SQL approach: whereas 80% of time is currently spent on the 20% of routine data operations, with unified SQL that 80% of routine data work can be completed in 20% of the time, leaving more time for solving complex work and thinking about the value of the data.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a cross-data source hybrid engine query system based on Apache Calcite;
FIG. 2 is a block flow diagram of a compute engine executing sql.
Detailed Description
The cross-data source hybrid engine query method, system, device and medium of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
This embodiment provides a cross-data-source hybrid engine query method, which specifically comprises the following steps:
S1, establishing a complete database metadata management layer: the database metadata management covers data source connection information, database keywords, field types and functions, and supports data source expansion;
S2, organizing the computing engines to complete engine initialization mapping: the computing engines include Spark and Flink, and the original engines are not modified;
S3, parsing the SQL and selecting a computing engine to execute it: Apache Calcite is used to parse the SQL statement, and the selected computing engine executes the statement and returns the result.
The establishing a complete database metadata management layer in step S1 of this embodiment is specifically as follows:
s101, establishing unified data source information, and setting different data source link parameters according to data source classification;
s102, establishing unified database metadata;
s103, reading all table information in the data source according to different data sources, and loading the table information into a memory.
In step S101 of this embodiment, unified data source information is established, and different data source link parameters are set according to the data source classification as follows:
s10101, carding a converged data source, wherein the method comprises the following steps of:
s10102, combing the database connection mode information of the main stream, specifically: the relational database comprises Oracle, mySQL and SQL Server; the non-relational database includes Redis, ES, mongoDB; the database based on the Hadoop system comprises Hive and HBase; the NewSQL direction database includes ClickHouse, GP;
s10103, combing different types of data source connection parameter information, namely SET_DS_ORGIN, converging to form a unified data source connection pool, namely SET SET_DS_META, wherein the SET SET_DS_META supports flexible expansion;
s10104, mapping the original data source and the unified data source is completed: the mapping of the original data source SET DS ORGIN and the unified data source SET DS META is implemented, denoted MAP DS ORGIN META.
The unified database metadata establishment in step S102 of this embodiment is specifically as follows:
s10201, carding database metadata, specifically: database KEYWORDs in different types of data sources are combed and marked as SET_KEYWORGIN; converging to form unified database KEYWORDs, and recording the unified database KEYWORDs as a set_KEYWORD_META; the field types in different types of data sources are combed and are marked as SET_ FIELDTYPE _ORGIN; converging to form a unified field type, which is recorded as a set_ FIELDTYPE _META; the functions in different types of data sources are combed, and a unified database function SET_FUN_ORGIN is formed through aggregation and is recorded as a SET SET_FUN_META;
s10202, finishing mapping relation between unified metadata of a database and an original database: in the carding process, the mapping relation of the keywords, the field types and the functions in the original data source SET SET_DS_ORGIN and the corresponding information in the SET_DS_META is respectively realized, and the whole mapping relation is marked as mapping MAP_METADA_ORGIN_META.
In step S103 of this embodiment, reading all table information in the data source is completed according to the data source, and the table information is loaded into the memory as follows:
reading database connection information in a data source SET SET_DS_ORGIN, connecting an original database, reading all TABLE information in the original database, wherein all TABLE information in the original database comprises TABLE names, field types and length information, and recording the TABLE information as SET_TABLE_ORGIN; generating TABLE SET information SET_TABLE_META in the unified data source according to the mapping relation MAP_METATA_ORGIN_META; to prevent duplicate table names in different databases, adding a data source identifier to each data source in the data source SET SET_DS, denoted as flg_ds; renaming the data TABLE when the set_TABLE_ORGIN is mapped to the set_TABLE_META according to the MAP_ORGIN_META, and adding a data source identification flag_ds as a prefix; the mapped data TABLE information set_TABLE_META is loaded into the memory, so that the fast extraction is convenient.
The comb computing engine in step S2 of this embodiment, the completion engine initialization map is specifically as follows:
s201, a data source SET in temporary table information is declared in a main stream calculation ENGINE and is recorded as SET_DS_ENGINE;
s202, combing key words, field types and functions in a main stream calculation ENGINE, and respectively recording the key words, the field types and the functions as SET_KEYWORD_ENGINE, SET_ FIELDTYPE _ENGINE and SET_FUN_ENGINE;
s203, in the carding process, respectively realizing the mapping relation between the original data source set_DS_ORGIN and the data source SET set_DS_ENGINE SET in the calculation ENGINE, and recording the mapping relation as mapping MAP_DS_ORGIN_ENGINE; simultaneously completing the mapping relation of the key words, the field types and the functions in the SET_DS_ORGIN and the corresponding information in the SET_DS_ENGIN, and integrally marking the mapping relation as mapping MAP_METATE_ORGIN_ENGINE;
s204, cross-comparing the MAP_DS_ORGIN_META with the MAP_DS_ORGIN_ENGINE to obtain unified metadata and data source information mapping MAP_DS_META_ENGINE in the calculation ENGINE; and according to the cross comparison of MAP_METACAT_ORGIN_META and MAP_METACAT_ORGIN_ENGINE, mapping MAP_METATACAT_METAA_ENGINE with METADATA information in the calculation ENGINE.
In step S3 of this embodiment, the analysis sql is executed by the selection calculation engine as follows:
s301, acquiring various calculation engine selection strategies;
s302, selecting a computing engine to execute: analyzing the query sql input by the user, analyzing the sql input by the user into an abstract syntax tree, checking the abstract syntax tree according to a metadata management layer, converting the abstract syntax tree into a relational algebra tree after the checking is finished, acquiring the sql intention through the relational algebra tree, and selecting an engine for execution; wherein, the query SQL input by the USER is denoted as user_sql_meta; the content of the metadata verification includes verifying tables, fields and functions in the metadata, and verifying the validity of the tables, fields and functions in the metadata.
The calculation engine selection policy in step S301 of the present embodiment specifically includes the following cases:
(1) the random selection mode is as follows: each time sql is executed, a calculation engine is randomly selected to execute, and a result is returned;
(2) polling mode: according to the order of the computing engines configured by the user, selecting the next engine to execute according to the order of the computing engines configured by the user when the sql is executed each time, and returning a result;
(3) the weight mode is as follows: according to the calculation engine weight configured by the user, the larger the numerical value of the calculation engine weight is, the larger the probability of being selected is;
(4) the intelligent selection mode comprises the following steps: selecting a current relatively idle computing engine to execute according to the running performance of each cluster of spark and flink;
(5) user-defined extension: setting engine selection rules by a user and automatically analyzing; the method comprises the following steps: the user designates a calculation engine, calculates according to the engine designated by the user, and returns a result; alternatively, the flink calculation engine is used when the user sets up high real-time streaming data related to kafka.
As shown in fig. 2, the selection calculation engine in step S302 of the present embodiment performs the following steps:
s30201, extracting data source information from relation algebra and judging whether the data source information is single data information or not:
if the data source is queried by a single data source, directly connecting the original data source by using a jdbc mode to execute the query, and returning a result;
s30202, executing sql using Apache Calcite: based on the original relational algebra tree, optimizing based on a relational algebra optimizing theory; after the optimization is completed, converting the logic plan into a physical execution plan through Rule, converting the physical plan into a Linq expression for execution, and returning a result;
s30203, converting to a flinksql execution: the metadata management layer is utilized to automatically complete the conversion of the user input sql into the format required by the flinksql, automatically complete the temporary table information statement required by the flinkql, eliminate the difference of DDL and avoid the need of defining the data to be accessed through the form of a temporary table by keywords such as WITH; the method comprises the following steps:
S3020301, the flinksql declaration information is generated as follows:
(1) extracting the database table information from the relational algebra tree, taking one piece of it, and recording it as flink_table_meta;
(2) finding the data source information corresponding to the table from SET_TABLE_META, and recording it as flink_ds_meta;
(3) converting the flink_ds_meta into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and recording it as flink_ds_engine;
(4) finding the database table field information corresponding to the table from SET_TABLE_META, and recording it as flink_field_meta;
(5) converting the flink_field_meta into the corresponding engine's field format according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as flink_metadata_engine;
(6) completing the temporary table declaration by combining flink_ds_engine and flink_metadata_engine, and recording the declaration information as flink_table_status;
S3020302, generating an SQL statement recognizable by flinksql from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as flink_user_sql;
S3020303, passing the flink_table_status and the flink_user_sql to the flinksql engine for execution;
S3020304, packaging and returning the execution result;
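Steps S3020301 through S3020303 can be sketched as composing the unified metadata (SET_TABLE_META) with the engine mappings (MAP_DS_META_ENGINE, MAP_METADATA_META_ENGINE) to emit the temporary-table declaration that flinksql normally requires via a WITH clause. All table names, field types, and connector options below are invented for illustration; the real mappings would be built in steps S101 to S204.

```python
# Illustrative unified metadata and engine mappings (all contents assumed).
SET_TABLE_META = {
    "mysql1_orders": {                                   # flag_ds prefix + table name
        "ds": {"type": "mysql", "url": "jdbc:mysql://host/db"},
        "fields": {"id": "INT", "amount": "DECIMAL"},
    }
}
MAP_FIELDTYPE_META_ENGINE = {"INT": "INT", "DECIMAL": "DECIMAL(10,2)"}  # unified -> Flink type
MAP_DS_META_ENGINE = {"mysql": "jdbc"}                                  # unified -> Flink connector

def flink_table_statement(table):
    meta = SET_TABLE_META[table]                         # flink_table_meta + flink_ds_meta
    ds_engine = MAP_DS_META_ENGINE[meta["ds"]["type"]]   # flink_ds_engine
    cols = ", ".join(f"{name} {MAP_FIELDTYPE_META_ENGINE[t]}"
                     for name, t in meta["fields"].items())  # flink field info in engine format
    return (f"CREATE TEMPORARY TABLE {table} ({cols}) "
            f"WITH ('connector' = '{ds_engine}', 'url' = '{meta['ds']['url']}')")
```

The declaration string plays the role of flink_table_status: submitted together with the rewritten user SQL, it spares the user from writing the WITH clause by hand.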
S30204, converting to sparksql for execution: the metadata management layer is used to automatically convert the user's input sql into the format required by sparksql and to automatically complete the temporary table declarations sparksql requires, eliminating DDL differences so that the user need not define the data to be accessed as a temporary table via keywords such as USING; specifically:
S3020401, the sparksql declaration information is generated as follows:
(1) extracting the database table information from the relational algebra tree, taking one piece of it, and recording it as spark_table_meta;
(2) finding the data source information corresponding to the table from SET_TABLE_META, and recording it as spark_ds_meta;
(3) converting the spark_ds_meta into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and recording it as spark_ds_engine;
(4) finding the database table field information corresponding to the table from SET_TABLE_META, and recording it as spark_field_meta;
(5) converting the spark_field_meta into the corresponding engine's field format according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as spark_metadata_engine;
(6) completing the temporary table declaration by combining spark_ds_engine and spark_metadata_engine, and recording the declaration information as spark_table_status;
S3020402, generating an SQL statement recognizable by sparksql from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as spark_user_sql;
S3020403, passing the spark_table_status and the spark_user_sql to the sparksql engine for execution;
S3020404, packaging and returning the execution result.
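The flinksql and sparksql paths above share one shape: generate the engine-specific table declaration, rewrite USER_SQL_META via the metadata mapping, submit both to the engine, and package the result. A minimal dispatcher sketch, where the three callbacks are hypothetical stand-ins for the declaration generator, the SQL rewriter, and the flinksql/sparksql submission client:

```python
def run_hybrid_query(user_sql, engine, declare_fn, rewrite_fn, submit_fn):
    """Shared shape of steps S30203/S30204 (illustrative, not the patented code)."""
    table_statement = declare_fn(user_sql, engine)   # step (6): temp-table declaration
    engine_sql = rewrite_fn(user_sql, engine)        # S3020302 / S3020402: rewrite via mapping
    rows = submit_fn(engine, table_statement, engine_sql)  # S3020303 / S3020403: execute
    return {"engine": engine, "rows": rows}          # S3020304 / S3020404: package result
```
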
Example 2:
As shown in fig. 1, this embodiment provides a cross-data-source hybrid engine query system based on Apache Calcite, which is used to implement the Apache Calcite-based cross-data-source hybrid engine query method of embodiment 1; the system comprises a data source layer, a metadata management layer, a hybrid engine layer, an analysis layer and a user interface layer, wherein a complete metadata management layer is established on top of the underlying data source layer, multiple computing engines are supported through the hybrid engine layer, the analysis layer parses the unified sql input by the user and selects an engine to compute and execute the sql, and the user interface layer provides the functions of user sql input and query result return;
the data source layer is used for classifying databases into relational databases, non-relational databases, Hadoop-based databases and NewSQL-oriented databases; the relational databases comprise Oracle, MySQL and SQL Server; the non-relational databases comprise Redis, ES and MongoDB; the Hadoop-based databases comprise Hive and HBase; the NewSQL-oriented databases comprise ClickHouse and GP;
the metadata management layer is used for unifying data sources, keywords, field types and functions;
the hybrid engine layer is used for combing the computing engines and completing the computing engine initialization mapping; the computing engines comprise jdbc, Apache Calcite, sparksql and flinksql;
the analysis layer is used for parsing the sql, optimizing the sql, and selecting a computing engine for execution according to the computing engine selection strategy;
the user interface layer is used for the user to input the query sql and for mapping the query result back.
Example 3:
the embodiment also provides an electronic device, including: a memory and a processor;
wherein the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, so that the processor executes the cross-data source hybrid engine query method in any embodiment of the invention.
The processor may be a Central Processing Unit (CPU), but may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor may be a microprocessor, or any conventional processor.
The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to the use of the terminal, and the like. The memory may also include high-speed random access memory, and may further include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state memory device.
Example 4:
the present embodiment also provides a computer readable storage medium having stored therein a plurality of instructions, the instructions being loaded by a processor, to cause the processor to perform the method for querying a cross-data source hybrid engine in any of the embodiments of the present invention. Specifically, a system or apparatus provided with a storage medium on which a software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of storage media for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tapes, non-volatile memory cards, and ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it is understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into a computer or into a memory provided in an expansion unit connected to the computer, and a CPU or the like mounted on the expansion board or expansion unit may then be caused to perform part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (10)
1. An Apache Calcite-based cross-data source hybrid engine query method is characterized by comprising the following steps of:
S1, establishing a complete database metadata management layer: the database metadata management comprises data source connection information, database keywords, field types and functions, and supports data source expansion;
S2, combing the computing engines to complete the engine initialization mapping;
S3, parsing the sql and selecting a computing engine to execute: the sql statement is parsed using calcite, executed by the selected computing engine, and the result is returned.
2. The Apache Calcite-based cross-data source hybrid engine query method according to claim 1, wherein establishing a complete database metadata management layer in step S1 is specifically as follows:
S101, establishing unified data source information, and setting different data source connection parameters according to the data source classification;
S102, establishing unified database metadata;
S103, reading all table information in the data sources per data source, and loading it into memory.
3. The Apache Calcite-based cross-data source hybrid engine query method according to claim 2, wherein in step S101, unified data source information is established and different data source connection parameters are set according to the data source classification, specifically as follows:
S10101, combing and converging the data sources, comprising the following steps:
S10102, combing the connection information of mainstream databases, specifically: the relational databases comprise Oracle, MySQL and SQL Server; the non-relational databases comprise Redis, ES and MongoDB; the Hadoop-based databases comprise Hive and HBase; the NewSQL-oriented databases comprise ClickHouse and GP;
S10103, combing the connection parameter information of the different types of data sources, denoted SET_DS_ORGIN, and converging it to form a unified data source connection pool, denoted the set SET_DS_META, which supports expansion;
S10104, completing the mapping between the original data sources and the unified data source: the mapping between the original data source set SET_DS_ORGIN and the unified data source set SET_DS_META is established and recorded as MAP_DS_ORGIN_META;
the unified database metadata established in step S102 is specifically as follows:
S10201, combing the database metadata, specifically: the database keywords in the different types of data sources are combed and recorded as SET_KEYWORD_ORGIN, then converged to form unified database keywords, recorded as SET_KEYWORD_META; the field types in the different types of data sources are combed and recorded as SET_FIELDTYPE_ORGIN, then converged to form unified field types, recorded as SET_FIELDTYPE_META; the functions in the different types of data sources are combed and recorded as SET_FUN_ORGIN, then aggregated to form unified database functions, recorded as the set SET_FUN_META;
S10202, completing the mapping relation between the unified database metadata and the original databases: during the combing process, the mapping relations between the keywords, field types and functions in the original data source set SET_DS_ORGIN and the corresponding information in SET_DS_META are respectively established, and are collectively recorded as the mapping MAP_METADATA_ORGIN_META;
in step S103, all table information in the data sources is read per data source and loaded into memory, specifically as follows:
reading the database connection information in the data source set SET_DS_ORGIN, connecting to the original databases, and reading all table information in the original databases, including table names, field types and length information, recorded as SET_TABLE_ORGIN; generating the table set information SET_TABLE_META for the unified data source according to the mapping relation MAP_METADATA_ORGIN_META; adding a data source identifier, recorded as flag_ds, to each data source in the data source set SET_DS_ORGIN; renaming each data table when SET_TABLE_ORGIN is mapped to SET_TABLE_META according to MAP_METADATA_ORGIN_META, adding the data source identifier flag_ds as a prefix; and loading the mapped data table information SET_TABLE_META into memory.
4. The Apache Calcite-based cross-data source hybrid engine query method according to claim 1, wherein combing the computing engines and completing the engine initialization mapping in step S2 is specifically as follows:
S201, combing the data source sets declared in the temporary table declarations of the computing engines, recorded as SET_DS_ENGINE;
S202, combing the keywords, field types and functions in the computing engines, recorded respectively as SET_KEYWORD_ENGINE, SET_FIELDTYPE_ENGINE and SET_FUN_ENGINE;
S203, during the combing process, establishing the mapping relation between the original data source set SET_DS_ORGIN and the in-engine data source set SET_DS_ENGINE, recorded as the mapping MAP_DS_ORGIN_ENGINE; meanwhile, completing the mapping relations between the keywords, field types and functions in SET_DS_ORGIN and the corresponding information in SET_DS_ENGINE, collectively recorded as the mapping MAP_METADATA_ORGIN_ENGINE;
S204, cross-comparing MAP_DS_ORGIN_META with MAP_DS_ORGIN_ENGINE to obtain the mapping MAP_DS_META_ENGINE between the unified metadata and the in-engine data source information; and cross-comparing MAP_METADATA_ORGIN_META with MAP_METADATA_ORGIN_ENGINE to obtain the mapping MAP_METADATA_META_ENGINE between the unified metadata and the in-engine metadata information.
5. The Apache Calcite-based cross-data source hybrid engine query method according to claim 1, wherein parsing the sql and selecting a computing engine to execute in step S3 is specifically as follows:
S301, acquiring the various computing engine selection strategies;
S302, selecting a computing engine to execute: the query sql input by the user is parsed into an abstract syntax tree, the abstract syntax tree is checked against the metadata management layer and, once the check passes, converted into a relational algebra tree; the sql intention is obtained from the relational algebra tree and an engine is selected for execution; the query SQL input by the user is denoted USER_SQL_META; the metadata check verifies the validity of the tables, fields and functions in the metadata.
6. The method for querying a cross-data source hybrid engine based on Apache Calcite according to claim 5, wherein the computing engine selection policy in step S301 specifically includes the following cases:
(1) random selection mode: each time sql is executed, a computing engine is randomly selected to execute, and the result is returned;
(2) polling mode: each time sql is executed, the next engine in the order of the computing engines configured by the user is selected to execute, and the result is returned;
(3) weight mode: engines are chosen according to the computing engine weights configured by the user; the larger the weight value, the higher the probability of being selected;
(4) intelligent selection mode: according to the running performance of each spark and flink cluster, the currently least-loaded computing engine is selected to execute;
(5) user-defined extension: the user sets engine selection rules, which are parsed automatically; specifically, the user designates a computing engine, the query is computed on that engine, and the result is returned; alternatively, the flink computing engine is used when the user marks kafka-related streaming data as requiring high real-time performance.
7. The Apache Calcite-based cross-data source hybrid engine query method according to claim 5, wherein selecting a computing engine to execute in step S302 specifically comprises the following steps:
S30201, extracting the data source information from the relational algebra tree and judging whether the query involves only a single data source:
if the query involves only a single data source, directly connecting to the original data source via jdbc to execute the query, and returning the result;
S30202, executing the sql with Apache Calcite: the original relational algebra tree is optimized according to relational algebra optimization theory; after optimization, the logical plan is converted into a physical execution plan via Calcite rules, the physical plan is converted into a Linq expression for execution, and the result is returned;
S30203, converting to flinksql for execution: the metadata management layer is used to automatically convert the user's input sql into the format required by flinksql and to automatically complete the temporary table declarations flinksql requires, eliminating DDL differences; specifically:
S3020301, the flinksql declaration information is generated as follows:
(1) extracting the database table information from the relational algebra tree, taking one piece of it, and recording it as flink_table_meta;
(2) finding the data source information corresponding to the table from SET_TABLE_META, and recording it as flink_ds_meta;
(3) converting the flink_ds_meta into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and recording it as flink_ds_engine;
(4) finding the database table field information corresponding to the table from SET_TABLE_META, and recording it as flink_field_meta;
(5) converting the flink_field_meta into the corresponding engine's field format according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as flink_metadata_engine;
(6) completing the temporary table declaration by combining flink_ds_engine and flink_metadata_engine, and recording the declaration information as flink_table_status;
S3020302, generating an SQL statement recognizable by flinksql from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as flink_user_sql;
S3020303, passing the flink_table_status and the flink_user_sql to the flinksql engine for execution;
S3020304, packaging and returning the execution result;
S30204, converting to sparksql for execution: the metadata management layer is used to automatically convert the user's input sql into the format required by sparksql and to automatically complete the temporary table declarations sparksql requires, eliminating DDL differences; specifically:
S3020401, the sparksql declaration information is generated as follows:
(1) extracting the database table information from the relational algebra tree, taking one piece of it, and recording it as spark_table_meta;
(2) finding the data source information corresponding to the table from SET_TABLE_META, and recording it as spark_ds_meta;
(3) converting the spark_ds_meta into the data source format of the corresponding engine according to the mapping relation in MAP_DS_META_ENGINE, and recording it as spark_ds_engine;
(4) finding the database table field information corresponding to the table from SET_TABLE_META, and recording it as spark_field_meta;
(5) converting the spark_field_meta into the corresponding engine's field format according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as spark_metadata_engine;
(6) completing the temporary table declaration by combining spark_ds_engine and spark_metadata_engine, and recording the declaration information as spark_table_status;
S3020402, generating an SQL statement recognizable by sparksql from USER_SQL_META according to the mapping relation in MAP_METADATA_META_ENGINE, and recording it as spark_user_sql;
S3020403, passing the spark_table_status and the spark_user_sql to the sparksql engine for execution;
S3020404, packaging and returning the execution result.
8. An Apache Calcite-based cross-data source hybrid engine query system, characterized in that the system is used for realizing the Apache Calcite-based cross-data source hybrid engine query method according to any one of claims 1-7; the system comprises a data source layer, a metadata management layer, a hybrid engine layer, an analysis layer and a user interface layer, wherein a complete metadata management layer is established on top of the underlying data source layer, multiple computing engines are supported through the hybrid engine layer, the analysis layer parses the unified sql input by the user and selects an engine to compute and execute the sql, and the user interface layer provides the functions of user sql input and query result return;
the data source layer is used for classifying databases into relational databases, non-relational databases, Hadoop-based databases and NewSQL-oriented databases; the relational databases comprise Oracle, MySQL and SQL Server; the non-relational databases comprise Redis, ES and MongoDB; the Hadoop-based databases comprise Hive and HBase; the NewSQL-oriented databases comprise ClickHouse and GP;
the metadata management layer is used for unifying data sources, keywords, field types and functions;
the hybrid engine layer is used for combing the computing engines and completing the computing engine initialization mapping; the computing engines comprise jdbc, Apache Calcite, sparksql and flinksql;
the analysis layer is used for parsing the sql, optimizing the sql, and selecting a computing engine for execution according to the computing engine selection strategy;
the user interface layer is used for the user to input the query sql and for mapping the query result back.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has a computer program stored thereon;
the at least one processor executing the memory-stored computer program causes the at least one processor to perform the Apache Calcite-based cross-data source hybrid engine query method of any one of claims 1 to 7.
10. A computer readable storage medium having stored therein a computer program executable by a processor to implement the Apache Calcite-based cross-data source hybrid engine query method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311698474.3A CN117668050A (en) | 2023-12-12 | 2023-12-12 | Cross-data-source hybrid engine query method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117668050A true CN117668050A (en) | 2024-03-08 |
Family
ID=90080589
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||