CN117493429A

CN117493429A - Processing system and method for heterogeneous data joint query

Info

Publication number: CN117493429A
Application number: CN202311320529.7A
Authority: CN
Inventors: 裴衡
Original assignee: QIMING INFORMATION TECHNOLOGY CO LTD
Current assignee: QIMING INFORMATION TECHNOLOGY CO LTD
Priority date: 2023-10-12
Filing date: 2023-10-12
Publication date: 2024-02-02

Abstract

The invention discloses a processing system and a processing method for heterogeneous data joint query. The system comprises a data source connection and access module, a data model and grammar difference module, a query optimization and performance module, a data transmission and integration module, a security and authority control module, and an exception handling and fault tolerance mechanism module. Through data integration, greater query flexibility, resource optimization, a unified query interface, real-time query and analysis, and extensibility and compatibility, the system and method improve the efficiency and accuracy of data processing and give users a better data query and analysis experience.

Description

Processing system and method for heterogeneous data joint query

Technical Field

The invention relates to the field of database management and data query, in particular to a processing system and a processing method for heterogeneous data joint query.

Background

In modern enterprises and organizations, it has become commonplace to use several different database engines and clusters to manage and store data. However, running federated queries directly on heterogeneous data is challenging, and the prior art has the following shortcomings:

Data replication and data integration: current methods typically copy heterogeneous data into a central data warehouse or data lake and then perform joint queries in the central store. This requires extensive data copying and synchronization, consumes storage space and network bandwidth, and limits how fresh the queried data can be.

ETL (extract, transform, load) processes: many organizations use ETL tools to extract heterogeneous data into a unified format for joint querying. However, ETL processes are complex and time consuming, and they require data transformation rules and workflows to be defined and maintained. In addition, ETL runs in batches, so it cannot meet real-time query requirements.

Database linking and cross-engine query: some database engines can link to other engines, making it possible to access multiple engines in one query. However, such links are often limited: they apply only to specific engines or to a restricted set of data operations, and they cannot meet complex federated query requirements. In addition, the query grammar and functions differ from engine to engine, which makes query statements hard to write and debug.

Data model differences: heterogeneous data typically has different data structures, data models, and query grammars. For example, relational databases and NoSQL databases differ in how they handle structured and unstructured data, which makes direct joint queries difficult. Current methods typically require data model conversion and mapping, which adds development and maintenance costs.

Query performance and optimization: heterogeneous data joint queries often involve multiple data sources and complex query plans, and their performance may be affected by data transmission, network delays, and data processing. In the prior art, query optimization is usually performed for a single data source, so the advantages of the different data sources cannot be fully exploited and a global optimization strategy is lacking.

Thus, the prior art has several drawbacks when handling federated queries over heterogeneous data: the complexity of data replication and integration, the limitations of ETL flows, the limitations of database linking, the handling of data model differences, and the optimization of query performance.

In summary, although existing federated query techniques and tools provide useful support, there is still no system that helps users query and integrate data efficiently in heterogeneous database environments while increasing the flexibility and performance of data processing.

Disclosure of Invention

The invention aims to solve the technical problems of data source connection and access, data model and grammar difference, query optimization and performance, data transmission and integration, security and authority control and exception handling and fault tolerance mechanisms, and provides a processing system and method for heterogeneous data joint query.

A processing system for heterogeneous data joint query comprises a data source connection and access module, a data model and grammar difference module, a query optimization and performance module, a data transmission and integration module, a security and authority control module, an exception handling and fault tolerance mechanism module:

the data source connection and access module connects to each database through a database connection driver and configures the connection parameters for that database;

the data model and grammar difference module handles the different data models and query grammars used by different database engines and clusters;

the query optimization and performance module parses query statements and generates a query plan by using query analysis and optimization techniques;

the data transmission and integration module is used for carrying out data transmission, combination and integration among different data sources;

the security and authority control module is used for ensuring the security of data and perfecting the authority control mechanism of each data source when the cross-data source inquiry is carried out;

the exception handling and fault tolerance mechanism module functions to implement exception handling and fault tolerance mechanisms, ensuring correct execution of queries and results.

Further, the processing system for the heterogeneous data joint query is characterized in that the data source connection and access module comprises a data source connection sub-module and a data source access sub-module;

the data source connection submodule is used for supporting connection of different database engines and clusters by using a database connection driver or an API;

the data source access submodule is used for configuring connection parameters and authority verification information for each data source and realizing safe access to data.
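As a rough illustration of how the two sub-modules might divide the work, the sketch below keeps one set of connection parameters and credentials per data source. The names `DataSourceRegistry` and `DataSourceConfig` and the sample URL are hypothetical and do not come from the patent; the only real API assumed is that `java.sql.DriverManager.getConnection(String, Properties)` accepts `user` and `password` properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical registry: one set of connection parameters per data source.
final class DataSourceRegistry {

    // Connection parameters and authentication information for one data source.
    static final class DataSourceConfig {
        final String jdbcUrl;
        final String user;
        final String password;

        DataSourceConfig(String jdbcUrl, String user, String password) {
            this.jdbcUrl = jdbcUrl;
            this.user = user;
            this.password = password;
        }

        // Properties in the form accepted by DriverManager.getConnection(url, props).
        Properties toJdbcProperties() {
            Properties props = new Properties();
            props.setProperty("user", user);
            props.setProperty("password", password);
            return props;
        }
    }

    private final Map<String, DataSourceConfig> sources = new LinkedHashMap<>();

    // Registers a data source under a unique name.
    void register(String name, DataSourceConfig config) {
        if (sources.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate data source: " + name);
        }
        sources.put(name, config);
    }

    // Returns the configuration for a name, or fails loudly if it is unknown.
    DataSourceConfig lookup(String name) {
        DataSourceConfig config = sources.get(name);
        if (config == null) {
            throw new IllegalArgumentException("Unknown data source: " + name);
        }
        return config;
    }

    public static void main(String[] args) {
        DataSourceRegistry registry = new DataSourceRegistry();
        registry.register("orders",
                new DataSourceConfig("jdbc:mysql://db1.example.com:3306/orders", "reader", "secret"));
        System.out.println(registry.lookup("orders").jdbcUrl);
    }
}
```

A caller would pass `lookup(name).jdbcUrl` and `toJdbcProperties()` to the JDBC driver; keeping credentials per source is one way to respect each source's own access rules.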

Further, the processing system for the heterogeneous data joint query is characterized in that the data model and grammar difference module comprises a data model sub-module and a grammar difference sub-module;

the data model submodule functions are used for establishing a unified data model, mapping data of different data sources into the unified model and eliminating data model differences;

the grammar difference submodule functions are used for developing a cross-data-source query grammar converter, converting query sentences from one grammar to another grammar and realizing the requirements of different database engines.
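To make the idea of a query grammar converter concrete, here is a deliberately small sketch that rewrites a trailing MySQL-style `LIMIT n` clause into the standard `FETCH FIRST n ROWS ONLY` form. The class name `LimitClauseConverter` is hypothetical, and the regular-expression approach is an assumption chosen for brevity; the patent does not say how its converter works, and a realistic converter would most likely operate on a parsed syntax tree.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-rule converter between two SQL dialects.
final class LimitClauseConverter {

    // Matches a LIMIT clause only when it is the last clause of the statement.
    private static final Pattern TRAILING_LIMIT =
            Pattern.compile("\\s+LIMIT\\s+(\\d+)\\s*$", Pattern.CASE_INSENSITIVE);

    // Rewrites "... LIMIT n" as "... FETCH FIRST n ROWS ONLY"; other SQL is returned unchanged.
    static String toFetchFirst(String sql) {
        Matcher m = TRAILING_LIMIT.matcher(sql);
        if (!m.find()) {
            return sql;
        }
        return sql.substring(0, m.start()) + " FETCH FIRST " + m.group(1) + " ROWS ONLY";
    }

    public static void main(String[] args) {
        // prints "SELECT id FROM t FETCH FIRST 10 ROWS ONLY"
        System.out.println(toFetchFirst("SELECT id FROM t LIMIT 10"));
    }
}
```

The sketch ignores `LIMIT offset, n`, subqueries, and string literals that happen to contain the word LIMIT, which is why a parser-based rewrite is the safer design for real workloads.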

Further, the processing system for the heterogeneous data joint query is characterized in that the query optimization and performance module comprises a query performance improvement sub-module and an accelerated query operation sub-module;

the query performance improvement sub-module selects the best data source and query plan by using the characteristics and index information of each data source;

the accelerated query operation submodule is used for distributing the query task to each data source and executing the query task in parallel.
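Distributing sub-queries and running them in parallel could look roughly like the following. This is a sketch under stated assumptions: `ParallelQueryDispatcher` is a hypothetical name, each data source is represented by a `Callable` that returns its rows as strings, and the merge step is a plain concatenation in task order rather than a join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical dispatcher: one task per data source, executed concurrently.
final class ParallelQueryDispatcher {

    // Runs every per-source query on its own thread and concatenates the rows in task order.
    static List<String> runAll(List<Callable<List<String>>> perSourceQueries) {
        if (perSourceQueries.isEmpty()) {
            return new ArrayList<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(perSourceQueries.size());
        try {
            List<String> merged = new ArrayList<>();
            // invokeAll blocks until all tasks finish and returns futures in task order.
            for (Future<List<String>> future : pool.invokeAll(perSourceQueries)) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for sub-queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A sub-query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        tasks.add(() -> List.of("mysql-row-1", "mysql-row-2"));
        tasks.add(() -> List.of("hive-row-1"));
        System.out.println(runAll(tasks));
    }
}
```

One thread per source keeps the example short; a shared, bounded pool would probably be preferable when many queries run at once.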

Further, the processing system for the heterogeneous data joint query is characterized in that the data transmission and integration module comprises a data transmission sub-module and a data integration sub-module;

the data transmission submodule selects a data transmission mode according to the query requirement and the data volume;

the data transmission mode comprises batch transmission and streaming transmission;

the data integration submodule functions to handle large-scale data integration using a distributed computing framework or a data processing engine.
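The choice between batch and streaming transmission can be pictured as a small decision function. The rule below is purely an assumption for illustration: the patent only says the mode depends on the query requirement and the data volume, so the threshold `STREAMING_ROW_THRESHOLD`, the `needsIncrementalResults` flag, and the class name `TransferModeSelector` are all hypothetical.

```java
// Hypothetical selector for the data transmission mode.
final class TransferModeSelector {

    enum TransferMode { BATCH, STREAMING }

    // Assumed cut-off; a real system would tune or configure this value.
    static final long STREAMING_ROW_THRESHOLD = 100_000L;

    // Streams when the caller wants rows as they arrive or when the result is large;
    // otherwise transfers the whole result in one batch.
    static TransferMode choose(long estimatedRows, boolean needsIncrementalResults) {
        if (needsIncrementalResults || estimatedRows > STREAMING_ROW_THRESHOLD) {
            return TransferMode.STREAMING;
        }
        return TransferMode.BATCH;
    }

    public static void main(String[] args) {
        System.out.println(choose(500, false));       // small result, no incremental need
        System.out.println(choose(2_000_000, false)); // large result
    }
}
```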

Further, the processing system for the heterogeneous data joint query is characterized in that the security and authority control module comprises a data source security sub-module and an authority control sub-module;

the data source security submodule performs authority verification and identity verification on cross-data-source queries so as to ensure the security and compliance of the data;

the permission control submodule is used for ensuring that a user can only access the data with permission in query processing, and the security and permission control mechanism of each data source is followed.
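A permission check of this kind might be sketched as follows, assuming access is granted per user and per data source. `SourcePermissionChecker` and its methods are hypothetical names; the patent does not specify the granularity of its permission model, and real deployments would also defer to each source's native access controls.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-user, per-source permission table.
final class SourcePermissionChecker {

    private final Map<String, Set<String>> allowedSourcesByUser = new HashMap<>();

    // Grants a user access to one data source.
    void grant(String user, String dataSource) {
        allowedSourcesByUser.computeIfAbsent(user, u -> new HashSet<>()).add(dataSource);
    }

    // Rejects the query if it touches any data source the user has not been granted.
    void checkQuery(String user, Set<String> sourcesInQuery) {
        Set<String> allowed = allowedSourcesByUser.getOrDefault(user, Collections.emptySet());
        for (String source : sourcesInQuery) {
            if (!allowed.contains(source)) {
                throw new SecurityException("User " + user + " may not query " + source);
            }
        }
    }

    public static void main(String[] args) {
        SourcePermissionChecker checker = new SourcePermissionChecker();
        checker.grant("alice", "orders");
        checker.checkQuery("alice", Set.of("orders")); // passes silently
        System.out.println("alice may query orders");
    }
}
```

Running the check before any sub-query is dispatched means a joint query fails as a whole instead of returning a partial result from the sources the user happens to be allowed to read.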

Further, the processing system for the heterogeneous data joint query is characterized in that the exception handling and fault tolerance mechanism module comprises an exception handling mechanism sub-module and a fault tolerance mechanism sub-module;

the exception handling mechanism submodule is used for capturing and handling exceptions such as connection failure and inconsistent data;

the fault tolerance mechanism submodule functions include retry connection, data recovery or rollback operation.
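The retry part of the fault tolerance mechanism could be sketched as a generic wrapper around any operation that may fail, such as opening a connection. `RetryingExecutor`, the fixed delay, and the attempt limit are assumptions for illustration; the patent names retry, data recovery, and rollback but does not describe a retry policy.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper with a fixed delay between attempts.
final class RetryingExecutor {

    // Calls the action up to maxAttempts times and returns the first successful result.
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the thread is interrupted
                }
            }
        }
        throw new IllegalStateException("Operation failed after retries", lastFailure);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("connection refused");
            }
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying suits idempotent steps such as connecting or re-running a read-only query; operations that modify data would need the rollback path instead.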

A processing method for heterogeneous data joint query comprises the following steps:

S1: determining the requirements of the joint query, namely which data in the databases is to be combined into one result set;

S2: storing the URLs and parameter information of a plurality of databases in a configuration file, and passing these parameters when the joint query is executed;

S3: selecting one of the databases to execute the SQL query, and extracting the required result set;

S4: exporting the result set data into a Markdown file, and parsing it with an online Markdown parser;

S5: selecting another database to execute the same SQL query, and reading the required data;

S6: combining the read data with the exported result set, and updating the data in the Markdown file;

S7: uploading the updated Markdown file to a server, from which a user reads the Markdown file locally.
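The export and merge steps (S4 and S6) can be illustrated with a short sketch that appends the rows read from a second database to the first result set and renders the combined rows as a Markdown table. `MarkdownResultWriter` is a hypothetical name, the pipe-table layout is the GitHub-flavored Markdown extension (assumed here because the patent does not name a Markdown dialect), and the merge is a simple row append that assumes both queries return the same columns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for steps S4 and S6.
final class MarkdownResultWriter {

    // S6: appends the rows read from the second database to the first result set.
    static List<List<String>> merge(List<List<String>> first, List<List<String>> second) {
        List<List<String>> merged = new ArrayList<>(first);
        merged.addAll(second);
        return merged;
    }

    // S4: renders a header row, a separator row, and the data rows as a pipe table.
    // Cell values are written as-is; a "|" inside a value is not escaped in this sketch.
    static String toMarkdownTable(List<String> columns, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("| ").append(String.join(" | ", columns)).append(" |\n");
        sb.append("|");
        for (int i = 0; i < columns.size(); i++) {
            sb.append(" --- |");
        }
        sb.append("\n");
        for (List<String> row : rows) {
            sb.append("| ").append(String.join(" | ", row)).append(" |\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<String>> fromFirstDb = List.of(List.of("1", "alice"));
        List<List<String>> fromSecondDb = List.of(List.of("2", "bob"));
        System.out.print(toMarkdownTable(List.of("id", "name"), merge(fromFirstDb, fromSecondDb)));
    }
}
```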

The invention has the beneficial effects that: by the processing system and the processing method for the heterogeneous data joint query, the efficiency and the accuracy of data processing are improved, and better data query and analysis experience is provided for users through the advantages of data integration, query flexibility improvement, resource optimization, unified query interface, real-time query and analysis, expansibility and compatibility and the like.

Drawings

Fig. 1 is a system configuration diagram of the present invention.

Fig. 2 is a system service architecture layer of the present invention.

Fig. 3 is a flow chart of the method of the present invention.

Detailed Description

For a clearer understanding of technical features, objects, and effects of the present invention, a specific embodiment of the present invention will be described with reference to the accompanying drawings.

As shown in fig. 1, a processing system for heterogeneous data joint query comprises a data source connection and access module, a data model and grammar difference module, a query optimization and performance module, a data transmission and integration module, a security and authority control module, and an exception handling and fault tolerance mechanism module:

Heterogeneous data is usually stored in different database engines and clusters, each with its own connection and access mode. The technical problem is how to establish connections to heterogeneous data sources so that data can be acquired and query operations executed effectively. To this end, the data source connection and access module comprises a data source connection sub-module and a data source access sub-module;

Different database engines and clusters have different data models and query grammars. The technical problem is how to process and unify these differences among data sources so that joint queries can be performed. To this end, the data model and grammar difference module comprises a data model sub-module and a grammar difference sub-module;

A heterogeneous data joint query can involve a plurality of data sources and complex query operations. The technical problem is how to optimize the query, select a suitable query plan, and use the advantages of each data source to improve query performance and execution efficiency. To this end, the query optimization and performance module comprises a query performance improvement sub-module and an accelerated query operation sub-module;

A heterogeneous data joint query may need to transmit, merge, and integrate data between different data sources. The technical problem is how to process data transmission and integration effectively while reducing network delay and data processing overhead. To this end, the data transmission and integration module comprises a data transmission sub-module and a data integration sub-module;

A heterogeneous data joint query may involve the security and authority control of a plurality of data sources. The technical problem is how to ensure the security and compliance of data during cross-data-source queries and to apply the authority control mechanism of each data source correctly. To this end, the security and authority control module comprises a data source security sub-module and an authority control sub-module;

Because heterogeneous data is complex, various exceptions, such as connection failure and inconsistent data, may occur during a query. The technical problem is how to design and implement exception handling and fault tolerance mechanisms that ensure the query executes correctly and the result is accurate. To this end, the exception handling and fault tolerance mechanism module comprises an exception handling mechanism sub-module and a fault tolerance mechanism sub-module;

As shown in fig. 2, the heterogeneous database query processing layer is responsible for sending a query request, acquiring data from a plurality of clusters after the query request is received, and merging the query results into a final result;

the query result storage layer is responsible for storing query results in different data stores, and uses caching to ensure the consistency and availability of the query results;

the query result query engine layer is responsible for receiving the query result through middleware, performing business logic processing on the query result, and finally outputting the result;

the final output result layer is responsible for outputting the results using the Markdown parsing engine.
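The caching mentioned for the query result storage layer might resemble a small in-memory cache keyed by the SQL text. This is only a sketch: the patent does not name a caching technology, so the least-recently-used eviction policy, the class name `QueryResultCache`, and the use of `LinkedHashMap` in access order are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for query results, keyed by the SQL text.
final class QueryResultCache<V> {

    private final int capacity;
    private final LinkedHashMap<String, V> entries;

    QueryResultCache(int capacity) {
        this.capacity = capacity;
        // accessOrder = true keeps the least recently used entry first in iteration order.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > QueryResultCache.this.capacity;
            }
        };
    }

    // Returns the cached result, or null if the query has not been cached.
    synchronized V get(String sql) {
        return entries.get(sql);
    }

    // Stores a result, evicting the least recently used entry when the cache is full.
    synchronized void put(String sql, V result) {
        entries.put(sql, result);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        QueryResultCache<String> cache = new QueryResultCache<>(2);
        cache.put("SELECT 1", "one");
        cache.put("SELECT 2", "two");
        cache.get("SELECT 1");          // marks "SELECT 1" as recently used
        cache.put("SELECT 3", "three"); // evicts "SELECT 2"
        System.out.println(cache.get("SELECT 2")); // prints null
    }
}
```

An in-process cache like this does not by itself keep results consistent with the underlying sources; entries would need expiry or invalidation when the source data changes.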

As shown in fig. 3, a processing method for heterogeneous data joint query comprises the following steps:

The core code is as follows:

    public Pair<List<String>, List<List<String>>> execute(ExecuteSqlDTO dto) {
        JdbcBase jdbcBase = dto.getJdbcBase();
        JdbcConfig jdbcConfig = jdbcBase.getJdbcConfig();
        String sqlContent = dto.getSql();
        String key = dto.getKey();
        String type = dto.getType();
        Integer count = dto.getMaxCount();
        FunctionTypeEnum functionType = dto.getFunctionType();
        Pair<List<String>, List<List<String>>> columnAndResultPair = null;
        Connection connection = null;
        Statement statement = null;
        ResultSet resultSet = null;
        boolean hasResultSet = false;
        try {
            // Split the submitted script into individual statements.
            String[] subSqlArr = sqlContent.split(";");
            log.info("JdbcBaseBiz.execute.subSqlArr:{}", JsonUtils.toJson(subSqlArr));
            sqlRunRedisManager.addLog(key, "Connecting to the database...",
                    SqlLogLevelEnum.WARNING.getCode(), false);
            // Acquire the data source connection.
            connection = getConnection(jdbcBase, type);
            sqlRunRedisManager.addLog(key, "Database connection acquired...",
                    SqlLogLevelEnum.WARNING.getCode(), false);
            statement = connection.createStatement();
            if (StringUtils.isNotEmpty(jdbcConfig.getSchema())) {
                // Set the Oracle schema.
                statement.execute(SQL_SENTENCE_JOIN + jdbcConfig.getSchema());
            }
            // Cap the number of rows any result set may return.
            statement.setMaxRows(count);
            if (statement instanceof HiveStatement) {
                sqlStatementMonitor.monitoringStatement(statement, key);
            }
            for (String sql : subSqlArr) {
                if (StringUtils.isNotEmpty(sql.trim())) {
                    // Strip SQL comments; skip the statement if nothing remains.
                    sql = SqlCheckUtil.deleteNote(sql);
                    if (StringUtils.isEmpty(sql.trim())) {
                        continue;
                    }
                    sqlRunRedisManager.addLog(key, "Executing SQL: " + sql,
                            SqlLogLevelEnum.INFO.getCode(), false);
                    hasResultSet = statement.execute(StringUtils.trimWhitespace(sql));
                }
                if (hasResultSet && Objects.nonNull(resultSetHandler)) {
                    resultSet = statement.getResultSet();
                    // Process the result set.
                    columnAndResultPair =
                            resultSetHandler.handleResultSet(functionType, resultSet, key);
                } else {
                    int updateCount = statement.getUpdateCount();
                    log.error("JdbcBaseBiz.execute.updateCount affected rows: {}", updateCount);
                }
            }
        } catch (Exception e) {
            log.error("JdbcBaseBiz.execute.e", e);
            sqlRunRedisManager.addLog(key, "Execution error: " + e.getMessage(),
                    SqlLogLevelEnum.ERROR.getCode(), true);
            buildUpDateHistoryQuery(key, SqlQueryStatusEnum.EXCEPTION.getCode());
        } finally {
            sqlRunRedisManager.addLog(key, "Execution finished",
                    SqlLogLevelEnum.WARNING.getCode(), true, null, System.currentTimeMillis());
            // Release the JDBC resources whether or not the query succeeded.
            close(connection, statement, resultSet, jdbcBase);
        }
        return columnAndResultPair;
    }

The processing system and the processing method for the heterogeneous data joint query have the following advantages:

(1) Data integration and consistency: data integration and consistency are achieved by jointly querying the data of different data sources. An organization can obtain the information it needs from a plurality of data sources without copying data or performing complicated data conversion, which reduces data redundancy and inconsistency;

(2) Improved query flexibility: the heterogeneous data joint query method allows the user to execute query operations across different database engines and clusters. This gives the user greater flexibility and freedom to span different data sources, obtain a more comprehensive view of the data, and perform more complex queries and analysis;

(3) Resource optimization and performance improvement: through query optimization and parallel query execution, the heterogeneous data joint query method makes the most of the advantages of each data source and improves query performance and execution efficiency. It also avoids the overhead of data copying and integration, saving storage space and network bandwidth;

(4) Unified query interface and grammar: the heterogeneous data joint query method provides a unified query interface and grammar, so that a user does not need to learn and adapt to a plurality of database engines and query grammars. The user accesses and queries heterogeneous data sources through one interface using familiar query statements, which simplifies development and querying;

(5) Real-time data query and analysis: the heterogeneous data joint query method supports real-time query and analysis without batch data extraction and conversion. A user can obtain the latest data immediately and perform real-time data analysis and decision making, which improves the speed and accuracy of business responses;

(6) Extensibility and compatibility: the heterogeneous data joint query method has good extensibility and compatibility. It can be integrated with different database engines and clusters, whether relational databases, NoSQL databases, or other types of data storage systems, through suitable interfaces and adapters.

The foregoing has shown and described the basic principles, main features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, that the above embodiments and descriptions merely illustrate the principles of the present invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. The processing system for the heterogeneous data joint query is characterized by comprising a data source connection and access module, a data model and grammar difference module, a query optimization and performance module, a data transmission and integration module, a security and authority control module and an exception handling and fault tolerance mechanism module:

2. The system of claim 1, wherein the data source connection and access module comprises a data source connection sub-module and a data source access sub-module;

3. The system for processing heterogeneous data joint queries according to claim 1, wherein the data model and grammar difference module comprises a data model sub-module and a grammar difference sub-module;

4. The heterogeneous data joint query processing system of claim 1, wherein the query optimization and performance module comprises a query performance improvement sub-module and an accelerated query operation sub-module;

5. The system for processing heterogeneous data joint queries according to claim 1, wherein the data transmission and integration module comprises a data transmission sub-module and a data integration sub-module;

6. The system for processing heterogeneous data joint queries according to claim 1, wherein the security and authority control module comprises a data source security sub-module and an authority control sub-module;

7. The system of claim 1, wherein the exception handling and fault tolerance mechanism module comprises an exception handling mechanism sub-module and a fault tolerance mechanism sub-module;

8. The method for processing the heterogeneous data joint query is realized based on the processing system for processing the heterogeneous data joint query according to any one of claims 1-7, and is characterized by comprising the following steps:

S2: storing the URLs and parameter information of a plurality of databases in a configuration file, and passing these parameters when the joint query is executed;

S3: selecting one of the databases to execute the SQL query, and extracting the required result set;