CN115422205A - Data processing method and device, electronic equipment and storage medium

Info

Publication number: CN115422205A
Application number: CN202211262636.4A
Authority: CN
Inventors: 孙若曦; 徐飞; 耿立琪; 刘奇; 黄东旭; 崔秋
Original assignee: Pingkai Star Beijing Technology Co ltd
Current assignee: Pingkai Star Beijing Technology Co ltd
Priority date: 2022-10-14
Filing date: 2022-10-14
Publication date: 2022-12-02

Abstract

The embodiments of the application provide a data processing method and device, an electronic device and a storage medium, and relate to the technical field of databases. The method comprises: receiving a data query request and determining a target data table corresponding to the data query request; determining a target index corresponding to the target data table from a plurality of redistribution indexes, wherein the data in each redistribution index is stored in the distributed database distributed according to the index column of that redistribution index; when an operation for the data query request points to the index column of the target index, optimizing the original execution plan according to the target index to generate a target execution plan, wherein the operation for the data query request comprises a single-table aggregation operation and/or a multi-table association operation; and running the target execution plan to obtain a query result corresponding to the data query request. The embodiments of the application achieve query optimization, reduce cross-node data exchange, improve data processing efficiency and improve the performance of the distributed database as a whole.

Description

Data processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of database technologies, and in particular, to a data processing method, an apparatus, an electronic device, and a storage medium.

Background

With the development of information technology, databases are widely applied, the storage amount of data is increasing day by day, and the query requirements of users on the data in the databases are more and more complex. This requires optimization of the query against the database to ensure the query efficiency and quality of the database when processing a large number of complex query requests.

A distributed HTAP (Hybrid Transactional and Analytical Processing) database refers to a distributed database that has both OLTP (On-Line Transactional Processing) and OLAP (On-Line Analytical Processing) capabilities.

At present, in a distributed HTAP database, data can only be distributed according to the primary key of a table, so OLAP queries cannot be optimized. A large amount of cross-node data exchange is required during data queries, data processing efficiency is low, and the performance of the database is reduced.

Disclosure of Invention

The embodiment of the application provides a data processing method and device, an electronic device and a storage medium, and can solve the problems that a distributed HTAP database in the prior art cannot perform query optimization on OLAP query and is low in data processing efficiency.

The technical scheme is as follows:

according to an aspect of an embodiment of the present application, there is provided a method of data processing, the method including:

receiving a data query request, and determining a target data table corresponding to the data query request;

determining a target index corresponding to the target data table from a plurality of redistribution indexes; wherein the data in the redistribution index is stored in a distributed database based on an index column distribution of the redistribution index;

when the operation aiming at the data query request points to the index column of the target index, optimizing an original execution plan according to the target index to generate a target execution plan;

the operation aiming at the data query request comprises a single-table aggregation operation and/or a multi-table association operation;

and operating the target execution plan to obtain a query result corresponding to the data query request.

Optionally, the method further comprises:

taking at least one column in a data table to be queried as an index column;

establishing the redistribution index based on the index column; the redistribution index comprises all rows and all columns of the corresponding data table to be queried, the redistribution index comprises a plurality of data buckets, and all data rows within a data bucket have the same index value;

determining the same distribution group to which the redistribution index belongs; the same distribution group comprises a plurality of redistribution indexes with the same index column data distribution.

Optionally, the data buckets with the same index value in the same distribution group are stored in the same database node; and the data buckets with the same index values in the same distribution group are migrated as a whole when data scheduling occurs.

Optionally, the operation for the data query request comprises a single table aggregation operation;

the optimizing the original execution plan according to the redistribution index comprises the following steps:

determining an aggregated data table participating in the single-table aggregation operation, and determining a first index corresponding to the aggregated data table from the plurality of redistribution indexes;

and if the grouping column specified by the single-table aggregation operation contains the index column of the first index, scanning the first index, and deleting a cross-node data exchange operator in the original execution plan.

Optionally, the operation for the data query request comprises a multi-table association operation;

the optimizing the original execution plan according to the redistribution index comprises:

determining at least two associated data tables participating in the multi-table association operation, and determining a second index corresponding to the associated data tables from the plurality of redistribution indexes;

and if the associated column specified by the multi-table associated operation comprises an index column of each second index and each second index belongs to the same distribution group, scanning each second index and deleting the cross-node data exchange operator in the original execution plan.

Optionally, the method further comprises:

if at least one first associated data table meeting preset conditions exists and at least one second associated data table not meeting the preset conditions exists, scanning redistribution indexes corresponding to the first associated data table, eliminating cross-node data exchange operators corresponding to the first associated data table, scanning the second associated data table, and reserving the cross-node data exchange operators corresponding to the second associated data table;

the preset condition is that the associated data table has a redistribution index corresponding to the associated data table, and the associated column specified by the multi-table association operation includes an index column of the redistribution index corresponding to the associated data table.
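The per-table decision described in the two paragraphs above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `plan_join_inputs`, the shape of the `indexes` mapping, and the extra table `C` are hypothetical and do not come from the application.

```python
def plan_join_inputs(tables, join_cols, indexes):
    """For each associated table, pick the scan target and decide whether the
    cross-node exchange operator is kept.

    indexes maps a table name to (index name, tuple of index columns); this
    representation is an assumption made for the sketch.
    """
    decisions = {}
    for table in tables:
        entry = indexes.get(table)
        # Preset condition: the table has a redistribution index and the
        # association columns include the index columns of that index.
        if entry is not None and set(entry[1]) <= set(join_cols):
            decisions[table] = (entry[0], False)   # scan index, drop exchange
        else:
            decisions[table] = (table, True)       # scan table, keep exchange
    return decisions


decisions = plan_join_inputs(
    tables=["A", "B", "C"],
    join_cols=["supplier"],
    indexes={"A": ("A'", ("supplier",)), "B": ("B'", ("commodity",))},
)
```

In this hypothetical call only table `A` meets the preset condition, so its exchange operator is eliminated while those of `B` and `C` are retained.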

Optionally, the method further comprises:

maintaining the plurality of redistribution indexes unchanged when data scheduling occurs in the distributed database;

when the data scheduling in the distributed database is finished, updating the plurality of redistribution indexes.

According to another aspect of embodiments of the present application, there is provided a data processing apparatus including:

the target data table determining module is used for receiving a data query request and determining a target data table corresponding to the data query request;

a target index determination module for determining a target index corresponding to the target data table from a plurality of redistribution indexes; wherein the data in the redistribution index is stored in a distributed database based on an index column distribution of the redistribution index;

the optimization module is used for optimizing an original execution plan according to the target index to generate a target execution plan when the operation aiming at the data query request points to the index column of the target index; the operation aiming at the data query request comprises a single-table aggregation operation and/or a multi-table association operation;

and the execution module is used for operating the target execution plan to obtain a query result corresponding to the data query request.

According to another aspect of the embodiments of the present application, there is provided an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the data processing methods described above when executing the computer program.

According to a further aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the data processing methods described above.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

by establishing the target indexes of the target data table in advance and distributing the data in the target indexes in the distributed database according to the index columns, query optimization can be realized when single-table aggregation operation and/or multi-table association operation aiming at the data query request point to the index columns of the target indexes. The original execution plan is optimized according to the target index to obtain the target execution plan, cross-node data exchange operation in the target execution plan is reduced, data processing efficiency is improved, and performance of the whole distributed database is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a query optimization system according to an embodiment of the present application;

Fig. 3 is a system architecture diagram of a distributed HTAP database according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".

To make the objects, technical solutions and advantages of the present application more clear, the following detailed description of the embodiments of the present application will be made with reference to the accompanying drawings.

In a distributed HTAP database, data exists in two forms, row storage and column storage, which correspond to OLTP-oriented and OLAP-oriented scenarios respectively, and the row-store and column-store data are synchronized in some manner. In an HTAP database with strong real-time requirements, the synchronization method requires the row-store and column-store data to follow the same distribution in order to guarantee synchronization efficiency, i.e., real-time performance; this in turn requires that the data in the HTAP database follow the distribution used for OLTP.

In the distributed HTAP database, in order to serve OLTP requirements, data can only be distributed according to the primary key of the table, so OLAP queries cannot be optimized. If an OLAP query contains a single-table aggregation whose grouping column is a non-primary-key column, or a multi-table association whose association column is a non-primary-key column, inefficient cross-node data exchange is required, which greatly reduces the performance of the distributed HTAP database.

The application provides a data processing method, a data processing device, an electronic device and a storage medium, and aims to solve the above technical problems in the prior art.

The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps, etc. in different embodiments is not repeated.

Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application, and as shown in fig. 1, the method includes:

step S101, receiving a data query request, and determining a target data table corresponding to the data query request.

Specifically, the data query request may be a request sent by a user to perform a data query. The data query request may include query object information, i.e., which data tables the query needs to be performed on; it may also include result information of the query, i.e., which information needs to be filtered out of those data tables.

The data table to be queried indicated by the data query request is used as a corresponding target data table, and the data query request may correspond to one target data table or multiple target data tables, which is not limited in the embodiment of the present application.

Step S102, determining a target index corresponding to the target data table from the plurality of redistribution indexes; wherein the data in the redistribution index is stored in the distributed database based on the index column distribution of the redistribution index.

Specifically, before a data query is performed, a plurality of redistribution indexes can be established in advance, and the data in each redistribution index is stored in the distributed database distributed according to the index column of that redistribution index. The distributed database may comprise a plurality of database nodes; the database node storing the corresponding data can be located quickly and accurately through the redistribution index, and the index column can be used to represent the distribution position of the corresponding data in the distributed database.

In the distributed database, the data in the data table is generally stored in a distributed manner based on the primary key, and in the embodiment of the application, the data in the data table can be redistributed by establishing the redistribution index so as to meet the condition of subsequent query optimization.

The target index may be a redistribution index corresponding to the target data table, and the target data table may correspond to one target index or a plurality of target indexes, which is not limited in this embodiment of the present application.

Step S103, when the operation aiming at the data query request points to the index column of the target index, optimizing the original execution plan according to the target index to generate a target execution plan;

the operations directed to the data query request include single table aggregation operations and/or multi-table association operations.

Specifically, to obtain the result information in the data query request, a series of operations need to be performed on the target data table, and the operations for the data query request may include a single-table aggregation operation and/or a multi-table association operation. The single-table aggregation operation may be an operation of aggregating or merging multiple rows or multiple columns of data in one table, and the multi-table association operation may be an operation of associating multiple tables.

The following description takes Table A and Table B as an example. Table A is a sales detail table containing a sales detail number (the primary key, different for each sales record), the commodity sold, the supplier of the commodity, and the selling price. Table B is a purchase detail table containing a purchase detail number (the primary key, different for each purchase record), the commodity purchased, the supplier of the commodity, the purchase price, and the purchase amount.

Table A: sales detail table

Sales detail number	Commodity	Supplier	Selling price
1	A	Supplier A	10
2	A	Supplier A	10
3	A	Supplier B	15
4	B	Supplier A	20
5	B	Supplier B	25
6	B	Supplier B	25
7	C	Supplier A	35
8	C	Supplier B	30
9	C	Supplier A	35
10	C	Supplier B	30

Table B: purchase detail table

Purchase detail number	Commodity	Supplier	Purchase price	Purchase amount
1	A	Supplier A	8	100
2	A	Supplier B	10	100
3	B	Supplier A	18	100
4	B	Supplier B	20	100
5	C	Supplier A	30	100
6	C	Supplier B	28	100

For example, data query request 1 is "query the number of sales of all commodities for each supplier". All sales records in Table A are grouped by supplier, and the number of records in each supplier's group is counted. This is a single-table aggregation operation; that is, the operation for data query request 1 includes a single-table aggregation operation.

For another example, data query request 2 is "query the gross profit on sales of all commodities for each supplier". The sales records in Table A and the purchase records in Table B need to be retrieved and matched against each other on the supplier column. This is a multi-table association operation; that is, the operation for data query request 2 includes a multi-table association operation.

In the prior art, in order to meet OLTP requirements, the data table is stored in the distributed HTAP database distributed according to its primary key. Taking Table A as an example, the primary key of Table A is the sales detail number and the table contains 10 rows of data; the rows with sales detail numbers 1 to 5 may be stored in database node DB1, and the rows with sales detail numbers 6 to 10 may be stored in database node DB2. To satisfy data query request 1, distributed computation is required, with supplier A computed by DB1 and supplier B computed by DB2. Because the sales records of supplier A and supplier B are scattered across DB1 and DB2, DB1 and DB2 must perform cross-node data exchange: DB1 sends its local rows for supplier B to DB2, and DB2 sends its local rows for supplier A to DB1. Only then can DB1 and DB2 compute the total number of sales for supplier A and supplier B respectively.
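The cost of the primary-key distribution in this example can be made concrete with a small Python sketch. It is illustrative only: the tuple layout, the helper names, and the use of the labels "A" and "B" for the two suppliers are assumptions of the sketch, while the row data and the node assignment follow the example above.

```python
from collections import Counter

# Table A rows: (sales detail number, commodity, supplier, selling price).
TABLE_A = [
    (1, "A", "A", 10), (2, "A", "A", 10), (3, "A", "B", 15),
    (4, "B", "A", 20), (5, "B", "B", 25), (6, "B", "B", 25),
    (7, "C", "A", 35), (8, "C", "B", 30), (9, "C", "A", 35),
    (10, "C", "B", 30),
]

# Node that aggregates each supplier, as in the example.
OWNER_OF_SUPPLIER = {"A": "DB1", "B": "DB2"}


def node_by_primary_key(row):
    """Primary-key distribution: detail numbers 1-5 on DB1, 6-10 on DB2."""
    return "DB1" if row[0] <= 5 else "DB2"


def rows_to_exchange(rows):
    """List the rows that must cross nodes before grouping by supplier."""
    moved = []
    for row in rows:
        src = node_by_primary_key(row)
        dst = OWNER_OF_SUPPLIER[row[2]]
        if src != dst:
            moved.append((row[0], src, dst))
    return moved


moved = rows_to_exchange(TABLE_A)
sales_count = Counter(row[2] for row in TABLE_A)
```

Under these assumptions four of the ten rows (detail numbers 3, 5, 7 and 9) have to be sent to the other node before either node can produce its count.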

In the embodiment of the present application, a redistribution index, Table A', may be generated using the supplier column in Table A as the index column, as follows:

Table A':

Sales detail number	Commodity	Supplier	Selling price
1	A	Supplier A	10
2	A	Supplier A	10
4	B	Supplier A	20
7	C	Supplier A	35
9	C	Supplier A	35
3	A	Supplier B	15
5	B	Supplier B	25
6	B	Supplier B	25
8	C	Supplier B	30
10	C	Supplier B	30

The supplier column in Table A' is the index column; the five rows of data for supplier A in Table A' may be stored in database node DB1, and the five rows of data for supplier B in Table A' may be stored in database node DB2.

With Table A' stored in the distributed database in the above distribution, when data query request 1 is received, the target data table is Table A, the target index is Table A', and the index column is the supplier column. The single-table aggregation operation in data query request 1 specifies the supplier column as the grouping column, which is the index column of the target index. Therefore, to obtain the query result corresponding to data query request 1, namely the total number of sales for supplier A and for supplier B, the query computation can be performed directly and independently on DB1 and DB2 without cross-node data exchange between them; that is, same-distribution optimization is achieved.

Same-distribution optimization is a kind of query optimization. In a distributed database, a user can usually select any column in a table as an index column, so that the data of the table is distributed to different database nodes according to that index column. When single-table aggregation or multi-table association is performed, if the grouping column of the aggregation or the association column of the association is the index column, each database node can perform the aggregation or association computation purely locally, avoiding cross-node data exchange.
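A minimal Python sketch of this local aggregation over Table A' is given here. The helper names and the placement map `NODE_OF_INDEX_VALUE` are hypothetical; the supplier labels "A" and "B" and the node assignment follow the example.

```python
from collections import defaultdict

# Table A rows: (sales detail number, commodity, supplier, selling price).
TABLE_A = [
    (1, "A", "A", 10), (2, "A", "A", 10), (3, "A", "B", 15),
    (4, "B", "A", 20), (5, "B", "B", 25), (6, "B", "B", 25),
    (7, "C", "A", 35), (8, "C", "B", 30), (9, "C", "A", 35),
    (10, "C", "B", 30),
]

# Assumed placement of index values on nodes, as in the example.
NODE_OF_INDEX_VALUE = {"A": "DB1", "B": "DB2"}


def distribute_by_index_column(rows, index_col):
    """Place every full row on the node chosen by its index-column value."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[NODE_OF_INDEX_VALUE[row[index_col]]].append(row)
    return dict(nodes)


def local_group_count(node_rows, group_col):
    """Count rows per group using only the rows stored on one node."""
    counts = defaultdict(int)
    for row in node_rows:
        counts[row[group_col]] += 1
    return dict(counts)


index_a = distribute_by_index_column(TABLE_A, index_col=2)
per_node = {
    node: local_group_count(rows, group_col=2)
    for node, rows in index_a.items()
}
```

Each node's partial result already covers a complete group, so the final answer is simply the union of the per-node results and no rows are exchanged.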

Likewise, a redistribution index, Table B', may be generated using the supplier column in Table B as the index column:

Table B':

Purchase detail number	Commodity	Supplier	Purchase price	Purchase amount
1	A	Supplier A	8	100
3	B	Supplier A	18	100
5	C	Supplier A	30	100
2	A	Supplier B	10	100
4	B	Supplier B	20	100
6	C	Supplier B	28	100

The supplier column in Table B' is the index column; the three rows of data for supplier A in Table B' may be stored in database node DB1, and the three rows of data for supplier B in Table B' may be stored in database node DB2.

With Table B' stored in the distributed database in the above distribution, when data query request 2 is received, the target data tables are Table A and Table B, the target indexes are Table A' and Table B', and the index column of both Table A' and Table B' is the supplier column. The multi-table association operation in data query request 2 specifies the supplier column as the association column, which is the index column of the target indexes. To obtain the query result corresponding to data query request 2, namely the gross profit on sales for supplier A and for supplier B, note that the five rows for supplier A in Table A' and the three rows for supplier A in Table B' are all stored in DB1, so the association computation can be performed directly on DB1 to obtain the gross profit for supplier A without cross-node data exchange with DB2. In the same way, the gross profit for supplier B can be obtained by performing the association computation directly on DB2 without cross-node data exchange with DB1. That is, same-distribution optimization is achieved.
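The following Python sketch checks, for the example data, that associating Table A' and Table B' node by node yields the same matched pairs as a global association on the supplier column. It deliberately stops at the matching step, since the application does not give a gross-profit formula; the helper names and the placement map are hypothetical.

```python
# Rows: Table A (detail number, commodity, supplier, selling price),
#       Table B (detail number, commodity, supplier, purchase price, amount).
TABLE_A = [
    (1, "A", "A", 10), (2, "A", "A", 10), (3, "A", "B", 15),
    (4, "B", "A", 20), (5, "B", "B", 25), (6, "B", "B", 25),
    (7, "C", "A", 35), (8, "C", "B", 30), (9, "C", "A", 35),
    (10, "C", "B", 30),
]
TABLE_B = [
    (1, "A", "A", 8, 100), (2, "A", "B", 10, 100), (3, "B", "A", 18, 100),
    (4, "B", "B", 20, 100), (5, "C", "A", 30, 100), (6, "C", "B", 28, 100),
]
SUPPLIER = 2                                   # position of the supplier column
NODE_OF_SUPPLIER = {"A": "DB1", "B": "DB2"}    # assumed shared placement


def place(rows):
    """Distribute rows to nodes by supplier, as Table A' and Table B' do."""
    nodes = {"DB1": [], "DB2": []}
    for row in rows:
        nodes[NODE_OF_SUPPLIER[row[SUPPLIER]]].append(row)
    return nodes


def join_on_supplier(left, right):
    """Return (sales detail number, purchase detail number) for matching rows."""
    return [(l[0], r[0]) for l in left for r in right
            if l[SUPPLIER] == r[SUPPLIER]]


a_idx, b_idx = place(TABLE_A), place(TABLE_B)
local_pairs = {node: join_on_supplier(a_idx[node], b_idx[node])
               for node in a_idx}
global_pairs = join_on_supplier(TABLE_A, TABLE_B)
```

Because both indexes use the same placement for each supplier value, every matching pair is found on a single node.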

It should be noted that the following embodiments are all described on the basis of the above examples, but the above examples do not limit the data processing method of the present application.

According to this principle, when the single-table aggregation operation and/or multi-table association operation of the data query request points to the index column of the target index, query optimization can be achieved: the original execution plan can be optimized according to the target index, and the optimized execution plan is taken as the target execution plan. An execution plan may be the set of operations that implements a given query request, and the original execution plan may be the execution plan generated on the premise that the target data table is distributed by its primary key column.

The original execution plan is optimized according to the target index to obtain the target execution plan, cross-node data exchange operation in the target execution plan can be reduced as much as possible, data processing efficiency is improved, and performance of the whole distributed database can be improved.

In addition, the query optimization in the embodiment of the present application is implemented based on the index, and the distributed database can maintain consistency of the target data and the corresponding target index by processing the target data and the corresponding target index simultaneously in the transaction.

And step S104, operating the target execution plan to obtain a query result corresponding to the data query request.

Specifically, after the target execution plan is obtained, the target execution plan may be run, and a query result corresponding to the data query request may be output.

In the embodiment of the application, the target indexes of the target data table are established in advance, and the data in the target indexes are distributed in the distributed database according to the index columns, so that query optimization can be realized when single-table aggregation operation and/or multi-table association operation aiming at the data query request points to the index columns of the target indexes. The original execution plan is optimized according to the target index to obtain the target execution plan, cross-node data exchange operation in the target execution plan is reduced, data processing efficiency is improved, and performance of the whole distributed database is improved.

As an alternative embodiment, the method further comprises:

taking at least one column in a data table to be queried as an index column;

establishing a redistribution index based on the index column, wherein the redistribution index comprises all rows and all columns of the corresponding data table to be queried, the redistribution index comprises a plurality of data buckets, and all data rows within a data bucket have the same index value.

Specifically, a data table to be queried may be obtained in advance, the data table to be queried may be a data table that may be involved in data query, and the data table to be queried may include a target data table.

At least one column is selected from the data table to be queried as an index column, where any column in the data table to be queried may be used as the index column, or a combination of at least two columns in the data table to be queried may be used as the index column, and the at least two columns may or may not be in an adjacent relationship, which is not limited in the embodiment of the present application.

The index column may include a plurality of index values; different index values may be arranged in any chosen order, and identical index values are arranged consecutively. Based on the distribution of the index values in the index column, each row of data in the data table to be queried can be rearranged according to its index value, so that it is placed at the position of that index value. The rearranged data table is taken as the redistribution index of the data table to be queried. Examples of a data table to be queried and its redistribution index are Table A and Table A', and Table B and Table B', in the above embodiment.

The set formed by all rows of data with the same index value is taken as a data bucket, so the index values of all rows in a data bucket are the same. The redistribution index built according to the above steps includes a plurality of data buckets, and includes all rows and all columns of the corresponding data table to be queried.

The redistribution index contains the same data content as the corresponding data table to be queried, i.e., all of its rows and all of its columns. The redistribution index is established in order to redistribute the data of the data table to be queried within the distributed database, so as to satisfy the conditions for subsequent query optimization.
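The construction of data buckets described above can be sketched in Python as follows. The dictionary-based row representation, the column names, and the function `build_buckets` are assumptions made for illustration; the sketch also accepts a combination of columns as the index column, which the embodiment permits.

```python
COLUMNS = ("id", "commodity", "supplier", "price")
RAW_ROWS = [
    (1, "A", "A", 10), (2, "A", "A", 10), (3, "A", "B", 15),
    (4, "B", "A", 20), (5, "B", "B", 25), (6, "B", "B", 25),
    (7, "C", "A", 35), (8, "C", "B", 30), (9, "C", "A", 35),
    (10, "C", "B", 30),
]
TABLE_A = [dict(zip(COLUMNS, row)) for row in RAW_ROWS]


def build_buckets(rows, index_cols):
    """Group full rows into data buckets keyed by their index value.

    Every row keeps all of its columns, so the buckets together contain
    all rows and all columns of the source table.
    """
    buckets = {}
    for row in rows:
        key = tuple(row[col] for col in index_cols)
        buckets.setdefault(key, []).append(row)
    return buckets


by_supplier = build_buckets(TABLE_A, ("supplier",))
by_commodity_and_supplier = build_buckets(TABLE_A, ("commodity", "supplier"))
```

With the supplier column as the index column, the two buckets hold the same rows, in the same order, as the two halves of Table A'.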

In the embodiment of the application, the redistribution index is established by rearranging the data table to be queried according to the index column without changing the composition of its rows, so the redistribution index comprises all rows and all columns of the data table to be queried; that is, the redistribution index has the property of a clustered index. At the same time, the redistribution index is stored in the distributed database independently of the data table to be queried; that is, the redistribution index also has the property of a secondary index. Furthermore, the distributed database can maintain consistency between the data to be queried and its redistribution index by processing both within the same transaction.

After the redistribution index is established, the same distribution group to which it belongs may be determined. A same distribution group includes a plurality of redistribution indexes whose index columns have the same data distribution; in the above example, the index columns of Table A' and Table B' have the same data distribution, so the two belong to one same distribution group.

As an alternative embodiment, in the method,

data buckets with the same index value in the same distribution group are stored in the same database node; data buckets with the same index value in the same distribution group are migrated as a whole when data scheduling occurs.

Specifically, a same distribution group comprises a plurality of redistribution indexes, a redistribution index comprises a plurality of data buckets, and data buckets with the same index value in the same distribution group are stored in the same database node. For example, rows 1 to 5 of Table A' form data bucket A1 and rows 6 to 10 form data bucket A2; rows 1 to 3 of Table B' form data bucket B1 and rows 4 to 6 form data bucket B2. Table A' and Table B' belong to the same distribution group. The supplier in both data bucket A1 and data bucket B1 is supplier A, i.e., their index values are the same, so data bucket A1 and data bucket B1 are both stored in DB1; likewise, data bucket A2 and data bucket B2 are both stored in DB2.

In addition, the data buckets with the same index value in the same distribution group are migrated as a whole when data scheduling occurs, and thus the data buckets with the same index value can be always stored in the same database node before and after data scheduling, the distribution condition of the data buckets with the same index value cannot be influenced before and after data scheduling, and further the condition of same distribution optimization can be met before and after data scheduling.
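The co-location rule described above can be illustrated with a minimal Python sketch. The class name `DistributionGroup` and its methods `place` and `migrate` are hypothetical and do not come from the application; the sketch only assumes that one mapping from index value to node is shared by every redistribution index in the group.

```python
from collections import defaultdict


class DistributionGroup:
    """Hypothetical bookkeeping for one same distribution group."""

    def __init__(self):
        # index value -> node name; one entry covers ALL indexes in the group
        self.node_of_value = {}
        # node name -> list of (index name, index value) buckets stored there
        self.buckets_on_node = defaultdict(list)

    def place(self, index_name, index_value, default_node):
        # The first bucket with this index value decides the node;
        # later buckets with the same value follow it.
        node = self.node_of_value.setdefault(index_value, default_node)
        self.buckets_on_node[node].append((index_name, index_value))
        return node

    def migrate(self, index_value, target_node):
        # Move every bucket with this index value together, as a whole.
        source = self.node_of_value[index_value]
        moving = [b for b in self.buckets_on_node[source] if b[1] == index_value]
        self.buckets_on_node[source] = [
            b for b in self.buckets_on_node[source] if b[1] != index_value
        ]
        self.buckets_on_node[target_node].extend(moving)
        self.node_of_value[index_value] = target_node


group = DistributionGroup()
group.place("A'", "a", "DB1")  # bucket A1
group.place("B'", "a", "DB2")  # bucket B1 follows A1 to DB1
group.place("A'", "b", "DB2")  # bucket A2
group.place("B'", "b", "DB2")  # bucket B2
group.migrate("a", "DB3")      # A1 and B1 move together
```

Because placement and migration are keyed by index value rather than by individual index, buckets A1 and B1 cannot end up on different nodes in this sketch.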

As an alternative embodiment, the operation for the data query request includes a single table aggregation operation;

determining an aggregated data table participating in a single-table aggregation operation, and determining a first index corresponding to the aggregated data table from the plurality of redistribution indexes;

and if the grouping column specified by the single table aggregation operation contains an index column of a first index corresponding to the aggregated data table, scanning the first index, and deleting a cross-node data exchange operator in the original execution plan.

Specifically, when the operation for the data query request includes a single-table aggregation operation, optimizing the original execution plan may include the following. First, the data table on which the single-table aggregation operation is to be performed is taken as the aggregated data table. Then, the redistribution index of the aggregated data table is determined from the plurality of redistribution indexes and taken as the first index. If the grouping column specified by the single-table aggregation operation contains the index column of the first index, the scan of the aggregated data table is replaced with a scan of the first index, and the cross-node data exchange operator in the original execution plan is deleted. If this condition is not satisfied, the aggregated data table itself is scanned.

The grouping column specified by the single-table aggregation operation is the column of the aggregated data table on which the operation groups rows for aggregation/merging. The cross-node data exchange operator may be an operation unit that exchanges data between different database nodes.
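As a rough illustration of this rewrite, the following Python sketch represents an execution plan as nested tuples and removes the exchange operator when the condition holds. The plan representation, the operator names ("Aggregate", "Exchange", "Scan") and the function `rewrite_aggregation` are assumptions made for this example only and are not specified by the application.

```python
def rewrite_aggregation(plan, grouping_columns, index_name, index_columns):
    """plan is a nested tuple, e.g. ("Aggregate", ("Exchange", ("Scan", "A")))."""
    if not set(index_columns) <= set(grouping_columns):
        # Condition not met: keep scanning the table and exchanging data.
        return plan

    def rewrite(node):
        op = node[0]
        if op == "Exchange":
            return rewrite(node[1])      # drop the cross-node exchange operator
        if op == "Scan":
            return ("Scan", index_name)  # scan the first index instead of the table
        return (op,) + tuple(rewrite(child) for child in node[1:])

    return rewrite(plan)


original_plan = ("Aggregate", ("Exchange", ("Scan", "A")))
optimized_plan = rewrite_aggregation(original_plan, ["supplier"], "A'", ["supplier"])
unchanged_plan = rewrite_aggregation(original_plan, ["product"], "A'", ["supplier"])
```

Here `optimized_plan` scans A' with no exchange step, while `unchanged_plan` is returned as-is because the grouping column does not contain the index column.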

For data query request 1, "query the sales quantity of all commodities of each supplier", the request includes a single-table aggregation operation. The aggregated data table is table A, the grouping column is the supplier column, the first index is table A', and the index column of the first index is the supplier column. The index column of the first index is thus the grouping column specified by the single-table aggregation operation, that is, the optimization condition is satisfied.

Table A' can therefore be scanned directly. Because table A' has been stored in a distributed manner according to the supplier column in advance, all sales records of supplier a are stored in DB1 and all sales records of supplier b are stored in DB2. To respond to data query request 1, the query calculation can be performed locally in DB1 and DB2 without exchanging data with other database nodes, so the cross-node data exchange operator in the original execution plan can be eliminated and same-distribution optimization is achieved.
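A small Python sketch may help show why no exchange is needed for data query request 1. The row layout `(supplier, product, quantity)` and all quantities are invented for illustration; the function names are hypothetical.

```python
from collections import defaultdict

# Redistribution index A' stored by supplier: each node holds all rows of its suppliers.
# Rows are (supplier, product, quantity); the values are made up for illustration.
index_a_by_node = {
    "DB1": [("a", "p1", 10), ("a", "p2", 5)],
    "DB2": [("b", "p1", 7), ("b", "p3", 1)],
}


def local_aggregate(rows):
    """Sum quantity per supplier using only the rows stored on one node."""
    totals = defaultdict(int)
    for supplier, _product, quantity in rows:
        totals[supplier] += quantity
    return dict(totals)


def run_without_exchange(index_by_node):
    # Each node aggregates locally; the partial results are simply collected,
    # never re-shuffled, because no supplier appears on more than one node.
    result = {}
    for rows in index_by_node.values():
        result.update(local_aggregate(rows))
    return result


sales_per_supplier = run_without_exchange(index_a_by_node)
```

If the same rows were distributed by primary key instead, a supplier's rows could sit on several nodes and the partial sums would have to be exchanged and merged.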

As an alternative embodiment, the operation directed to the data query request comprises a multi-table association operation;

and if the associated column specified by the multi-table association operation comprises the index column of each second index and each second index belongs to the same distribution group, scanning each second index and deleting the cross-node data exchange operator in the original execution plan.

Specifically, in the case that the operation for the data query request includes a multi-table association operation, the optimization process for the original execution plan may include: firstly, taking a data table needing multi-table association operation as an association data table, and further obtaining at least two association data tables; then, the redistribution index corresponding to each associated data table is determined from the plurality of redistribution indexes to be used as a second index, and the second indexes with the same number as the associated data tables can be obtained at this time.

When the associated column specified by the multi-table association operation contains the index column of each second index, and each second index belongs to the same distribution group, scanning each associated data table can be replaced by scanning the second index corresponding to the associated data table, and meanwhile, deleting the cross-node data exchange operator in the original execution plan; when the above condition is not satisfied, each associated data table is scanned.

The associated column specified by the multi-table association operation is the column of each associated data table on which the tables are associated. Since the multi-table association operation involves multiple associated data tables, it may specify multiple associated columns.

For data query request 2, "query the sales gross profit of all commodities of each supplier", the request includes a multi-table association operation. The associated data tables are table A and table B, the associated column is the supplier column, and the second indexes are table A' and table B'. The index column of each second index is the same as the associated column, and table A' and table B' belong to the same distribution group, that is, the optimization condition is satisfied.

Table A' and table B' can therefore be scanned directly. Because table A' and table B' have been stored in a distributed manner according to the supplier column in advance, data with the same index value is stored in the same database node: all sales records and all purchase records of supplier a are stored in DB1, and all sales records and all purchase records of supplier b are stored in DB2. To respond to data query request 2, the query calculation can be performed locally in DB1 and DB2 without exchanging data with other database nodes, so the cross-node data exchange operator in the original execution plan can be eliminated and same-distribution optimization is achieved.
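The node-local association for data query request 2 can be sketched in Python as follows. The row layouts, the amounts, and the definition of gross profit as sales amount minus purchase cost are assumptions made for this illustration; the application does not define them.

```python
# A' holds sales records, B' holds purchase records; both are distributed by supplier.
# Rows are (supplier, amount); all values are invented for this example.
sales_by_node = {"DB1": [("a", 100), ("a", 50)], "DB2": [("b", 80)]}
purchases_by_node = {"DB1": [("a", 90)], "DB2": [("b", 30), ("b", 20)]}


def local_gross_profit(sales_rows, purchase_rows):
    """Associate the two indexes on supplier within a single node."""
    profit = {}
    for supplier, amount in sales_rows:
        profit[supplier] = profit.get(supplier, 0) + amount
    for supplier, cost in purchase_rows:
        # Matching suppliers are on this node because A' and B'
        # belong to the same distribution group.
        profit[supplier] = profit.get(supplier, 0) - cost
    return profit


gross_profit = {}
for node in sales_by_node:
    gross_profit.update(
        local_gross_profit(sales_by_node[node], purchases_by_node[node])
    )
```

Each node only reads its own rows from A' and B', which is the situation in which the cross-node data exchange operator can be dropped.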

As an alternative embodiment, the method further comprises:

if at least one first associated data table meeting the preset condition exists and at least one second associated data table not meeting the preset condition exists, scanning redistribution indexes corresponding to the first associated data table, eliminating cross-node data exchange operators corresponding to the first associated data table, scanning the second associated data table, and reserving the cross-node data exchange operators corresponding to the second associated data table;

the preset condition is that the associated data table has the redistribution index corresponding to the associated data table, and the associated column specified by the multi-table association operation includes an index column of the redistribution index corresponding to the associated data table.

Specifically, a multi-table association operation may involve a plurality of associated data tables, and not every associated data table necessarily has a corresponding redistribution index. The associated data tables can therefore be divided into first associated data tables and second associated data tables. The preset condition is that the associated data table has a corresponding redistribution index and that the associated column specified by the multi-table association operation includes the index column of that redistribution index. An associated data table that meets the preset condition is taken as a first associated data table; one that does not is taken as a second associated data table.

For a first associated data table, the corresponding redistribution index can be scanned directly and the cross-node data exchange operator corresponding to that table is eliminated. For a second associated data table, the table itself is scanned and its cross-node data exchange operator is retained. The distributed association in the original execution plan, in which all tables require data exchange, thus becomes a distributed association in which only the tables that do not meet the preset condition require data exchange.
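The per-table decision can be sketched as below. `ScanStep`, `plan_scans`, and the shape of the `index_of` mapping (table name to a pair of index name and index columns) are hypothetical names chosen for this example; table C and its index C' are likewise invented to show a table whose index does not match the associated column.

```python
from dataclasses import dataclass


@dataclass
class ScanStep:
    source: str           # table name or redistribution index name to scan
    needs_exchange: bool  # whether the cross-node exchange operator is kept


def plan_scans(tables, index_of, join_columns):
    """Classify each associated table against the preset condition."""
    steps = []
    for table in tables:
        index = index_of.get(table)  # (index name, index columns) or None
        if index is not None and set(index[1]) <= set(join_columns):
            # First associated data table: scan its index, no exchange.
            steps.append(ScanStep(source=index[0], needs_exchange=False))
        else:
            # Second associated data table: scan the table, keep the exchange.
            steps.append(ScanStep(source=table, needs_exchange=True))
    return steps


steps = plan_scans(
    ["A", "B", "C"],
    {"A": ("A'", ["supplier"]), "C": ("C'", ["region"])},
    ["supplier"],
)
```

In this run only table A qualifies; B has no redistribution index and the index of C is on a column that the association does not use, so both keep their exchange operators.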

As an alternative embodiment, the method further comprises:

when data scheduling occurs in the distributed database, maintaining the plurality of redistribution indexes unchanged;

Specifically, when the distributed database is expanded or contracted or hot spot scheduling is performed, data migration occurs in the distributed database, meanwhile, the distribution situation of the data in the distributed database at each database node changes, and the distribution positions of the redistribution indexes in the distributed database also change.

For a data query request received in the data scheduling process, maintaining the plurality of redistribution indexes unchanged, and performing query optimization based on the plurality of redistribution indexes which are established; after all the data in the distributed database is migrated, the distribution of the redistribution indexes in the distributed database has changed, and therefore, the redistribution indexes need to be updated, and a subsequent query optimization is performed by using the new redistribution indexes. After the plurality of redistribution indexes complete the update, the plurality of redistribution indexes before the update may be deleted.

By retaining the original redistribution index during data scheduling and reusing the new redistribution index after data scheduling is completed, the same-distribution optimization can still be effective during data scheduling.
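One possible way to picture this hand-over is a catalog that keeps serving the existing indexes until migration finishes. The class `IndexCatalog`, its method names, and the representation of an index as a mapping from index value to node are assumptions for this sketch, not details taken from the application.

```python
class IndexCatalog:
    """Hypothetical catalog of the redistribution indexes the optimizer may use."""

    def __init__(self, indexes):
        self.active = dict(indexes)  # used for query optimization
        self.pending = None          # built while data is migrating

    def begin_scheduling(self):
        # Data migration starts: queries keep using the existing indexes.
        self.pending = {}

    def stage(self, name, placement):
        # Record the post-migration placement of one index.
        self.pending[name] = placement

    def finish_scheduling(self):
        # Migration is complete: switch to the new indexes.
        old, self.active, self.pending = self.active, self.pending, None
        return old  # returned so the caller can delete the old indexes


catalog = IndexCatalog({"A'": {"a": "DB1"}})
catalog.begin_scheduling()
catalog.stage("A'", {"a": "DB3"})
during_migration = dict(catalog.active)  # still the original index
old_indexes = catalog.finish_scheduling()
```

Queries planned while `pending` is being filled see only `active`, so optimization stays valid throughout scheduling in this sketch.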

Because the redistribution index and the corresponding data table are stored independently, the data distribution of the redistribution index and the data table is independent, and the scheduling of the redistribution index and the scheduling of the data table are independent and cannot influence each other.

As an alternative embodiment, fig. 2 is a schematic structural diagram of a query optimization system provided in an embodiment of the present application, and as shown in fig. 2, the system includes:

the meta information module 201 is configured to provide the user with a way to create and delete redistribution indexes; the user may create or delete a redistribution index for a table through a CREATE TABLE statement or an ALTER TABLE statement (SQL statements).

The transaction module 202 is configured to apply the corresponding operations to the redistribution index data in the same transaction as each insertion, update, and deletion of table data, so that the redistribution index data has the same ACID semantics as the table data. ACID refers to the four properties that a database management system must have to ensure that transactions are correct and reliable when data is written or updated: atomicity, consistency, isolation, and durability.
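The idea of updating a table and its redistribution index in one transaction can be demonstrated with Python's standard `sqlite3` module. This is only a single-node stand-in for the distributed database; the table names `sales` and `sales_by_supplier`, the function `insert_sale`, and the simulated failure are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, supplier TEXT, qty INTEGER)")
# Stand-in for the redistribution index: the same rows, stored separately.
conn.execute("CREATE TABLE sales_by_supplier (supplier TEXT, id INTEGER, qty INTEGER)")


def insert_sale(row_id, supplier, qty, fail_midway=False):
    # "with conn" commits on success and rolls back if anything raises,
    # so the table and its index never diverge.
    with conn:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (row_id, supplier, qty))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "INSERT INTO sales_by_supplier VALUES (?, ?, ?)", (supplier, row_id, qty)
        )


insert_sale(1, "a", 10)
try:
    insert_sale(2, "b", 7, fail_midway=True)  # the write to sales is rolled back too
except RuntimeError:
    pass

table_rows = conn.execute("SELECT id, supplier, qty FROM sales").fetchall()
index_rows = conn.execute("SELECT id, supplier, qty FROM sales_by_supplier").fetchall()
```

After the failed second call, neither the table nor the index contains row 2, which is the atomicity property the transaction module relies on.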

The distributed storage calculation module 203 comprises a storage submodule 213 and a calculation submodule 223, wherein the storage submodule 213 is used for distributively storing the redistribution index data, and the redistribution index data is dispatched to a designated node according to the instruction of the data dispatching module so as to meet the condition of same distribution optimization; the computation sub-module 223 is used for executing a query plan in a distributed manner, and each node in the distributed cluster can compute the data stored in the node according to the execution plan.

The query optimizer module 204 is configured to perform same-distribution optimization on single-table aggregation and multi-table association queries that meet the conditions, according to the redistribution index information, so as to eliminate cross-node data exchange operations in the query plan.

The data scheduling module 205 is configured to schedule data when the distributed cluster is scaled out or in, or when hot spot scheduling is performed.

As an alternative embodiment, fig. 3 is a system architecture diagram of a distributed HTAP database provided in an embodiment of the present application, and as shown in fig. 3, the distributed HTAP database is composed of three parts:

SQL layer: responsible for receiving Structured Query Language (SQL) query requests from users, generating and optimizing SQL execution plans, and completing some simple calculations;

scheduling layer: responsible for deciding how data is stored in a distributed manner, for example, how the data of a table is divided into a plurality of shares and stored in a plurality of nodes;

storage/computation layer: responsible for storing data and performing the related query computations.

The redistribution index can be established in the SQL layer, and query optimization can also be performed, wherein the query optimization specifically comprises the following steps:

For a single-table aggregation operation, the query optimizer checks whether the grouping column of the aggregation contains the index column of a redistribution index of the table. If it does, the query optimizer converts the scan of the table into a scan of the corresponding redistribution index and eliminates the data exchange operator in the original execution plan, so that the execution plan only needs to perform single-machine aggregation within the storage/computation nodes of the storage/computation layer. If it does not, the query optimizer generates the corresponding distributed aggregation execution plan according to conventional algorithms.

When the multi-table association operation involves two associated data tables: if each of the two tables participating in the association has a redistribution index whose index column is contained in the associated columns, and the two redistribution indexes belong to the same distribution group, the scans of the two tables are changed into scans of the corresponding redistribution indexes and the data exchange operator in the original execution plan is eliminated. The execution plan then only needs to perform single-machine association within each storage/computation node of the storage/computation layer, because all redistribution indexes in the same distribution group have the same data distribution, so data with the same index column value is located in the same database node of the distributed database;

if one of the tables participating in the association contains a certain redistribution index, so that the association column contains an index column of the redistribution index, the scanning of the table is changed into the scanning of the redistribution index, and a cross-node data exchange operator corresponding to the table in the original execution plan is eliminated, so that the execution plan is changed from the distributed association needing data exchange of two tables into the distributed association needing data exchange of only one table;

if neither table participating in the association has an associated redistribution index, the execution plan for the distributed association is generated according to conventional algorithms.

In the scheduling layer, since the redistribution index value is a hash value whose range is an integer range, the value range can be divided into a plurality of adjacent ranges that are scheduled evenly across all database nodes. When the distributed database is expanded or hot spot scheduling is performed, the value range can be divided again according to the corresponding mechanism; the different redistribution indexes of the same distribution group must use the same value range division.
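The range division can be sketched in Python as follows. The use of `zlib.crc32` as the hash function, the 32-bit hash space, and the even split are assumptions for this example; the application does not name a hash function or a splitting policy. The function names are hypothetical.

```python
import bisect
import zlib

HASH_SPACE = 2 ** 32  # crc32 yields values in [0, 2**32)


def split_ranges(node_names):
    """Divide the hash space into adjacent ranges, one per node.

    Returns the exclusive upper bound of each range, in node order.
    """
    n = len(node_names)
    return [HASH_SPACE * (i + 1) // n for i in range(n)]


def node_for(index_value, node_names, upper_bounds):
    h = zlib.crc32(str(index_value).encode("utf-8"))
    # Pick the first range whose exclusive upper bound is greater than h.
    return node_names[bisect.bisect_right(upper_bounds, h)]


nodes = ["DB1", "DB2"]
bounds = split_ranges(nodes)

# Every index in the same distribution group uses the SAME bounds,
# so equal index values land on the same node.
placement = {name: node_for("a", nodes, bounds) for name in ("A'", "B'")}

# After expansion the value range is divided again for the whole group.
expanded_nodes = ["DB1", "DB2", "DB3"]
expanded_bounds = split_ranges(expanded_nodes)
```

Sharing one `bounds` list per distribution group is what keeps buckets with the same index value together; re-splitting per index would break that guarantee.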

In the embodiment of the application, redistribution of data is realized as an index in the distributed HTAP database, and on the premise of maintaining the distribution of the data table according to the main key to efficiently service the OLTP request, the support of data redistribution is realized, so that the redistribution index can be consistent with the data table, the OLAP request can be optimized by using the same distribution, inefficient cross-node data exchange is greatly reduced, and the processing performance of OLAP query is greatly improved.

Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus includes:

a target data table determining module 301, configured to receive a data query request, and determine a target data table corresponding to the data query request;

a target index determining module 302, configured to determine a target index corresponding to the target data table from the plurality of redistribution indexes; wherein the data in the redistribution index is stored in a distributed database based on an index column distribution of the redistribution index;

an optimization module 303, configured to, when the operation for the data query request points to the index column of the target index, optimize an original execution plan according to the target index, and generate a target execution plan; the operation aiming at the data query request comprises a single-table aggregation operation and/or a multi-table association operation;

and the execution module 304 is configured to run the target execution plan to obtain a query result corresponding to the data query request.

As an optional embodiment, the apparatus further comprises a redistribution index creation module for creating a redistribution index for the data stream

Taking at least one column in a data table to be queried as an index column;

As an alternative embodiment, the data buckets with the same index value in the same distribution group in the device are stored in the same database node; and the data buckets with the same index values in the same distribution group are migrated as a whole when data scheduling occurs.

the optimization module is used for determining an aggregation data table participating in the single-table aggregation operation and determining a first index corresponding to the aggregation data table from the redistribution indexes;

As an alternative embodiment, the operation for the data query request includes a multi-table association operation;

the optimization module is used for determining at least two associated data tables participating in the multi-table association operation and determining a second index corresponding to the associated data tables from the plurality of redistribution indexes;

As an alternative embodiment, the optimization module is further configured to:

As an alternative embodiment, the apparatus further comprises:

an update module to maintain the plurality of redistribution indexes unchanged when data scheduling occurs in the distributed database;

The apparatus in the embodiment of the present application may execute the method provided in the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus in the embodiments of the present application correspond to the steps in the method in the embodiments of the present application, and for the detailed functional description of the modules in the apparatus, reference may be made to the description in the corresponding method shown in the foregoing, and details are not repeated here.

In an embodiment of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored in the memory, where the processor executes the computer program to implement the steps of the data processing method. Compared with the related art, the following can be achieved: by establishing the target indexes of the target data table in advance and distributing the data in the target indexes in the distributed database according to the index columns, query optimization can be realized when a single-table aggregation operation and/or multi-table association operation for the data query request points to the index columns of the target indexes. The original execution plan is optimized according to the target index to obtain the target execution plan, which reduces cross-node data exchange operations in the target execution plan, improves data processing efficiency, and improves the performance of the whole distributed database.

In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 comprises a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.

The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.

Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but that does not indicate only one bus or one type of bus.

The memory 4003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer, and is not limited herein.

The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and execution is controlled by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.

Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and corresponding contents of the foregoing method embodiments.

The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.

It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as needed, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.

The foregoing is only an optional implementation manner of a part of implementation scenarios in the present application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of the present application are also within the protection scope of the embodiments of the present application without departing from the technical idea of the present application.

Claims

1. A data processing method, comprising:

2. The data processing method of claim 1, wherein the method further comprises:

taking at least one column in a data table to be queried as an index column;

3. The data processing method according to claim 2, wherein the data buckets with the same index value in the same distribution group are stored in the same database node; and the data buckets with the same index values in the same distribution group are migrated as a whole when data scheduling occurs.

4. The data processing method of claim 1, wherein the operation on the data query request comprises a single table aggregation operation;

determining an aggregate data table participating in the single-table aggregation operation, and determining a first index corresponding to the aggregate data table from the plurality of redistribution indexes;

5. The data processing method of claim 1, wherein the operation on the data query request comprises a multi-table association operation;

6. The data processing method of claim 5, wherein the method further comprises:

if at least one first associated data table meeting preset conditions exists and at least one second associated data table not meeting the preset conditions exists, scanning a redistribution index corresponding to the first associated data table, eliminating a cross-node data exchange operator corresponding to the first associated data table, scanning the second associated data table, and reserving the cross-node data exchange operator corresponding to the second associated data table;

7. The data processing method according to any one of claims 1 to 6, characterized in that the method further comprises:

8. A data processing apparatus, characterized by comprising:

the target data table determining module is used for receiving the data query request and determining a target data table corresponding to the data query request;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.