CN115563103B - Multi-dimensional aggregation method, system, electronic equipment and storage medium

Info

Publication number: CN115563103B
Application number: CN202211121862.0A
Authority: CN
Inventors: 王帅
Original assignee: Henan Xinghuan Zhongzhi Information Technology Co ltd; Transwarp Technology Shanghai Co Ltd
Current assignee: Henan Xinghuan Zhongzhi Information Technology Co ltd; Transwarp Technology Shanghai Co Ltd
Priority date: 2022-09-15
Filing date: 2022-09-15
Publication date: 2023-12-08
Anticipated expiration: 2042-09-15
Also published as: CN115563103A

Abstract

The invention discloses a multi-dimensional aggregation method, a multi-dimensional aggregation system, electronic equipment and a storage medium. The method comprises the following steps: obtaining, by a scheduling component, a plurality of data blocks from a downstream operator; performing single-dimensional aggregation calculation on the data blocks through a plurality of executor operators in an aggregation work task component to obtain a plurality of single-dimensional column data, and then persisting the data in the data blocks to disk; generating, through an aggregation index component, secondary indexes corresponding to the plurality of single-dimensional column data, wherein one single-dimensional column data generates a plurality of secondary indexes; and merging aggregation tables through the plurality of executor operators, based on the secondary indexes and binary search, to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table. Through the interplay of data blocking and secondary indexing, the method provides real-time multi-dimensional aggregation that optimally balances memory usage against central processing unit (CPU) usage.

Description

Multi-dimensional aggregation method, system, electronic equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of multi-dimensional aggregation, in particular to a multi-dimensional aggregation method, a multi-dimensional aggregation system, electronic equipment and a storage medium.

Background

Multi-dimensional aggregation analysis is a common business intelligence (BI) requirement in enterprises: multiple business metrics are combined across dimensions to analyze how business data performs under different business dimensions. Multi-dimensional aggregation algorithms are mainly implemented in the following ways:

Mode one (Kylin): results are computed in advance while the database is idle and stored in a temporary table; when a user initiates a multi-dimensional aggregation query, the result is read directly from the temporary table.

Mode two (Druid): the underlying data structures and basic interfaces of the database are designed specifically for multi-dimensional aggregation; data is pre-aggregated and persisted when the raw data is ingested, and the previously pre-aggregated data is summarized when a user initiates a multi-dimensional aggregation query.

Mode three: the data in the table is fully sorted, and stream aggregation is then performed over multiple dimensions.

The three modes have the following drawbacks. In mode one, the data is not real-time. Mode two is quite restrictive: on the one hand, pre-aggregation makes data insertion latency relatively high; on the other hand, the aggregations the business needs must be determined in advance. In mode three, fully sorting a business analysis table is very costly in a distributed online transaction processing (OLTP) scenario.

Disclosure of Invention

The invention provides a multi-dimensional aggregation method, a multi-dimensional aggregation system, electronic equipment and a storage medium, which are used for solving the problems of existing multi-dimensional aggregation algorithms: non-real-time data, significant restrictions and high cost.

According to an aspect of the present invention, there is provided a multi-dimensional aggregation method including:

obtaining, by a scheduling component, a plurality of data blocks from a downstream operator;

performing single-dimensional aggregation calculation on the data blocks through a plurality of executor operators in the aggregation work task assembly to obtain a plurality of single-dimensional column data, and then persisting the data in the data blocks;

generating secondary indexes corresponding to the plurality of single-dimensional column data through an aggregation index component, wherein one single-dimensional column data generates a plurality of secondary indexes;

and merging aggregation tables through the plurality of executor operators, based on the secondary indexes and binary search, to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.

According to another aspect of the present invention, there is provided a multi-dimensional aggregation system, including a scheduling component, an aggregate task component, and an aggregate index component, where the aggregate task component is connected to the scheduling component and the aggregate index component, respectively;

The scheduling component is used for acquiring a plurality of data blocks from a downstream operator;

the aggregation work task assembly is used for performing single-dimensional aggregation calculation on the plurality of data blocks through a plurality of executor operators to obtain a plurality of single-dimensional column data, and then persisting the data in the plurality of data blocks;

the aggregation index component is used for generating secondary indexes corresponding to the plurality of single-dimensional column data, and generating a plurality of secondary indexes by one single-dimensional column data;

the aggregation work task assembly is further used for merging aggregation tables through the plurality of executor operators, based on the secondary indexes and binary search, to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.

According to another aspect of the present invention, there is provided an electronic apparatus including: at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the multi-dimensional aggregation method of any one of the embodiments of the present invention.

According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute the multi-dimensional aggregation method according to any one of the embodiments of the present invention.

According to the technical scheme of the embodiments of the invention, a plurality of data blocks is obtained from a downstream operator through a scheduling component; single-dimensional aggregation calculation is performed on the data blocks through a plurality of executor operators in the aggregation work task assembly to obtain a plurality of single-dimensional column data, and the data in the data blocks is then persisted; secondary indexes corresponding to the plurality of single-dimensional column data are generated through an aggregation index component, one single-dimensional column data generating a plurality of secondary indexes; and aggregation tables are merged through the plurality of executor operators, based on the secondary indexes and binary search, to obtain a plurality of target aggregation tables, where an aggregation table is a combination of a series of secondary indexes and the secondary indexes generated by one single-dimensional column data form one aggregation table. This solves the above problems of the prior art and yields a real-time multi-dimensional aggregation method that optimally balances memory and CPU usage.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a multi-dimensional aggregation method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a portion of a multi-dimensional aggregation method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a secondary index in a multi-dimensional aggregation method according to an embodiment of the present invention;

FIG. 4 is a first example flow of aggregation table merging in a multi-dimensional aggregation method according to an embodiment of the present invention;

FIG. 5 is a second example flow of aggregation table merging in a multi-dimensional aggregation method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a multi-dimensional aggregation system according to a second embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a multi-dimensional aggregation system according to a third embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device implementing a multi-dimensional aggregation method in an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention. It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit the illustrated steps. The scope of the invention is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the modifiers "a", "an" and "a plurality of" in this disclosure are illustrative rather than limiting, and those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.

Example 1

Fig. 1 is a flow chart of a multi-dimensional aggregation method according to a first embodiment of the present invention. The method is applicable to multi-dimensional aggregation analysis of a distributed OLTP service and may be performed by a multi-dimensional aggregation device. The device may be implemented in software and/or hardware and is generally integrated on an electronic device; in this embodiment, the electronic device includes, but is not limited to, a computer device.

As shown in fig. 1, a multi-dimensional aggregation method provided in a first embodiment of the present invention includes the following steps:

S110, acquiring a plurality of data blocks from a downstream operator through a scheduling component.

The scheduling component may be a software component, Dispatcher, with a data scheduling function, and there may be one scheduling component.

In this embodiment, the plurality of data blocks may be obtained by dividing the data evenly; the type and amount of the data are not specifically limited here. The data may include numbers, letters, and combinations of numbers and letters. The number of data blocks is likewise not limited; illustratively, the downstream operator may divide the data evenly into 3 data blocks. Each data block may include a plurality of columns, each column storing a plurality of data items.

In this embodiment, the process of the scheduling component obtaining the plurality of data blocks from the downstream operator is not particularly limited, and the downstream operator may send the plurality of data blocks to the scheduling component, or the scheduling component may take the plurality of data blocks from the downstream operator.

Illustratively, the scheduling component sends Next() to the downstream operator, the downstream operator divides the data evenly into a plurality of data blocks in response to Next(), and the scheduling component can then take the plurality of data blocks from the downstream operator.
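The pull-style interaction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `DownstreamOperator` and `Dispatcher`, the method names `next` and `fetch_all`, and the sample rows are all assumptions made for the example, and the operator here splits its rows up front for simplicity rather than on each Next() call.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, int, str]   # one row with columns (a, b, c)
Chunk = List[Row]            # one data block


class DownstreamOperator:
    """Hypothetical operator that hands out its rows in evenly sized chunks."""

    def __init__(self, rows: List[Row], num_chunks: int) -> None:
        size = max(1, -(-len(rows) // num_chunks))  # ceiling division
        self._chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
        self._pos = 0

    def next(self) -> Optional[Chunk]:
        """Return the next data block, or None when all blocks are consumed."""
        if self._pos >= len(self._chunks):
            return None
        chunk = self._chunks[self._pos]
        self._pos += 1
        return chunk


class Dispatcher:
    """Hypothetical scheduling component that pulls blocks via next()."""

    def __init__(self, operator: DownstreamOperator) -> None:
        self._operator = operator

    def fetch_all(self) -> List[Chunk]:
        chunks: List[Chunk] = []
        while True:
            chunk = self._operator.next()
            if chunk is None:
                break
            chunks.append(chunk)
        return chunks


# Assumed sample rows; only the middle two rows are stated in the text.
ROWS: List[Row] = [
    (1, 123, "x"), (2, 123, "x"),
    (2, 456, "y"), (3, 789, "z"),
    (3, 789, "z"), (3, 789, "z"),
]
chunks = Dispatcher(DownstreamOperator(ROWS, 3)).fetch_all()
```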

S120, performing single-dimensional aggregation calculation on the data blocks through a plurality of executor operators in the aggregation work task assembly to obtain a plurality of single-dimensional column data, and then, persisting the data in the data blocks.

The aggregation work task component may be a software component, the Grouping Worker scheduler, for executing aggregation tasks, and there may be one aggregation work task component. It may include a plurality of executor operators whose parameters are configurable, and an executor operator can be understood as a Grouping Worker.

In this embodiment, the aggregation work task component may be responsible for scheduling the overall aggregation computation. It may place the acquired data blocks into the aggregation work queue (Grouping Worker Queue), and the executor operators may perform single-dimensional aggregation calculation on the data blocks according to the directed acyclic graph generated by the aggregation work task component. Note that one single-dimensional aggregation calculation cannot be performed on several data blocks at the same time.

The Grouping Worker Queue may include, among others, an aggregation data queue (Grouping Data Queue) and an aggregation table queue (Grouping Map Queue). Specifically, the acquired data blocks (chunks) may be placed in the Grouping Data Queue of the Grouping Worker Queue.

Further, performing single-dimensional aggregation calculation on the plurality of data blocks to obtain a plurality of single-dimensional column data includes: placing the data blocks into the aggregation work queue and determining a plurality of multi-dimensional combinations through the aggregation work task component, where one multi-dimensional combination is composed of the data of at least one column; for a target data block, generating a corresponding directed acyclic graph through the aggregation work task component based on the multi-dimensional combinations determined for the target data block; and traversing the plurality of data blocks through the aggregation work task component while performing single-dimensional aggregation calculation on the data corresponding to the head node of each complete link in the directed acyclic graph, to obtain a plurality of single-dimensional column data.

A multi-dimensional combination is a combination formed by data of multiple dimensions; it can be understood as a cube whose sides each correspond to a different dimension. In this embodiment, a multi-dimensional combination can be understood as a combination formed by the data of several columns. Each data block may correspond to several different multi-dimensional combinations; for example, one multi-dimensional combination may include 2 columns of data and another may include 3 columns of data.

By way of example, a data block may include three columns of data a, b, c, and the corresponding multi-dimensional combinations of the data block may be (a, b, c), (a, c), and (c).

In this embodiment, single-dimensional column data is obtained from every data block in the same way, so one data block is taken as an example below. The aggregation work task component may determine a plurality of multi-dimensional combinations based on the target data block; a corresponding directed acyclic graph can be generated from the multi-dimensional combinations (the specific generation process is not described here); the aggregation work task component may dispatch the aggregation work to the corresponding elements of the directed acyclic graph, each being treated as a special group; and, while traversing the data blocks, the aggregation work task component may perform single-dimensional aggregation calculation on the data of the head node of each complete link in the directed acyclic graph across all data blocks, so as to obtain a plurality of single-dimensional column data.

The directed acyclic graph may include a plurality of complete links. Each link is formed by one or more nodes, the arrows between nodes mark the direction of data flow, and a node may represent a column.
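Since the text leaves the graph construction unspecified, the following is only one plausible sketch of how complete links could be turned into a directed acyclic graph and how the head node of each link could be read off. The function names `build_dag` and `head_nodes` and the adjacency-set representation are assumptions for illustration; the three links are the ones named in the description of fig. 2.

```python
from typing import Dict, List, Sequence, Set, Tuple

Link = Tuple[str, ...]  # a complete link, e.g. ("a", "b", "c") for a-b-c


def build_dag(links: Sequence[Link]) -> Dict[str, Set[str]]:
    """Map each column (node) to the set of columns its arrows point to."""
    edges: Dict[str, Set[str]] = {}
    for link in links:
        for src, dst in zip(link, link[1:]):
            edges.setdefault(src, set()).add(dst)
        edges.setdefault(link[-1], set())  # last node has no outgoing arrow
    return edges


def head_nodes(links: Sequence[Link]) -> List[str]:
    """Return the first node of every complete link, without duplicates."""
    heads: List[str] = []
    for link in links:
        if link[0] not in heads:
            heads.append(link[0])
    return heads


LINKS: List[Link] = [("a", "b", "c"), ("b", "c"), ("c",)]
dag = build_dag(LINKS)
heads = head_nodes(LINKS)
```

Only the head nodes need a pass over the raw data; every later node in a link is reached by merging aggregation tables.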

Fig. 2 is a schematic partial flow chart of a multi-dimensional aggregation method according to a first embodiment of the present invention. As shown in fig. 2, the scheduling component obtains data block 0, data block 1 and data block 2 from the downstream operator. Each of the three data blocks includes three columns a, b and c; in data block 1, column a contains 2 and 3, column b contains 456 and 789, and column c contains y and z. Data block 0, data block 1 and data block 2 are sent to the aggregation work queue, and the aggregation work task component generates a corresponding directed acyclic graph (DAG) from the data blocks in the queue. The DAG includes three complete links, a-b-c, b-c and c, whose head nodes are column a, column b and column c respectively. The data blocks are then traversed from the aggregation work queue: the data of column a, column b and column c in data block 0, data block 1 and data block 2 is traversed and single-dimensional aggregation calculation is performed, yielding three single-dimensional column data, namely group by a, group by b and group by c. Group by a includes data 1, 2 and 3; group by b includes data 123, 456 and 789; group by c includes data x, y and z.
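A minimal sketch of the single-dimensional aggregation over the three data blocks might look like this. The contents of data block 1 follow the description; the contents of data blocks 0 and 2 are assumptions chosen to be consistent with the group-by results stated above, and the function name `single_dim_aggregate` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

COLUMNS = ("a", "b", "c")
Row = Tuple[int, int, str]

# Data block 1 is taken from the text; blocks 0 and 2 are assumed values.
CHUNKS: List[List[Row]] = [
    [(1, 123, "x"), (2, 123, "x")],
    [(2, 456, "y"), (3, 789, "z")],
    [(3, 789, "z"), (3, 789, "z")],
]


def single_dim_aggregate(
    chunks: Sequence[Sequence[Row]], heads: Sequence[str]
) -> Dict[str, list]:
    """Collect the distinct values of each head column, one chunk at a time."""
    groups: Dict[str, dict] = {h: {} for h in heads}  # dict as ordered set
    for chunk in chunks:                              # chunks are not mixed
        for row in chunk:
            for h in heads:
                groups[h].setdefault(row[COLUMNS.index(h)], None)
    return {h: list(values) for h, values in groups.items()}


groups = single_dim_aggregate(CHUNKS, ["a", "b", "c"])
```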

In this embodiment, persisting data can be understood as saving the data to a local disk. The data in a data block that has completed single-dimensional aggregation calculation is staged on disk, so the actual data is not needed in subsequent calculation and memory consumption is greatly reduced.
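As a rough illustration of this idea, a processed data block can be written to a local file and individual rows read back only when a final result needs them. The file layout, the use of `pickle`, and the function names `persist_chunk` and `load_row` are assumptions for the sketch; the text does not specify a storage format.

```python
import os
import pickle
import tempfile
from typing import List, Tuple

Row = Tuple[int, int, str]


def persist_chunk(chunk: List[Row], blk_idx: int, directory: str) -> str:
    """Write one data block to disk and return the path of the file."""
    path = os.path.join(directory, f"chunk_{blk_idx}.pkl")
    with open(path, "wb") as handle:
        pickle.dump(chunk, handle)
    return path


def load_row(path: str, row_idx: int) -> Row:
    """Read a single row back from a persisted data block."""
    with open(path, "rb") as handle:
        return pickle.load(handle)[row_idx]
```

After `persist_chunk` returns, the in-memory chunk can be released; later stages work on (data block index, row index) pairs rather than on the rows themselves.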

S130, generating secondary indexes corresponding to the plurality of single-dimensional column data through an aggregation index component, wherein one single-dimensional column data generates a plurality of secondary indexes.

The aggregation index component may be a software component that provides secondary indexing, and there may be multiple aggregation index components. The aggregation index component may implement a secondary index through key-value pairs, with the data block index as the key and the row indexes as the value. From the secondary index, the data block and row in which each data item of the single-dimensional column data is located can be determined.

Further, one single-dimensional column data generates a plurality of secondary indexes, and each secondary index has a corresponding key-value pair: the key identifies the single-dimensional column data, and the value is one data item in that single-dimensional column data. A secondary index includes a data block index, identifying the data block in which the data item is located, and a row index, identifying the row in which the data item is located.

For example, fig. 3 is a diagram illustrating secondary indexes in a multi-dimensional aggregation method according to a first embodiment of the present invention. As shown in fig. 3, taking single-dimensional column data a as an example, three secondary indexes may be generated from it. In the key-value pair Key_A:1 of the first secondary index, Key_A represents single-dimensional column data a and 1 represents the data item 1 in it; blk_idx is 0 and row_idx is 0, meaning that data item 1 is in row 0 of data block 0. In the key-value pair Key_A:2 of the second secondary index, Key_A represents single-dimensional column data a and 2 represents the data item 2; according to blk_idx and row_idx in the second secondary index, single-dimensional column data a contains two occurrences of 2, one in row 1 of data block 0 and the other in row 0 of data block 1. In the key-value pair Key_A:3 of the third secondary index, Key_A represents single-dimensional column data a and 3 represents the data item 3; according to blk_idx and row_idx in the third secondary index, single-dimensional column data a contains three occurrences of 3, the first in row 1 of data block 1, the second in row 0 of data block 2, and the third in row 1 of data block 2.
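The fig. 3 example can be reproduced with a small sketch. The representation chosen here, a dictionary from each distinct value to a mapping of data block index to sorted row indexes, is an assumption for illustration, as is the function name `build_aggregation_table`; the column values per data block are those implied by the positions listed above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

# One secondary index: data block index (blk_idx) -> sorted row indexes.
SecondaryIndex = Dict[int, List[int]]
# One aggregation table: distinct value -> its secondary index.
AggregationTable = Dict[Hashable, SecondaryIndex]


def build_aggregation_table(
    column_chunks: Sequence[Sequence[Hashable]],
) -> AggregationTable:
    """Build the secondary indexes of one column from its per-block values."""
    table: Dict[Hashable, Dict[int, List[int]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for blk_idx, values in enumerate(column_chunks):
        for row_idx, value in enumerate(values):
            # Rows are visited in ascending order, so each list stays sorted.
            table[value][blk_idx].append(row_idx)
    return {value: dict(index) for value, index in table.items()}


# Column a in data blocks 0, 1 and 2.
COLUMN_A = [[1, 2], [2, 3], [3, 3]]
table_a = build_aggregation_table(COLUMN_A)
```

Keeping the row indexes sorted per data block is what later makes binary search over them possible.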

S140, merging aggregation tables through the plurality of executor operators, based on the secondary indexes and binary search, to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.

In this embodiment, all computation may be performed by the executor operators, including: computing the group-by expression (grouping by exprs) values to generate aggregation tables; merging the aggregation tables; and, once the aggregation tables need no further merging, calculating and outputting the aggregation result.

Wherein the aggregation table can be understood as a hash table, and the aggregation table combination can be understood as multi-dimensional aggregation.

In this embodiment, following the data flow direction in each complete link of the directed acyclic graph, the executor operators may merge the aggregation tables formed by the secondary indexes generated from each single-dimensional column data in that link, obtaining a plurality of target aggregation tables, one per complete link. If a complete link includes multiple nodes, multi-dimensional aggregation may be performed several times to obtain its target aggregation table.

Further, merging the aggregation tables based on the secondary indexes and binary search to obtain a plurality of target aggregation tables includes: merging the aggregation tables of the nodes in each complete link of the directed acyclic graph based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, where each node is composed of single-dimensional column data.

Specifically, for a complete link, a merging order is determined according to the data flow direction between the nodes in the complete link; following the merging order, the first aggregation table corresponding to the first node and the second aggregation table corresponding to the second node in the complete link are merged, based on the secondary indexes and binary search, to obtain an initial target aggregation table; and the initial target aggregation table is then merged, again based on the secondary indexes and binary search, with the third aggregation table corresponding to the third node, and so on until the aggregation tables of all nodes have been merged to obtain the target aggregation table.

For example, if one complete link in the directed acyclic graph is a-b-c, the node a and the node b need to be combined to obtain an initial target aggregation table, and then the initial target aggregation table and the node c need to be combined to obtain the target aggregation table. Node a may be understood as a node consisting of single-dimensional a-column data, node b may be understood as a node consisting of single-dimensional b-column data, and node c may be understood as a node consisting of single-dimensional c-column data.
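The left-to-right order of merging along a link can be sketched as a fold over the per-node tables. To keep the focus on the ordering, this sketch deliberately swaps in plain set intersection of (blk_idx, row_idx) pairs for the pairwise merge instead of the filter-plus-binary-search procedure; the function names, the tuple-keyed tables and the sample values for columns b and c are assumptions.

```python
from functools import reduce
from typing import Dict, List, Set, Tuple

Position = Tuple[int, int]                  # (blk_idx, row_idx)
Table = Dict[Tuple, Set[Position]]          # group key -> row positions


def merge_tables(left: Table, right: Table) -> Table:
    """Simplified pairwise merge: keep key pairs that share at least one row."""
    merged: Table = {}
    for left_key, left_rows in left.items():
        for right_key, right_rows in right.items():
            common = left_rows & right_rows
            if common:
                merged[left_key + right_key] = common
    return merged


def merge_link(tables: List[Table]) -> Table:
    """Merge node tables in link order: ((a merged with b) merged with c)."""
    return reduce(merge_tables, tables)


# Assumed per-node tables for the link a-b-c.
node_a: Table = {(1,): {(0, 0)}, (2,): {(0, 1), (1, 0)},
                 (3,): {(1, 1), (2, 0), (2, 1)}}
node_b: Table = {(123,): {(0, 0), (0, 1)}, (456,): {(1, 0)},
                 (789,): {(1, 1), (2, 0), (2, 1)}}
node_c: Table = {("x",): {(0, 0), (0, 1)}, ("y",): {(1, 0)},
                 ("z",): {(1, 1), (2, 0), (2, 1)}}

target = merge_link([node_a, node_b, node_c])
```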

Further, merging the first aggregation table corresponding to the first node with the second aggregation table corresponding to the second node, based on the secondary indexes and binary search, to obtain an initial target aggregation table includes: determining, according to the data block indexes in the secondary indexes of the first aggregation table corresponding to the first node, the secondary indexes of the second node that need not be merged, and filtering them out; when searching the row indexes, using binary search, and determining a plurality of reference secondary indexes, where, of a secondary index in the first aggregation table and a secondary index in the second aggregation table, the reference secondary index is the one with the smaller array; determining the detection secondary index, which is the one of the two with the larger array; traversing the reference secondary index and finding, in the detection secondary index, the rows shared with the reference secondary index as common rows; merging the common rows into one secondary index to obtain a merged secondary index; and combining the resulting merged secondary indexes into the initial target aggregation table.

Determining, according to the data block indexes in the secondary indexes of the first aggregation table corresponding to the first node, the secondary indexes of the second node that need not be merged includes: for one secondary index in the first aggregation table, taking the data block indexes in that secondary index as target indexes, and determining every secondary index in the second aggregation table that includes none of the target indexes as a secondary index that need not be merged.
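The block-index filter amounts to a set intersection test on data block indexes. A minimal sketch follows, assuming a secondary index is stored as a mapping from data block index to sorted row indexes; the function name `merge_candidates` and the sample table for column b are illustrative assumptions.

```python
from typing import Dict, Hashable, List

SecondaryIndex = Dict[int, List[int]]          # blk_idx -> sorted row_idx list
AggregationTable = Dict[Hashable, SecondaryIndex]


def merge_candidates(
    first_index: SecondaryIndex, second_table: AggregationTable
) -> AggregationTable:
    """Keep only secondary indexes that share a data block with first_index."""
    target_blocks = set(first_index)
    return {
        key: index
        for key, index in second_table.items()
        if target_blocks & set(index)   # no shared block: need not be merged
    }


# Assumed sample data: Key_A:1, Key_A:2 and the aggregation table of column b.
key_a_1: SecondaryIndex = {0: [0]}
key_a_2: SecondaryIndex = {0: [1], 1: [0]}
table_b: AggregationTable = {
    123: {0: [0, 1]},
    456: {1: [0]},
    789: {1: [1], 2: [0, 1]},
}
```

The filter only inspects block indexes, so it is cheap compared with intersecting row lists, and it removes whole secondary indexes before any row-level search is attempted.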

The array size can be understood as the number of data items covered by a secondary index. For example, suppose the first aggregation table includes secondary indexes A, B and C and the second aggregation table includes secondary indexes a, b and c. For secondary index A, if A covers 2 data items and a covers 1 data item, the array of A is the larger one; secondary index a may then be used as the reference secondary index and secondary index A as the detection secondary index. The rows that A shares with a are the common rows, and merging the common rows into one secondary index yields a merged secondary index. Merged secondary indexes are obtained in the same way for secondary index B and for secondary index C, and the 3 merged secondary indexes so obtained may be combined into the initial target aggregation table.

It will be appreciated that if the array sizes of the two secondary indexes are equal, either of the two secondary indexes may be used as the reference secondary index, and the other as the detection secondary index.
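The choice of reference and detection index and the row lookup can be sketched with the standard `bisect` module. This is an illustrative reading of the procedure, not the patented code: the data layout (data block index mapped to a sorted row list), the tie-breaking rule, and the names `array_size`, `contains` and `common_rows` are assumptions.

```python
from bisect import bisect_left
from typing import Dict, List

SecondaryIndex = Dict[int, List[int]]   # blk_idx -> sorted row_idx list


def array_size(index: SecondaryIndex) -> int:
    """Number of data items (rows) covered by a secondary index."""
    return sum(len(rows) for rows in index.values())


def contains(sorted_rows: List[int], row: int) -> bool:
    """Binary search for row in a sorted list of row indexes."""
    pos = bisect_left(sorted_rows, row)
    return pos < len(sorted_rows) and sorted_rows[pos] == row


def common_rows(x: SecondaryIndex, y: SecondaryIndex) -> SecondaryIndex:
    """Traverse the smaller (reference) index and probe the larger one."""
    # On a tie either choice is allowed; this sketch picks x as reference.
    reference, detection = (x, y) if array_size(x) <= array_size(y) else (y, x)
    merged: SecondaryIndex = {}
    for blk_idx, rows in reference.items():
        detection_rows = detection.get(blk_idx)
        if detection_rows is None:
            continue
        hits = [row for row in rows if contains(detection_rows, row)]
        if hits:
            merged[blk_idx] = hits
    return merged
```

Traversing the smaller index and binary-searching the larger one costs roughly m log n lookups for sizes m and n with m no larger than n, which is why the smaller array is chosen as the reference.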

For example, fig. 4 is a first schematic flow diagram of aggregation table merging in a multi-dimensional aggregation method according to the first embodiment of the present invention. As shown in fig. 4, node A and node B are merged. Because blk_idx in the secondary index Key_A:1 is 0, those of the three secondary indexes Key_B:123, Key_B:456 and Key_B:789 whose blk_idx does not include 0, namely Key_B:456 and Key_B:789 (the dashed arrows in the figure), are filtered out, and Key_A:1 does not need to be merged with them. When Key_A:1 and Key_B:123 are merged, Key_A:1 corresponds to 1 data item and Key_B:123 corresponds to 2 data items, so Key_A:1 is used as the reference secondary index and Key_B:123 as the detection secondary index; the two share row 0 in data block 0, so in the merged secondary index Key_A_B[1][123], blk_idx is 0 and row_idx is 0.

Because blk_idx in the secondary index Key_A:2 is 0 and 1, while blk_idx in Key_B:123 includes 0, blk_idx in Key_B:456 includes 1, and blk_idx in Key_B:789 includes 1, no secondary index needs to be filtered out, and Key_A:2 must be merged with every secondary index in node B. When Key_A:2 and Key_B:123 are merged, both correspond to 2 data items, so either one can be used as the reference secondary index and the other as the detection secondary index; the two share row 1 in data block 0, so in the merged secondary index Key_A_B[2][123], blk_idx is 0 and row_idx is 1. When Key_A:2 and Key_B:456 are merged, Key_A:2 corresponds to 2 data items and Key_B:456 corresponds to 1 data item, so Key_B:456 is used as the reference secondary index and Key_A:2 as the detection secondary index; the two share row 0 in data block 1, so in the merged secondary index Key_A_B[2][456], blk_idx is 1 and row_idx is 0.

All secondary indexes are merged in the manner described above, and the merging of the remaining secondary indexes is not described here. The final initial target aggregation table consists of merged secondary indexes such as Key_A_B[1][123], Key_A_B[2][456] and Key_A_B[3][789].
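The merge of node A and node B can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the four-row data set, the dictionary layout, and the names `shared_rows` and `merge_tables` are assumptions chosen to resemble the fig. 4 walk-through. Each secondary index is modelled as a key mapped to a sorted list of hypothetical `(blk_idx, row_idx)` positions.

```python
from bisect import bisect_left

# Hypothetical data resembling fig. 4 (an assumption, not taken from the figure):
# block 0 holds rows (A=1, B=123) and (A=2, B=123);
# block 1 holds rows (A=2, B=456) and (A=3, B=789).
table_a = {1: [(0, 0)], 2: [(0, 1), (1, 0)], 3: [(1, 1)]}
table_b = {123: [(0, 0), (0, 1)], 456: [(1, 0)], 789: [(1, 1)]}


def shared_rows(left, right):
    """Return the positions present in both sorted position lists."""
    # The shorter list plays the reference index; the longer one is probed.
    reference, probe = (left, right) if len(left) <= len(right) else (right, left)
    shared = []
    for pos in reference:
        i = bisect_left(probe, pos)  # binary search in the probed index
        if i < len(probe) and probe[i] == pos:
            shared.append(pos)
    return shared


def merge_tables(first, second):
    """Merge two aggregation tables into one keyed by (key_first, key_second)."""
    merged = {}
    for key_1, rows_1 in first.items():
        blocks_1 = {blk for blk, _ in rows_1}
        for key_2, rows_2 in second.items():
            # Filter step: skip indexes that share no data block with key_1.
            if blocks_1.isdisjoint(blk for blk, _ in rows_2):
                continue
            rows = shared_rows(rows_1, rows_2)
            if rows:
                merged[(key_1, key_2)] = rows
    return merged


key_a_b = merge_tables(table_a, table_b)
```

With this sample data, the block filter skips Key_B:456 and Key_B:789 for Key_A:1 before any row lookup happens, which is the saving the dashed arrows in fig. 4 represent.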

In this embodiment, the process of merging the initial target aggregation table with the third aggregation table corresponding to the third node based on the secondary indexes and binary search is the same as the process described above and is not repeated here.

Fig. 5 is a second schematic flow diagram of aggregation table merging in a multi-dimensional aggregation method according to the first embodiment of the present invention. Fig. 5 illustrates the process of merging the initial target aggregation table, i.e., A-B, with node C; for the specific merging manner, refer to the explanation of fig. 4, which is not repeated here.

The first embodiment of the invention provides a multi-dimensional aggregation method. First, a plurality of data blocks are obtained from a downstream operator through a scheduling component. Then, single-dimensional aggregation calculation is performed on the data blocks through a plurality of executor operators in the aggregation work task component to obtain a plurality of single-dimensional column data, after which the data in the data blocks is persisted to disk. Secondary indexes corresponding to the plurality of single-dimensional column data are generated through an aggregation index component, wherein one single-dimensional column data generates a plurality of secondary indexes. Finally, aggregation table merging is performed through the plurality of executor operators based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table. The method relies on the interplay between data blocks and secondary indexes: splitting the data into blocks greatly reduces the memory used by multi-dimensional aggregation, and the blocks can be persisted to disk promptly so that they occupy no additional memory; the secondary indexes filter data efficiently while aggregated data is merged, which improves performance and achieves an optimal balance between memory and CPU usage.

Further, the multi-dimensional aggregation method provided by the embodiment of the invention further comprises: retrieving, through the executor operator, the persisted data from disk according to the secondary indexes in the target aggregation table, and outputting the data to an upstream operator.

In this embodiment, after the target aggregation table is obtained, the corresponding data may be read from the local disk according to the secondary indexes in the target aggregation table, and the data read may be output to the upstream operator.
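The retrieval step can be illustrated with a minimal sketch. The on-disk layout (one JSON file per data block), the file naming, and the helper names `persist_block` and `fetch_rows` are assumptions made for illustration; the source does not specify how blocks are stored. Each secondary index entry is again assumed to be a `(blk_idx, row_idx)` pair.

```python
import json
import tempfile
from pathlib import Path


def persist_block(directory, blk_idx, rows):
    """Write one data block to disk as a JSON file (hypothetical layout)."""
    path = Path(directory) / f"block_{blk_idx}.json"
    path.write_text(json.dumps(rows))


def fetch_rows(directory, positions):
    """Read the rows named by (blk_idx, row_idx) pairs, loading each block once."""
    cache = {}
    result = []
    for blk_idx, row_idx in positions:
        if blk_idx not in cache:
            path = Path(directory) / f"block_{blk_idx}.json"
            cache[blk_idx] = json.loads(path.read_text())
        result.append(cache[blk_idx][row_idx])
    return result


with tempfile.TemporaryDirectory() as spill_dir:
    # Two small blocks stand in for the persisted data.
    persist_block(spill_dir, 0, [[1, 123], [2, 123]])
    persist_block(spill_dir, 1, [[2, 456], [3, 789]])
    # A fragment of a target aggregation table: key -> row positions.
    target_table = {(2, 123): [(0, 1)], (2, 456): [(1, 0)]}
    output = {key: fetch_rows(spill_dir, pos) for key, pos in target_table.items()}
```

Only the blocks referenced by the target aggregation table are read back, so memory holds the indexes plus whichever blocks are currently being emitted.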

Example two

Fig. 6 is a schematic structural diagram of a multi-dimensional aggregation system according to a second embodiment of the present invention, where the system is applicable to a case of performing multi-dimensional aggregation analysis on a distributed OLTP service, and is generally integrated on an electronic device as a software system.

As shown in fig. 6, the system includes a scheduling component 110, an aggregation work task component 120, and an aggregation index component 130, the aggregation work task component 120 being connected to the scheduling component 110 and to the aggregation index component 130.

A scheduling component 110 for obtaining a plurality of data blocks from a downstream operator;

the aggregation work task component 120 is configured to perform single-dimensional aggregation calculation on the plurality of data blocks through a plurality of executor operators to obtain a plurality of single-dimensional column data, and then persist the data in the plurality of data blocks to disk;

An aggregation index component 130, configured to generate secondary indexes corresponding to the plurality of single-dimensional column data, where one single-dimensional column data generates a plurality of secondary indexes;

the aggregation work task component 120 is further configured to perform, through the plurality of executor operators, aggregation table merging based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, where an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.

In this embodiment, the system first obtains a plurality of data blocks from a downstream operator through the scheduling component 110. Then, the aggregation work task component 120 performs single-dimensional aggregation calculation on the data blocks through a plurality of executor operators to obtain a plurality of single-dimensional column data, and persists the data in the data blocks to disk. The aggregation index component 130 generates secondary indexes corresponding to the plurality of single-dimensional column data, wherein one single-dimensional column data generates a plurality of secondary indexes. Finally, the plurality of executor operators in the aggregation work task component 120 perform aggregation table merging based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.
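As a rough illustration of how one single-dimensional column can yield several secondary indexes, the sketch below scans a list of data blocks and records, for every distinct value of a column, the positions where it occurs. The row format (dictionaries), the sample values, and the function name `build_secondary_indexes` are assumptions for illustration only.

```python
from collections import defaultdict


def build_secondary_indexes(blocks, column):
    """Map each distinct value of `column` to its (blk_idx, row_idx) positions."""
    index = defaultdict(list)
    for blk_idx, block in enumerate(blocks):
        for row_idx, row in enumerate(block):
            # Blocks and rows are visited in ascending order,
            # so each position list is already sorted.
            index[row[column]].append((blk_idx, row_idx))
    return dict(index)


# Hypothetical input: two data blocks of two rows each.
blocks = [
    [{"A": 1, "B": 123}, {"A": 2, "B": 123}],
    [{"A": 2, "B": 456}, {"A": 3, "B": 789}],
]
table_a = build_secondary_indexes(blocks, "A")  # aggregation table for column A
table_b = build_secondary_indexes(blocks, "B")  # aggregation table for column B
```

Here column A produces three secondary indexes (for the values 1, 2 and 3), and together they form the aggregation table for that column. Keeping each position list sorted is what later allows binary search during merging.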

This embodiment provides a multi-dimensional aggregation system that performs multi-dimensional aggregation in real time with an optimal balance between memory and the central processing unit.

Further, the aggregation work task component 120 includes a computing unit configured to: place the data blocks into an aggregation work queue, and determine a plurality of multi-dimensional combinations through the aggregation work task component, wherein one multi-dimensional combination is composed of data corresponding to at least one column; for a target data block, generate a corresponding directed acyclic graph through the aggregation work task component based on the plurality of multi-dimensional combinations determined by the target data block; and traverse the plurality of data blocks through the aggregation work task component, performing single-dimensional aggregation calculation on the data corresponding to the first node of a plurality of complete links in the directed acyclic graph to obtain a plurality of single-dimensional column data.

Further, the aggregation work task component 120 includes a merging unit configured to: perform aggregation table merging on the nodes in each complete link in the directed acyclic graph based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, wherein each node is composed of single-dimensional column data.

Further, the merging unit is specifically configured to: for one complete link, determine a merging sequence according to the data flow direction of the nodes in the complete link; according to the merging sequence, merge, based on the secondary indexes and binary search, a first aggregation table corresponding to a first node and a second aggregation table corresponding to a second node in the complete link to obtain an initial target aggregation table; and merge, based on the secondary indexes and binary search, the initial target aggregation table with a third aggregation table corresponding to a third node, continuing until the aggregation tables corresponding to all nodes are merged, to obtain the target aggregation table.
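Merging along one complete link amounts to folding the per-node aggregation tables in their merge order. The sketch below is an assumption-laden illustration, not the claimed implementation: the sample tables, the tuple-shaped merged keys, and the names `merge_pair` and `merge_link` are hypothetical, and the data-block filtering step is left out to keep the fold itself visible.

```python
from bisect import bisect_left
from functools import reduce


def merge_pair(first, second):
    """Merge a table with tuple keys (`first`) with a single-column table (`second`)."""
    merged = {}
    for key_1, rows_1 in first.items():
        for key_2, rows_2 in second.items():
            # The shorter position list is traversed; the longer one is probed.
            small, large = sorted((rows_1, rows_2), key=len)
            shared = []
            for pos in small:
                i = bisect_left(large, pos)  # binary search for the position
                if i < len(large) and large[i] == pos:
                    shared.append(pos)
            if shared:
                merged[key_1 + (key_2,)] = shared
    return merged


def merge_link(tables):
    """Fold the aggregation tables of one complete link in their merge order."""
    seed = {(key,): rows for key, rows in tables[0].items()}
    return reduce(merge_pair, tables[1:], seed)


# Hypothetical tables for nodes A, B and C over the same four rows.
table_a = {1: [(0, 0)], 2: [(0, 1), (1, 0)], 3: [(1, 1)]}
table_b = {123: [(0, 0), (0, 1)], 456: [(1, 0)], 789: [(1, 1)]}
table_c = {"x": [(0, 0), (1, 1)], "y": [(0, 1), (1, 0)]}
target = merge_link([table_a, table_b, table_c])
```

The first fold step yields the initial target aggregation table (A-B), and each later step merges that running result with the next node's table, so a link of any length needs only pairwise merges.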

Further, the system also comprises an output module configured to: retrieve the persisted data from disk according to the secondary indexes in the target aggregation table, and output the data to an upstream operator.

The multi-dimensional aggregation system can execute the multi-dimensional aggregation method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.

Example three

Fig. 7 is a schematic structural diagram of a multi-dimensional aggregation system according to a third embodiment of the present invention, where the third embodiment is provided as an example embodiment, and the multi-dimensional aggregation system may perform the multi-dimensional aggregation method according to any of the embodiments of the present invention.

As shown in fig. 7, the Dispatcher, i.e., the scheduling component, obtains data blocks from the Child Executor, i.e., the downstream operator, and places the obtained data blocks into the Grouping Worker Queue, i.e., the aggregation work queue. The Grouping Worker Scheduler, i.e., the aggregation work task component, obtains the data blocks from the Grouping Worker Queue and generates the corresponding directed acyclic graph. Any one of Worker 0, 1, 2 and 3 performs single-dimensional aggregation calculation to obtain a plurality of single-dimensional column data, and persists the data in the plurality of data blocks to disk. The Grouping Worker Queue generates the secondary indexes corresponding to the plurality of single-dimensional column data; a Worker performs aggregation table merging based on the secondary indexes and binary search to obtain aggregation tables, i.e., the plurality of target aggregation tables, and sends them to the Grouping Map Queue in the Grouping Worker Queue. The Worker then retrieves the persisted data from disk according to the secondary indexes in the aggregation tables and outputs the data to the Parent Executor, i.e., the upstream operator.
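The dispatcher, queue and worker arrangement can be pictured with a small threaded sketch. It is only a loose analogy under stated assumptions: the sentinel convention, the shared result dictionary, and the names `worker` and `run_pipeline` are hypothetical, and directed acyclic graph generation, persistence and merging are omitted.

```python
import queue
import threading

SENTINEL = None  # marks the end of the block stream (hypothetical convention)


def worker(work_queue, results, lock, column):
    """Pull (blk_idx, block) items and record row positions for one column."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        blk_idx, block = item
        for row_idx, row in enumerate(block):
            with lock:  # the result dictionary is shared between workers
                results.setdefault(row[column], []).append((blk_idx, row_idx))


def run_pipeline(blocks, column, num_workers=4):
    """Dispatch blocks to a work queue drained by `num_workers` worker threads."""
    work_queue = queue.Queue()
    results, lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work_queue, results, lock, column))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for blk_idx, block in enumerate(blocks):  # dispatcher role
        work_queue.put((blk_idx, block))
    for _ in threads:
        work_queue.put(SENTINEL)  # one stop signal per worker
    for t in threads:
        t.join()
    # Workers finish in arbitrary order, so sort positions before use.
    return {key: sorted(pos) for key, pos in results.items()}


blocks = [
    [{"A": 1, "B": 123}, {"A": 2, "B": 123}],
    [{"A": 2, "B": 456}, {"A": 3, "B": 789}],
]
index_a = run_pipeline(blocks, "A")
```

Sorting the position lists after the workers finish restores the ordering that binary search depends on, regardless of which worker processed which block.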

The multi-dimensional aggregation system provided by the third embodiment of the invention can realize real-time multi-dimensional aggregation with optimal balance of the memory and the central processing unit.

Further, the components included in the multi-dimensional aggregation system and the number of functions of each component are shown in table 1:

TABLE 1

Example four

Fig. 8 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.

As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the multi-dimensional aggregation method.

In some embodiments, the multi-dimensional aggregation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more of the steps of the multi-dimensional aggregation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the multi-dimensional aggregation method in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and no limitation is imposed herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A multi-dimensional aggregation method, the method comprising:

placing the plurality of data blocks into an aggregation work queue through a plurality of executor operators in an aggregation work task component, and determining a plurality of multi-dimensional combinations through the aggregation work task component, wherein one multi-dimensional combination is composed of data corresponding to at least one column; for a target data block, generating a corresponding directed acyclic graph through the aggregation work task component based on the plurality of multi-dimensional combinations determined by the target data block; traversing the plurality of data blocks through the aggregation work task component, performing single-dimensional aggregation calculation on the data corresponding to the first node of a plurality of complete links in the directed acyclic graph to obtain a plurality of single-dimensional column data, and persisting the data in the plurality of data blocks to disk;

performing aggregation table merging based on the secondary indexes and binary search through the plurality of executor operators to obtain a plurality of target aggregation tables, comprising: for one complete link, determining a merging sequence according to the data flow direction of the nodes in the complete link; according to the merging sequence, merging, based on the secondary indexes and binary search, a first aggregation table corresponding to a first node and a second aggregation table corresponding to a second node in the complete link to obtain an initial target aggregation table; merging, based on the secondary indexes and binary search, the initial target aggregation table with a third aggregation table corresponding to a third node, continuing until the aggregation tables corresponding to all nodes are merged, to obtain a target aggregation table; wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table.

2. The method of claim 1, wherein one single-dimensional column data generates a plurality of secondary indexes, each secondary index having a corresponding key-value pair, a key in the key-value pair representing a single-dimensional data, and a value in the key-value pair representing one of the single-dimensional column data; one secondary index includes a data block index identifying the data block in which the one single-dimensional column data is located and a row index identifying the row in which the one single-dimensional column data is located.

3. The method of claim 1, wherein performing the aggregation table merge based on the secondary index and the binary search results in a plurality of target aggregation tables, comprising:

performing aggregation table merging on the nodes in each complete link in the directed acyclic graph based on the secondary indexes and binary search to obtain a plurality of target aggregation tables, wherein each node is composed of single-dimensional column data.

4. The method of claim 1, wherein merging the first aggregation table corresponding to the first node and the second aggregation table corresponding to the second node based on the secondary index and the binary search to obtain the initial target aggregation table comprises:

determining secondary indexes which are not required to be combined in the second node according to the data block indexes in the plurality of secondary indexes in the first aggregation table corresponding to the first node, and filtering the secondary indexes which are not required to be combined;

when looking up row indexes, searching by binary search and determining a plurality of reference secondary indexes, wherein a reference secondary index is whichever of one secondary index in the first aggregation table and one secondary index in the second aggregation table has the smaller corresponding array;

determining a detection secondary index, wherein the detection secondary index is whichever of the one secondary index in the first aggregation table and the one secondary index in the second aggregation table has the larger corresponding array;

traversing the reference secondary index, and finding a row shared with the reference secondary index from the detection secondary index as a shared row;

merging the shared rows into one secondary index to obtain a merged secondary index;

and combining the plurality of merged secondary indexes obtained into the initial target aggregation table.

5. The method according to claim 1, wherein the method further comprises:

retrieving, through the executor operator, the persisted data from disk according to the secondary indexes in the target aggregation table, and outputting the data to an upstream operator.

6. A multi-dimensional aggregation system, comprising a scheduling component, an aggregation work task component and an aggregation index component, wherein the aggregation work task component is connected to the scheduling component and to the aggregation index component;

the aggregation work task component comprises a calculation unit configured to: place the data blocks into an aggregation work queue through a plurality of executor operators, and determine a plurality of multi-dimensional combinations through the aggregation work task component, wherein one multi-dimensional combination is composed of data corresponding to at least one column; for a target data block, generate a corresponding directed acyclic graph through the aggregation work task component based on the plurality of multi-dimensional combinations determined by the target data block; and traverse the plurality of data blocks through the aggregation work task component, perform single-dimensional aggregation calculation on the data corresponding to the first node of a plurality of complete links in the directed acyclic graph to obtain a plurality of single-dimensional column data, and persist the data in the plurality of data blocks to disk; the aggregation index component is configured to generate secondary indexes corresponding to the plurality of single-dimensional column data, wherein one single-dimensional column data generates a plurality of secondary indexes;

the aggregation work task component is further configured to perform aggregation table merging based on the secondary indexes and binary search through the plurality of executor operators to obtain a plurality of target aggregation tables, wherein an aggregation table is a combination of a series of secondary indexes, and the plurality of secondary indexes generated by one single-dimensional column data form one aggregation table;

the aggregation work task component comprises a merging unit configured to: for one complete link, determine a merging sequence according to the data flow direction of the nodes in the complete link; according to the merging sequence, merge, based on the secondary indexes and binary search, a first aggregation table corresponding to a first node and a second aggregation table corresponding to a second node in the complete link to obtain an initial target aggregation table; and merge, based on the secondary indexes and binary search, the initial target aggregation table with a third aggregation table corresponding to a third node, continuing until the aggregation tables corresponding to all nodes are merged, to obtain the target aggregation table.

7. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the multi-dimensional aggregation method of any one of claims 1-5.

8. A computer readable storage medium storing computer instructions for causing a processor to implement the multi-dimensional aggregation method of any one of claims 1-5 when executed.