CN116975052A

CN116975052A - Data processing method and related equipment

Info

Publication number: CN116975052A
Application number: CN202310546041.XA
Authority: CN
Inventors: 石志林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-05-15
Filing date: 2023-05-15
Publication date: 2023-10-31

Abstract

The embodiment of the application provides a data processing method and related equipment, wherein the method comprises the following steps: acquiring N data tables to be connected, and acquiring the processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of the corresponding data table and comprises a memory area and a disk area; wherein N is an integer greater than 1; determining the connection mode of the N data tables according to the processing states of the N data tables; the connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area; and carrying out connection calculation on the N data tables according to the determined connection mode. According to the embodiment of the application, the efficiency and performance of the connection calculation of a plurality of data tables can be effectively improved.

Description

Data processing method and related equipment

Technical Field

The present application relates to the field of computer technology, and in particular, to a data processing method, a data processing apparatus, a computer device, a computer readable storage medium, and a computer program product.

Background

With the development of the internet, mass data generated in the internet can be effectively managed and accessed through database technology. In the database technology, massive data can be stored in a database through data table association so as to obtain data required by corresponding services more efficiently in a service scene. For example, in the context of data query, if the queried data relates to multiple data tables, then the multiple data tables are connected to obtain the final query result quickly. However, when the data magnitude is large and the number of data tables is large, the performance of connection calculation between the plurality of data tables is poor, and the calculation efficiency is not high enough.

Disclosure of Invention

The embodiment of the application provides a data processing method and related equipment, which can improve the efficiency and the computing performance of connection computing.

In one aspect, an embodiment of the present application provides a data processing method, where the method includes:

acquiring N data tables to be connected, and acquiring the processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of the corresponding data table and comprises a memory area and a disk area; wherein N is an integer greater than 1;

determining the connection mode of the N data tables according to the processing states of the N data tables; the connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area;

and carrying out connection calculation on the N data tables according to the determined connection mode.

In one aspect, an embodiment of the present application provides a data processing apparatus, including:

the acquisition unit is used for acquiring N data tables to be connected and acquiring the processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of the corresponding data table and comprises a memory area and a disk area; wherein N is an integer greater than 1;

the processing unit is used for determining the connection mode of the N data tables according to the processing states of the N data tables; the connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area;

and the processing unit is used for carrying out connection calculation on the N data tables according to the determined connection mode.

In one aspect, an embodiment of the present application provides a computer apparatus, including:

a processor adapted to execute a computer program;

a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements a data processing method as described above.

In one aspect, embodiments of the present application provide a computer readable storage medium storing a computer program loaded by a processor and performing a data processing method as described above.

In one aspect, embodiments of the present application provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, implement the above-described data processing method.

In the embodiment of the application, N data tables to be connected can be obtained, together with the processing states of the N data tables, where N is an integer greater than 1. Each data table corresponds to a respective processing partition, which stores the data of that data table and comprises a memory area and a disk area. Because a processing partition is allocated to each data table, different data tables have separate storage spaces and their data is stored independently. The memory area and the disk area of a processing partition also provide different storage spaces for data of the same data table, so that the data in the corresponding processing partition can be processed selectively when connection calculation is later performed on the N data tables, which improves processing efficiency. The processing states of the N data tables are used to determine the connection mode of the N data tables. A connection mode determined by considering the processing states of all N data tables together is a connection mode well suited to connecting the data tables in their current processing states. The connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area. Each of these connection modes can correspond to processing states of the N data tables. The processing states can be monitored in real time; when a change in the processing states is detected, the connection mode can be adjusted in time, so that the subsequent connection calculation on the N data tables remains matched to their states. Even when no more data can be buffered in memory, the connection calculation on the N data tables can continue to be executed, which improves connection calculation performance. After the connection mode is determined, connection calculation can be performed on the N data tables according to it. Because a connection mode is a connection between areas of the processing partitions, the data of each data table stored in the memory area and/or the disk area can be selected and combined based on the connection mode. Connection calculation performed in the determined connection mode copes better with memory overflow and performs the connection calculation of a plurality of data tables more efficiently.

Drawings

FIG. 1 is a block diagram of a data processing system in accordance with an illustrative embodiment of the present application;

FIG. 2 is a flow chart of a method for processing data according to an exemplary embodiment of the present application;

FIG. 3a is a schematic diagram of data in a data table stored in a corresponding memory area and disk area according to an exemplary embodiment of the present application;

FIG. 3b is a schematic diagram of new data being written to a memory area according to an exemplary embodiment of the present application;

FIG. 3c is a schematic diagram of the arrival of additional data provided by an exemplary embodiment of the present application;

FIG. 4a is a schematic diagram of a memory area to memory area connection according to an exemplary embodiment of the present application;

FIG. 4b is a schematic diagram illustrating a connection between a disk area and a memory area according to an exemplary embodiment of the present application;

FIG. 4c is a schematic diagram of a disk area to disk area connection provided by an exemplary embodiment of the present application;

FIG. 5a is a schematic diagram of a connection tree provided by an exemplary embodiment of the present application;

FIG. 5b is a schematic diagram of a process for optimizing a connection tree according to an exemplary embodiment of the present application;

FIG. 5c is a schematic diagram of a connection calculation between two dynamic tables provided by an exemplary embodiment of the present application;

FIG. 5d is a schematic diagram of a connection between two dynamic tables provided by an exemplary embodiment of the present application;

FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first," "second," and the like in this disclosure are used to distinguish between identical or similar items having substantially the same function and effect. It should be understood that "first," "second," and "n-th" have no logical or chronological dependency on one another and do not limit quantity or order of execution.

The term "at least one" in the present application means one or more, and "a plurality of" means two or more; for example, at least one node means one, two or more nodes, and a plurality of data tables means two or more data tables.

The application provides a data processing scheme, which relates to a data processing system, a data processing method and related equipment. For N (N > 1) data tables to be connected, the scheme can determine the connection mode of the N data tables according to the processing state of each data table, and then perform connection calculation on the N data tables according to the determined connection mode. Determining the connection mode from the processing states of the data tables allows the connection calculation to match those states. A corresponding connection mode can be selected for whatever processing states the N data tables are in, so the connection calculation on the data tables is not interrupted when a processing state changes; instead, a new connection mode is determined, which enables more efficient connection calculation. In addition, each connection mode processes data that a data table stores in its disk area and memory area; connecting the data in the disk area and/or the memory area according to the determined connection mode improves the performance of connection calculation and addresses problems such as memory overflow.

The architecture of a data processing system provided by embodiments of the present application will be described below with reference to the accompanying drawings.

With reference now to FIG. 1, an architecture diagram of a data processing system is depicted in accordance with an illustrative embodiment of the present application. As shown in fig. 1, the data processing system includes a database 101 and a computer device 102; the database 101 may establish a communication connection with the computer device 102 by wire or wirelessly. Wherein the computer device 102 is configured to perform a data processing procedure; database 101 is used to provide data support for data processing by computer device 102.

According to deployment location, the database 101 may be a local database of the computer device 102 or a cloud database capable of establishing a connection with the computer device 102. According to access attribute, the database 101 may be a public database, i.e., a database open to all computer devices, or a private database, i.e., a database open only to specific computer devices, such as the computer device 102. The database 101 may store data tables. Depending on whether a data table changes over time, the data tables may include one or more of dynamic tables and static tables. A dynamic table is a data table that changes over time, generally a table that changes in real time in stream computing, and can be used for storing real-time data; a static table is a data table that does not change over time and is typically used for storing batch data. According to the associations between data tables, the data tables may include one or more of fact tables and dimension tables. A fact table is a data table that records business measurements, and the data in the fact table describes fact information. Illustratively, Table 1 below is an example of a fact table.

Table 1 Fact table

Sales order number	Product ID	Customer ID	Order date	Sales quantity	Sales amount
1	1001	101	2022-01-01	10	1000
2	1002	102	2022-01-02	5	500
3	1003	103	2022-01-03	15	1500

The fact table shown in Table 1 is a sales fact table, the main data table for recording sales information. The columns and their meanings are as follows: (1) sales order number: the unique identifier of the sales order; (2) product ID: the ID of the product sold; (3) customer ID: the ID of the customer purchasing the product; (4) order date: the date on which the order was generated; (5) sales quantity: the quantity of products sold; (6) sales amount: the total sales amount for the corresponding sales quantity. Each row of data represents one fact.

A dimension table is a data table that describes one attribute of a fact. In the fact table example above, each column field may correspond to a dimension table, such as an order date dimension, a product ID dimension, a customer ID dimension, and so forth. Each dimension in the fact table has a dimension table associated with it. The field that establishes the association between the fact table and a dimension table may be referred to as a foreign key field, and it may be associated with a primary key field in the fact table. Illustratively, the primary key field in the fact table is a composite field comprising the sales order number, the product ID and the customer ID; for a dimension table that uses the sales order number as its primary key, the sales order number is then the foreign key field that associates the fact table with that dimension table. Illustratively, Table 2 below is an example of a dimension table.

Table 2 Dimension table

Product ID	Product name	Category	Price	Stock quantity
1001	xx 13	Mobile phone	7999	500
1002	Xx Air	Notebook computer	8999	200
1003	xx Pro	Tablet	5999	300

In addition to the field "product ID" that associates it with the fact table, the dimension table above includes fields describing specific product information: product name, category, price, stock quantity, and so on. A fact table (or dimension table) may be either a dynamic table or a static table, depending on whether it changes over time.
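The association between the two example tables can be illustrated with a small sketch. It is only an illustration under stated assumptions: the rows are copied from Table 1 and Table 2 (with only some columns kept), the product ID is taken as the association field, and the field names and the function `join_on_product_id` are hypothetical names chosen here, not part of the application.

```python
# Rows copied from Table 1 (fact table); dictionary keys are hypothetical names.
fact_rows = [
    {"sales_order_number": 1, "product_id": 1001, "customer_id": 101,
     "order_date": "2022-01-01", "sales_quantity": 10, "sales_amount": 1000},
    {"sales_order_number": 2, "product_id": 1002, "customer_id": 102,
     "order_date": "2022-01-02", "sales_quantity": 5, "sales_amount": 500},
    {"sales_order_number": 3, "product_id": 1003, "customer_id": 103,
     "order_date": "2022-01-03", "sales_quantity": 15, "sales_amount": 1500},
]

# Rows copied from Table 2 (dimension table); only some columns are kept.
dimension_rows = [
    {"product_id": 1001, "product_name": "xx 13", "price": 7999, "stock_quantity": 500},
    {"product_id": 1002, "product_name": "Xx Air", "price": 8999, "stock_quantity": 200},
    {"product_id": 1003, "product_name": "xx Pro", "price": 5999, "stock_quantity": 300},
]


def join_on_product_id(facts, dimensions):
    """Attach the dimension attributes to every fact row with an equal product ID."""
    by_product = {row["product_id"]: row for row in dimensions}
    joined = []
    for fact in facts:
        dimension = by_product.get(fact["product_id"])
        if dimension is not None:
            # Merge the two rows; the shared "product_id" key has the same value in both.
            joined.append({**fact, **dimension})
    return joined


joined_rows = join_on_product_id(fact_rows, dimension_rows)
for row in joined_rows:
    print(row["sales_order_number"], row["product_name"], row["sales_amount"])
```

Each joined row carries both the measured fact (quantity, amount) and the descriptive attributes of the product, which is the kind of result a multi-table connection is meant to produce.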

The data processing flow performed by the computer device 102 may generally include the following steps.

(1) Acquiring N data tables to be connected and the processing states of the N data tables. Each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of the corresponding data table and comprises a memory area and a disk area; N is an integer greater than 1. The N data tables may be obtained from a database, or may be sent to the computer device 102 by another computer device. The N data tables may all be dynamic tables, may all be static tables, or may include both, which is not limited in the present application. The processing state of a data table may be determined in real time by the computer device 102 based on how the data in the data table is being processed, including but not limited to the processing speed and the processing progress. For example, if the processing speed of the data is less than a preset speed threshold, the processing state of the data table may be determined to be a blocking state.

(2) Determining the connection mode of the N data tables according to the processing states of the N data tables. The connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area. The processing states of different data tables among the N data tables may be the same or different. If all N data tables are dynamic tables, new data may be added to them continuously, so their processing states are monitored continuously; the processing state of a data table may change from one time period to another, depending on how its data is processed and how it changes (for example, through additions and deletions). Any one of the above connection modes can be determined according to the processing states of the N data tables. For example, if the N data tables are all in a blocking state, the connection mode may be determined to be the connection of the disk area to the memory area.

(3) Performing connection calculation on the N data tables according to the determined connection mode. Here, connection calculation on the N data tables is the process of merging the data of the data tables according to corresponding rules. When the determined connection mode differs, the objects merged from the N data tables differ accordingly.

In addition, when the data tables are dynamic tables, the connection calculation on the N data tables is performed continuously, and steps (1) to (3) above may be executed repeatedly.
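Step (2) can be sketched as a small selection function. This is a hedged illustration, not the rule set of the application: only the mapping "all tables blocking, so disk area to memory area" comes from the text above. The other two branches are assumptions made here so that the function is total, and the names `ProcessingState`, `ConnectionMode` and `choose_connection_mode` are hypothetical. The non-blocking and exhaustion states are the ones the description defines when discussing step S201.

```python
from enum import Enum


class ProcessingState(Enum):
    BLOCKING = "blocking"
    NON_BLOCKING = "non-blocking"
    EXHAUSTION = "exhaustion"


class ConnectionMode(Enum):
    MEMORY_TO_MEMORY = "memory area to memory area"
    DISK_TO_MEMORY = "disk area to memory area"
    DISK_TO_DISK = "disk area to disk area"


def choose_connection_mode(states):
    """Pick one connection mode from the processing states of the N data tables."""
    states = list(states)
    if len(states) < 2:
        raise ValueError("N must be an integer greater than 1")
    if all(state is ProcessingState.BLOCKING for state in states):
        # Stated in the text: all tables blocking -> disk area to memory area.
        return ConnectionMode.DISK_TO_MEMORY
    if all(state is ProcessingState.EXHAUSTION for state in states):
        # Assumption made for this sketch, not taken from the text.
        return ConnectionMode.DISK_TO_DISK
    # Assumption made for this sketch: any other combination joins in memory.
    return ConnectionMode.MEMORY_TO_MEMORY


mode = choose_connection_mode([ProcessingState.BLOCKING, ProcessingState.BLOCKING])
print(mode.value)
```

For dynamic tables, such a function would be called again whenever the monitored states change, which corresponds to repeating steps (1) to (3).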

The computer device 102 may include either or both of a terminal device and a server. The terminal device includes, but is not limited to, smart phones, tablet computers, intelligent wearable devices, intelligent voice interaction devices, intelligent home appliances, personal computers, vehicle-mounted terminals, intelligent cameras and other devices. The present application does not limit the number of terminal devices. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms, but is not limited thereto. The present application does not limit the number of servers. The computer device 102 may be used as a data processing device to perform the above data processing flow. If there are a plurality of computer devices 102, processing nodes may be distributed among them, where a processing node may be a device or software in a computer device, and the processing nodes may together constitute a distributed processing engine, such as a Flink engine. The data processing flow described above may also be performed by any of the processing nodes in the computer device.

In addition, the data processing flow may involve cloud computing. Cloud computing refers to the delivery and usage mode of IT infrastructure, meaning that required resources are obtained over a network in an on-demand, easily scalable manner; in a broad sense, cloud computing refers to the delivery and usage mode of services, meaning that required services are obtained over a network in an on-demand, easily scalable manner. Such services may be IT, software or internet related, or other services. Cloud computing is a product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing. With the development of the internet, real-time data streams and the diversification of connected devices, and driven by the demands of search services, social networks, mobile commerce, open collaboration and the like, cloud computing has developed rapidly. Cloud computing can provide sufficient computing power for multi-table connection, thereby improving computing efficiency.

Therefore, the data processing system provided by the application can realize multi-table connection. In a specific implementation process, the computer device can acquire N data tables to be connected, together with the processing states of the N data tables, where N is an integer greater than 1. Each data table corresponds to a respective processing partition, which stores the data of that data table and comprises a memory area and a disk area. Because a processing partition is allocated to each data table, different data tables have separate storage spaces and their data is stored independently. The memory area and the disk area of a processing partition also provide different storage spaces for data of the same data table, so that the data in the corresponding processing partition can be processed selectively when connection calculation is later performed on the N data tables, which improves processing efficiency. The processing states of the N data tables are used to determine the connection mode of the N data tables. A connection mode determined by considering the processing states of all N data tables together is a connection mode well suited to connecting the data tables in their current processing states. The connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area. Each of these connection modes can correspond to processing states of the N data tables. The processing states can be monitored in real time; when a change in the processing states is detected, the connection mode can be adjusted in time, so that the subsequent connection calculation on the N data tables remains matched to their states. Even when no more data can be buffered in memory, the connection calculation on the N data tables can continue to be executed, which improves connection calculation performance. After the connection mode is determined, connection calculation can be performed on the N data tables according to it. Because a connection mode is a connection between areas of the processing partitions, the data of each data table stored in the memory area and/or the disk area can be selected and combined based on the connection mode. Connection calculation performed in the determined connection mode copes better with memory overflow and performs the connection calculation of a plurality of data tables more efficiently.

The data processing system and the data processing method provided by the embodiment of the application can be applied to the following business scenes:

(I) Multi-table query scenario

A multi-table query scenario involves connecting multiple data tables. If the data tables are dynamic tables, a query on a dynamic table is a continuous query that does not terminate; the result table is also a dynamic table and is continually updated by the query to reflect changes in its input tables (i.e., the original dynamic tables). During the continuous query, connection calculation is performed between the dynamic tables to obtain the result table. The connection between dynamic tables may be referred to simply as a dynamic table connection (join): rows whose association fields are equal in the two dynamic tables are combined to form a new dynamic data set. The data processing method provided by the application can determine the connection mode based on the current processing states of the dynamic tables, so that the connection can be performed more efficiently according to that connection mode.
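A continuously updated join between two dynamic tables can be sketched with a symmetric hash join, a well-known streaming technique used here purely as an illustration; the application does not state that it uses this algorithm. The class `SymmetricHashJoin`, its method `on_row`, and the side labels "left" and "right" are hypothetical names, and everything is kept in memory for brevity.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Equi-join of two input streams whose result grows as rows arrive."""

    def __init__(self):
        # One hash index per input, keyed by the value of the association field.
        self.index = {"left": defaultdict(list), "right": defaultdict(list)}
        self.result = []  # the continuously updated result table

    def on_row(self, side, key, row):
        """Store an arriving row, probe the other input, return the new result rows."""
        if side not in self.index:
            raise ValueError("side must be 'left' or 'right'")
        other = "right" if side == "left" else "left"
        self.index[side][key].append(row)
        new_rows = []
        for match in self.index[other].get(key, []):
            # Keep a fixed (left, right) order regardless of which side arrived.
            pair = (row, match) if side == "left" else (match, row)
            new_rows.append(pair)
        self.result.extend(new_rows)
        return new_rows


join = SymmetricHashJoin()
first = join.on_row("left", 1001, {"order": 1})
second = join.on_row("right", 1001, {"name": "xx 13"})
third = join.on_row("left", 1001, {"order": 4})
print(len(join.result))
```

Because every arriving row is both stored and probed, the result table reflects each change in either input without re-running the query, which matches the continuous-query behaviour described above.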

(II) Real-time report analysis scenario

Real-time report analysis scenarios include, but are not limited to, real-time report analysis in advertisement recommendation scenarios, live broadcast scenarios and the like. In an advertisement recommendation scenario, various advertisement data may be stored in different dynamic tables. The advertisement data includes, but is not limited to: advertisement ID, view count, click count, exposure rate, etc. Connecting the dynamic tables to analyze the various advertisement data allows better advertisement recommendations to be made. In a live broadcast scenario, a live real-time dashboard can be analyzed, and analyzing the live real-time dashboard also involves multi-table connection queries. In these scenarios, the data processing method provided by the application can be adopted to perform efficient connection calculation and obtain analysis results quickly.

(III) Financial risk control scenario

In a financial risk control scenario, financial data needs to be monitored in real time, for example through log anomaly monitoring. While monitoring data in real time, data in different data tables may need to be queried, which involves connection queries between a plurality of data tables. With the data processing method provided by the application, the connection mode can be determined according to the processing states of the data tables, and connection calculation can then be performed on the data tables more efficiently according to that connection mode, which improves the speed of data acquisition.

It should be noted that the above service scenarios are merely exemplary; the data processing scheme provided by the present application is not limited thereto and may also be applied to, for example, game scenarios, e-commerce sales scenarios, and the like. In the above scenarios, the improved connection calculation efficiency effectively reduces the delay of data processing in the corresponding field (such as the delay of real-time data monitoring and of real-time report analysis and calculation), giving good real-time performance. In a real-time data processing scenario, efficient connection calculation improves the throughput of real-time computation and copes with connection calculation when the volume of real-time data is large and the number of data tables is large. It helps ensure that the speed at which a node in the data pipeline processes data keeps up with the speed at which data is generated upstream of that node, avoiding back pressure and thus preventing back pressure from propagating upstream from the node to the data source and affecting the ingestion speed of the data source. Back pressure here refers to the phenomenon in real-time data processing in which data is generated upstream of a node in a data pipeline (a channel for transmitting data) faster than the node can process it. In addition, the computing performance of real-time computation is effectively improved, and connection calculation on dynamic tables is performed more efficiently.

The following describes a data processing method provided by an embodiment of the present application.

Fig. 2 is a flow chart of a data processing method according to an exemplary embodiment of the application. The data processing method may be performed by a computer device, such as the computer device in the data processing system shown in fig. 1, and the data processing method may comprise the following steps S201-S204.

S201, acquiring N data tables to be connected, and acquiring the processing states of the N data tables.

N is an integer greater than 1, so the computer device obtains at least two data tables to be connected. In one possible implementation, the corresponding data tables may be obtained from the database according to the data table identifiers included in connection indication information. A data table identifier uniquely identifies a data table and may be a letter, a number, a character string, or the like, which is not limited in the present application. The connection indication information indicates the connection conditions between the N data tables to be connected and includes the data table identifier of each data table. Illustratively, the connection indication information is: where A.key1=B.key and A.key2=C.key and D.key1=E.key and D.key2=F.key, in which A, B, C, D, E and F are data table identifiers.
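As a rough illustration of how data table identifiers might be read out of such connection indication information, the sketch below parses the example condition string, written here with the upper-case identifiers A to F that the text names. The function name `parse_connection_indication` and the regular expression are assumptions made for this example; the application does not specify a parsing method or an exact syntax.

```python
import re

# Matches one equality condition of the form <table>.<field>=<table>.<field>.
CONDITION = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")


def parse_connection_indication(text):
    """Return (table identifiers in order of first appearance, list of conditions)."""
    conditions = [match.groups() for match in CONDITION.finditer(text)]
    tables = []
    for left_table, _, right_table, _ in conditions:
        for table in (left_table, right_table):
            if table not in tables:
                tables.append(table)
    return tables, conditions


indication = "where A.key1=B.key and A.key2=C.key and D.key1=E.key and D.key2=F.key"
tables, conditions = parse_connection_indication(indication)
print(tables)       # ['A', 'B', 'C', 'D', 'E', 'F']
print(conditions[0])  # ('A', 'key1', 'B', 'key')
```

The identifiers recovered this way could then be used to fetch the N data tables from the database, and the conditions give the pairs of association fields to compare.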

Each data table in the N data tables corresponds to a respective processing partition, and the processing partitions are used for storing data of the corresponding data table. The computer device may allocate a corresponding processing partition to each data table in advance, where the processing partitions corresponding to different data tables are independent of each other, so that the storage spaces of the data in the different data tables are independent. For example, data table a corresponds to processing partition a, which may be used to store data in data table a, and data table B corresponds to processing partition B, which may be used to store data in data table B. Processing partition a and processing partition b are separate and distinct partitions provided by the computer device.

The processing partition comprises a memory area and a disk area. Data of the data table is stored in the corresponding memory area first; if the data table still has remaining data but the memory area cannot be written (for example, the memory is full), the remaining data can be written into the disk area. The disk area is empty until data is written to it. When the data table is a dynamic table, if the memory area corresponding to the dynamic table cannot accept newly added data, the newly added data can be flushed to the disk area, that is, stored directly in the disk area. In the present application, a data table that is a dynamic table may also be referred to as an input stream, and the presence of newly added data to be processed in the dynamic table means that new data has arrived on the input stream. In one implementation, each data table may be stored in the corresponding memory area or disk area in the form of data blocks, and one data block may contain one or more pieces of data of the data table. Further, after connection calculation has been performed on the data in a data block stored in the memory area, the data in that block can be written into the disk area, freeing memory space in the memory area to store new data.
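
The behaviour of a processing partition described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name ProcessingPartition, the capacity counted in records, and the use of an in-process list to stand in for the disk area are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingPartition:
    """One partition per data table: a bounded memory area plus a disk area."""

    memory_capacity: int  # assumed limit, counted in records for simplicity
    memory_area: List[str] = field(default_factory=list)
    disk_area: List[str] = field(default_factory=list)  # stands in for disk storage

    def write(self, record: str) -> str:
        """Store a record in memory if it fits, otherwise flush it to disk."""
        if len(self.memory_area) < self.memory_capacity:
            self.memory_area.append(record)
            return "memory"
        self.disk_area.append(record)
        return "disk"

    def evict_processed(self, record: str) -> None:
        """After connection calculation, move a record to disk to free memory."""
        self.memory_area.remove(record)
        self.disk_area.append(record)


partition_a = ProcessingPartition(memory_capacity=2)
placements = [partition_a.write(r) for r in ["x1", "x2", "x3"]]
```

In this sketch the third record goes to the disk area because the memory area is full; once a processed record is evicted, the memory area can accept new data again.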

The processing state of a data table indicates whether newly added data of the data table can be written into the corresponding memory area, or whether the data of the data table stored in the corresponding memory area has been processed. The processing state may be any of the following: blocking state, non-blocking state, depletion state. The blocking state is a state in which newly added data of the data table cannot be written into the corresponding memory area; the non-blocking state is a state in which newly added data of the data table can be written into the corresponding memory area; the depletion state is a state in which the data table has no newly added data and the data of the data table stored in the corresponding memory area has been processed. The processing states of different data tables may be the same or different. For example, the processing state of data table A is the blocking state and the processing state of data table B is the non-blocking state; for another example, the processing states of data table C and data table D are both the depletion state.

In one embodiment, whether a data table is in the blocking state may be determined based on whether the data table receives newly added data within a period of time, or whether the processing speed of the data in the data table is less than a speed threshold. The following describes how the processing state of a data table is acquired, taking any one of the N data tables as an example.

First, the items involved in judging the processing state of a data table are briefly described. Any one of the N data tables is denoted data table A, and data table A corresponds to processing partition a, which comprises memory area a1 and disk area a2. Data table A is a dynamic table; the memory hash table corresponding to data table A is stored in memory area a1 and comprises one or more hash values, each hash value representing one piece of data of data table A stored in memory area a1. The newly added data of data table A is data i₁. In one implementation, each hash value in the memory hash table corresponding to data table A is obtained by applying a hash function to the corresponding data of data table A stored in the memory area. For example, if data x of data table A is stored in the memory area of data table A, the hash value of data x is obtained by hashing data x with the hash function. Since all or part of the data of data table A may be stored in the memory area, the memory hash table may include hash values for all or part of the data of data table A. If the memory hash table includes hash values for only part of the data of data table A, it may be regarded as the memory-resident portion of the target hash table corresponding to data table A, where the target hash table includes the hash values of all data in data table A and is a complete hash table.

Based on this, the processing state of data table A may be determined in the following mode one, mode two, and mode three.

Mode one. The newly added data of data table A is data i₁. When data table A is a dynamic table, data may continuously arrive at data table A; this is the data to be added to data table A, that is, the newly added data. Other computer devices, or other processing nodes in the computer device, may calculate the hash value of the newly added data and send the calculated hash value to the processing node in the computer device that performs the data processing.

Based on this, the computer device may determine the processing state of data table A according to whether the hash value can be stored in the memory area: when the hash value of data i₁ is received, the remaining memory space of memory area a1 is detected; if the remaining memory space of memory area a1 is greater than or equal to the storage space required by the hash value of data i₁, the processing state of data table A is determined to be the non-blocking state; if the remaining memory space of memory area a1 is less than the storage space required by the hash value of data i₁, the processing state of data table A is determined to be the blocking state.

Specifically, on receiving the hash value of data i₁, the computer device can detect whether memory area a1 has enough memory space to store it, by judging whether the remaining memory space of memory area a1 is greater than or equal to the storage space required by the hash value of data i₁. When it is, memory area a1 has enough memory space to store the hash value of data i₁, and the processing state of data table A may be determined to be the non-blocking state. Conversely, when the remaining memory space of memory area a1 is less than the storage space required by the hash value of data i₁, memory area a1 does not have enough memory space, and the processing state of data table A may be determined to be the blocking state. In that case, data i₁ can be written to disk area a2, and the hash value of data i₁ can likewise be stored in disk area a2.
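
The comparison in mode one can be sketched as follows. The names State and state_by_memory, and the measurement of space in bytes, are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    BLOCKING = "blocking"
    NON_BLOCKING = "non-blocking"


def state_by_memory(remaining_bytes: int, hash_size_bytes: int) -> State:
    """Mode one: non-blocking only if the hash value fits in the memory area."""
    if remaining_bytes >= hash_size_bytes:
        return State.NON_BLOCKING
    return State.BLOCKING


# Hypothetical sizes: an 8-byte hash value against two memory areas.
roomy = state_by_memory(remaining_bytes=64, hash_size_bytes=8)
full = state_by_memory(remaining_bytes=4, hash_size_bytes=8)
```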

Mode two. The processing speed of the data of data table A stored in memory area a1 is acquired; if the processing speed is greater than or equal to a preset speed threshold, the processing state of data table A is determined to be the non-blocking state; if the processing speed is less than the preset speed threshold, the processing state of data table A is determined to be the blocking state.

The processing speed of the data of data table A stored in memory area a1 can be expressed as the amount of data leaving memory area a1 per unit time. For example, if 10 pieces of data leave memory area a1 per second, the processing speed of the computer device for the data of data table A is 10 pieces per second. The processing state of data table A may then be determined by comparing the processing speed with the preset speed threshold. The preset speed threshold may be set empirically or determined from historical processing speeds of the computer device; it is the speed at which data processing can keep pace with data generation. The data generation speed is the amount of data received by the computer device per unit time, that is, the speed at which data is generated upstream, where upstream refers to other processing nodes or other computer devices that transmit data to the processing node in the computer device.

Specifically, if the processing speed is greater than or equal to the preset speed threshold, the data in memory area a1 is being processed at the required speed and the processing node can keep pace with the speed at which data is generated upstream of it; the processing state of data table A can then be determined to be the non-blocking state, and the upstream data generation speed is not affected. Otherwise, if the processing speed is less than the preset speed threshold, the data in memory area a1 is not being processed at the required speed and the processing node may fall behind the upstream data generation speed, so back pressure may occur and the processing state of data table A can be determined to be the blocking state. Back pressure may propagate upstream from the node to the data source and reduce the ingestion rate of the data source; as the ingestion rate decreases, data table A may eventually change from the blocking state to the non-blocking state.

In one implementation, the processing state may also be determined by combining mode one and mode two. Illustratively, when the computer device continuously receives hash values of newly added data, data is continuously entering from the upstream data stream; even when the computer device has enough storage space to store the hash values of the newly added data, data table A can still be determined to be in the blocking state if the processing speed of its data stored in the memory area is less than the preset speed threshold.
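
Mode two and its combination with mode one can be sketched as follows. The function names and units are hypothetical. The text only states that a slow table may be judged blocking even when memory space is sufficient; treating the table as non-blocking only when both checks pass is an assumption of this sketch.

```python
from enum import Enum


class State(Enum):
    BLOCKING = "blocking"
    NON_BLOCKING = "non-blocking"


def state_by_speed(processed_per_second: float, speed_threshold: float) -> State:
    """Mode two: compare the processing speed with the preset threshold."""
    if processed_per_second >= speed_threshold:
        return State.NON_BLOCKING
    return State.BLOCKING


def combined_state(
    remaining_bytes: int,
    hash_size_bytes: int,
    processed_per_second: float,
    speed_threshold: float,
) -> State:
    """Combined check (assumption): non-blocking only if both modes agree."""
    has_room = remaining_bytes >= hash_size_bytes
    fast_enough = processed_per_second >= speed_threshold
    return State.NON_BLOCKING if has_room and fast_enough else State.BLOCKING


# Enough memory, but processing is slower than the threshold.
slow_but_roomy = combined_state(64, 8, processed_per_second=5, speed_threshold=10)
```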

Mode three. The processing progress of the data of data table A stored in memory area a1 is acquired; if the processing progress indicates that the data of data table A stored in memory area a1 has been processed and data table A has no newly added data, the processing state of data table A is determined to be the depletion state.

The processing progress of the data of data table A stored in memory area a1 indicates whether all data stored in memory area a1 corresponding to data table A has been processed. If the processing progress indicates that the data of data table A stored in memory area a1 has been processed and data table A has no newly added data, connection calculation has been performed on the data in memory area a1 and no new data is arriving at data table A, so the processing state of data table A can be determined to be the depletion state. In the depletion state, a cleanup phase may be entered to process data that was missed earlier, such as data initially written to disk because the memory area lacked the space to store its hash value.

It should be noted that if the processing progress indicates that the data of data table A stored in memory area a1 has not all been processed, or data table A has newly added data, then connection calculation has not yet been performed on all data in memory area a1, or new data is arriving at data table A. In that case it may be determined that data table A is not in the depletion state, and its processing state may be further determined using mode one or mode two.
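
Mode three, together with the fallback to mode one or mode two, can be sketched as follows. The function processing_state and the use of a callable for the fallback check are assumptions made for illustration.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    BLOCKING = "blocking"
    NON_BLOCKING = "non-blocking"
    DEPLETED = "depleted"


def processing_state(
    memory_data_processed: bool,
    has_new_data: bool,
    fallback: Callable[[], State],
) -> State:
    """Mode three first; otherwise defer to mode one or two via `fallback`."""
    if memory_data_processed and not has_new_data:
        return State.DEPLETED
    return fallback()


# Hypothetical fallback standing in for a mode-one or mode-two check.
depleted = processing_state(True, False, fallback=lambda: State.BLOCKING)
still_active = processing_state(True, True, fallback=lambda: State.NON_BLOCKING)
```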

For any one of the N data tables, the processing state of the corresponding data table can be obtained by adopting the contents described in the first to third modes, so that the connection mode of each data table can be conveniently determined according to the processing state of each data table. By adopting the mode, the processing state of the data table can be accurately judged according to one or more of the processing conditions (such as processing progress, processing speed and the like) of the data in the data table and the new condition of the data, and an accurate basis is provided for the determination of the subsequent connection mode.

S202, determining the connection mode of the N data tables according to the processing states of the N data tables.

Wherein, the connection mode can comprise at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area.

When performing a multi-table connection, the connection calculation may be completed in the memory area, which reduces the number of times the data tables are read and thereby improves performance. However, if a data table is too large, it may not be completely buffered in the memory area during the connection calculation, which may cause the processing memory to overflow. In this case, a suitable connection mode can be determined based on the processing state of each data table to be connected, and the memory area and the disk area are used together to handle processing memory overflow.

The following describes the implementation of determining the connection manner of each data table according to the processing state of each data table, and may specifically include the implementation shown in the following (1) - (3).

(1) If at least one data table in the N data tables is in a non-blocking state, determining that the connection mode of the N data tables is the connection of the memory area to the memory area.

At least one of the N data tables being in the non-blocking state covers two cases: first, all N data tables are in the non-blocking state; second, some of the data tables are in the non-blocking state and the others are in other processing states (for example, the blocking state or the depletion state). When a data table in the non-blocking state exists, its newly added data can be written into the corresponding memory area and then processed there, so the connection mode of the N data tables can be the connection of the memory area to the memory area.

For example, if N is 2 and one of the 2 data tables is in a non-blocking state, the connection manner of the two data tables may be determined as the connection between the memory areas (also may be simply referred to as a memory join memory). If the value of N is greater than 2, one or more data tables exist in the N data tables in a non-blocking state, and the connection mode of the N data tables can be determined to be the connection of the memory area to the memory area. In this connection mode, connection calculation is specifically performed on data in each memory area.

(2) If the N data tables are in the blocking state, determining that the connection mode of the N data tables is the connection of the disk area to the memory area.

If all the data tables to be connected are in the blocking state, the newly added data of each data table cannot be written into the corresponding memory area. The computer device can then use the idle time to process the data of each data table stored in the disk area, and can determine that the connection mode of the N data tables is the connection of the disk area to the memory area (also simply referred to as disk join memory). In this connection mode, the data in the disk area is used to perform connection calculation with the unprocessed data in the memory area.

(3) If all the N data tables are in the depletion state, the connection mode of the N data tables is determined to be the connection of the disk area to the disk area.

If all the data tables to be connected are in the depletion state, the data of each data table stored in the corresponding memory area has been processed and no data table has newly added data. All data missed by the two preceding connection modes, namely the connection of the memory area to the memory area and the connection of the disk area to the memory area, can then be cleaned up, so the connection mode of the N data tables can be determined to be the connection of the disk area to the disk area (also simply referred to as disk join disk). In this connection mode, connection calculation can be performed on the data in each disk area for which connection calculation has not yet been performed.

The determining logic of the connection modes described in the above (1) - (3) may specifically integrate the processing states of the N data tables to determine the connection modes of the N data tables, where any one of the connection modes corresponds to a combination of the processing states of each data table, and when the processing states of the N data tables change, the connection mode may also change accordingly, so that the connection modes of the N data tables may be adjusted in time, so that the optimal connection calculation is performed on the N data tables subsequently. Based on the determination of the connection mode, even if more data cannot be buffered to the memory, the connection calculation of the N data tables can be guaranteed to be executed all the time, and further the connection calculation performance can be improved.
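
The determination logic of (1)-(3) can be sketched as follows. The names State, JoinMode and choose_join_mode are hypothetical. The text does not state which connection mode applies when no data table is non-blocking and the tables are a mix of blocking and depleted, so this sketch returns None for that combination rather than guessing.

```python
from enum import Enum
from typing import Optional, Sequence


class State(Enum):
    BLOCKING = "blocking"
    NON_BLOCKING = "non-blocking"
    DEPLETED = "depleted"


class JoinMode(Enum):
    MEMORY_TO_MEMORY = "memory join memory"
    DISK_TO_MEMORY = "disk join memory"
    DISK_TO_DISK = "disk join disk"


def choose_join_mode(states: Sequence[State]) -> Optional[JoinMode]:
    """Map the processing states of the N tables to a connection mode."""
    if len(states) < 2:
        raise ValueError("N must be an integer greater than 1")
    if any(s is State.NON_BLOCKING for s in states):
        return JoinMode.MEMORY_TO_MEMORY      # rule (1)
    if all(s is State.BLOCKING for s in states):
        return JoinMode.DISK_TO_MEMORY        # rule (2)
    if all(s is State.DEPLETED for s in states):
        return JoinMode.DISK_TO_DISK          # rule (3)
    return None  # combination not covered by rules (1)-(3)


mode = choose_join_mode([State.BLOCKING, State.NON_BLOCKING, State.DEPLETED])
```

Because the function is re-evaluated whenever the states change, the selected connection mode follows the processing states of the data tables over time.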

S203, performing connection calculation on the N data tables according to the determined connection mode.

Connection calculation of the N data tables means that the data in two or more data tables is merged according to the corresponding rule. In one embodiment, the data tables are dynamic tables; since a dynamic table is a time-varying data table to which data can be continuously added, the processing state of a data table may change continuously. Thus, the connection calculation of the N data tables may involve one or more of the connection modes mentioned above: the connection of the memory area to the memory area, the connection of the disk area to the memory area, and the connection of the disk area to the disk area. For example, in the connection calculation of 3 data tables, as data continuously arriving at the data tables affects their processing states, the connection of the memory area to the memory area, the connection of the disk area to the memory area, and the connection of the disk area to the disk area are adopted in sequence, so that connection calculation of all the data in the 3 data tables is completed.

Next, the logic for performing connection calculation on the N data tables under each connection mode is described in detail. In the embodiment of the application, N is taken to be 2 for illustration, that is, connection calculation is performed on two data tables according to the determined connection mode. It will be appreciated that if N is greater than 2, the same processing logic applies to the connection between any two data tables.

In one embodiment, any two data tables to be connected among the N data tables are denoted data table A and data table B. Data table A corresponds to processing partition a, which comprises memory area a1 and disk area a2; data table B corresponds to processing partition b, which comprises memory area b1 and disk area b2. Processing partition a and processing partition b are different partitions. Processing partition a (including memory area a1 and disk area a2) is used for storing data of data table A, and processing partition b (including memory area b1 and disk area b2) is used for storing data of data table B. Based on this, performing connection calculation on the N data tables according to the determined connection mode includes the following (1-1)-(3-1):

(1-1) If the determined connection mode is the connection of the memory area to the memory area, merging the data of data table A stored in memory area a1 with the data of data table B stored in memory area b1.

Under the memory-area-to-memory-area connection mode, the data stored in the memory areas corresponding to the different data tables is merged. Taking data table A and data table B as an example, the data of data table A is stored in memory area a1 and the data of data table B is stored in memory area b1, and the data in these two memory areas can be merged. Illustratively, if the N data tables include data table A, data table B and data table C, where data table C corresponds to processing partition c, processing partition c includes memory area c1 and disk area c2, and memory area c1 stores data of data table C, then the data stored in memory area a1, memory area b1 and memory area c1 may be merged.

(2-1) If the determined connection mode is the connection of the disk area to the memory area, merging the data of data table A stored in disk area a2 with the data of data table B stored in memory area b1.

Under the disk-area-to-memory-area connection mode, the data in the disk area corresponding to one data table is merged with the data in the memory areas corresponding to the other data tables. Taking data table A and data table B as an example, data of data table A is stored in disk area a2 and data of data table B is stored in memory area b1, and the data in disk area a2 and the data in memory area b1 may be merged. Illustratively, if the N data tables include data table A, data table B and data table C, the disk area corresponding to one data table, for example disk area b2 of data table B, may be selected, and the data of data table B stored in disk area b2 is then merged with the data of data table A stored in memory area a1 and the data of data table C stored in memory area c1.

(3-1) If the determined connection mode is the connection of the disk area to the disk area, merging the data of data table A stored in disk area a2 with the data of data table B stored in disk area b2.

Under the disk-area-to-disk-area connection mode, the data of each data table stored in its disk area can be merged. Taking data table A and data table B as an example, data of data table A is stored in disk area a2 and data of data table B is stored in disk area b2, and the data in these two disk areas may be merged. Illustratively, if the N data tables include data table A, data table B and data table C, where data table C corresponds to processing partition c, processing partition c includes memory area c1 and disk area c2, and disk area c2 stores data of data table C, then the data stored in disk area a2, disk area b2 and disk area c2 may be merged.

It should be noted that, the merging of the data in the foregoing (1-1) to (3-1) refers to merging of the matched data, and specifically may be merging of the data based on the table field of the connection, for example, the data table a and the data table B are connected by the field attribute t, that is, A.t = B.t.
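
Merging matched data on a connection field such as A.t = B.t can be sketched as follows. The representation of each piece of data as a Python dict and the function name merge_on_field are assumptions for illustration; the same matching rule applies regardless of whether the inputs come from a memory area or a disk area.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def merge_on_field(
    rows_a: List[Row], rows_b: List[Row], field: str = "t"
) -> List[Tuple[Row, Row]]:
    """Pair every row of A with every row of B that has the same field value."""
    index_b: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows_b:
        index_b[row[field]].append(row)
    return [
        (row_a, row_b)
        for row_a in rows_a
        for row_b in index_b.get(row_a[field], [])
    ]


# Hypothetical rows of data table A and data table B.
rows_a = [{"t": 1, "a": "x"}, {"t": 2, "a": "y"}]
rows_b = [{"t": 2, "b": "p"}, {"t": 3, "b": "q"}]
merged = merge_on_field(rows_a, rows_b)
```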

In one embodiment, each memory area includes an input buffer area and an output buffer area, each of which is a reserved storage space in the memory area; the input buffer area can be used for buffering input data, and the output buffer area can be used for buffering output data. For example, data read from the disk area may be placed in the output buffer area, and the computer device may then fetch the data from the output buffer area, thereby reducing read/write operations on the disk area. For another example, newly added data of the data table may be placed in the input buffer area for processing and written to the disk area after it has been processed there. In one implementation, under the disk-area-to-memory-area connection mode, the data in the disk area may first be read into the output buffer area, and the computer device may read the corresponding data from the output buffer area and then merge it with the data stored in the other memory areas. Under the disk-area-to-disk-area connection mode, the data to be connected in the disk areas can be read into the output buffer area, and the connection calculation is then performed in the memory area.

Next, the specific implementation of merging the data of the data tables under each connection mode is described in detail. For ease of understanding, data table A and data table B are again taken as examples. The memory hash table corresponding to data table A is stored in memory area a1; the memory hash table corresponding to data table B is stored in memory area b1 and includes one or more hash values, each hash value representing one piece of data of data table B stored in memory area b1. When data (for example, data t) of a data table (for example, data table A) leaves the memory area (for example, memory area a1), such as when data t is written from the memory area to the disk area after being processed, the hash value of that data (for example, the hash value of data t) is also deleted from the memory hash table corresponding to the data table (for example, data table A).

Based on this, the specific implementation under each connection mode includes the following (one)-(three).

(one) Data table A and data table B are dynamic tables, and the newly added data of data table A is data i₁. If the determined connection mode is the connection of the memory area to the memory area, merging the data of data table A stored in memory area a1 with the data of data table B stored in memory area b1 includes the following steps S11-S13.

Step S11: if the determined connection mode is the connection of the memory area to the memory area, when the hash value of data i₁ is received, the hash value of data i₁ is inserted into the memory hash table corresponding to data table A stored in memory area a1.

If the connection mode is determined to be the connection of the memory area to the memory area, at least one of the N data tables is in the non-blocking state. Assuming that data table A is in the non-blocking state, memory area a1 of data table A has enough memory space to store the hash value of data i₁. Thus, on receiving the hash value of data i₁, the computer device can directly insert it into the memory hash table corresponding to data table A for use in subsequent probing.

It will be appreciated that under the memory-area-to-memory-area connection mode, if data table A is in the blocking state (for example, memory area a1 of data table A is full), data i₁ may be flushed to disk area a2 and processed at a subsequent stage. When data table A is in the blocking state and data table B is in the non-blocking state, the newly added data j₁ of data table B can be merged, in a manner similar to steps S11-S13, with the data of the other data tables stored in the corresponding memory areas.

Step S12: based on the hash value of data i₁, the memory hash table corresponding to data table B stored in memory area b1 is probed to obtain the hash value of data j₁.

In one possible implementation, the memory hash table corresponding to data table B stored in memory area b1 may be searched, based on the hash value of data i₁, for a hash value consistent with the hash value of data i₁, and the data corresponding to the found hash value is determined to be data j₁. Data j₁ is the matching item obtained by probing, that is, the data to be connected with data i₁.

Step S13: the hash value of data i₁ and the hash value of data j₁ are merged.

Data of a data table can be represented in the memory area or the disk area by its hash value, and a mapping relation exists between each hash value and the corresponding data. Thus, after the hash value of data j₁ is obtained, the hash value of data i₁ can be merged with the hash value of its matching item (data j₁); based on the mapping between hash values and data, this represents merging data i₁ and data j₁ to obtain one piece of result data {i₁, j₁}.

It will be appreciated that probing based on the hash value of data i₁ may not find a matching item to be connected with data i₁. In this case, step S12 may be repeated as new data arrives in the other data table, until a matching item is detected. If no matching item is ever detected, the newly added data is written into the disk area when data table A is in the depletion state and is merged under the disk-area-to-disk-area connection mode.
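
Steps S11-S13 can be sketched as an insert, probe and merge routine. This is a minimal illustration under stated assumptions: each piece of data is modelled as a (connection key, payload) tuple, the hash value is Python's built-in hash of the key, and the extra key comparison that guards against hash collisions is an addition of this sketch, not something stated in the text. The name insert_probe_merge is hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Record = Tuple[str, str]  # (connection key, payload), assumed layout
MemoryHashTable = DefaultDict[int, List[Record]]


def insert_probe_merge(
    record: Record, own: MemoryHashTable, other: MemoryHashTable
) -> List[Tuple[Record, Record]]:
    """Handle one newly added record while in memory-to-memory mode."""
    key = record[0]
    hash_value = hash(key)
    own[hash_value].append(record)               # S11: insert into own table
    candidates = other.get(hash_value, [])       # S12: probe the other table
    # S13: merge with matching items (key check guards against collisions)
    return [(record, match) for match in candidates if match[0] == key]


table_a: MemoryHashTable = defaultdict(list)
table_b: MemoryHashTable = defaultdict(list)

# Data i1 arrives at table A first: nothing to match yet, but it is retained.
first = insert_probe_merge(("k1", "row-from-A"), table_a, table_b)
# Data j1 arrives at table B later and finds i1 when probing table A.
second = insert_probe_merge(("k1", "row-from-B"), table_b, table_a)
```

Because every arriving record is both inserted and used to probe, a pair is produced as soon as its second member arrives, whichever data table it belongs to.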

(two) The data of data table A stored in disk area a2 includes data i₂, where data i₂ is any data stored in disk area a2 corresponding to data table A. If the determined connection mode is the connection of the disk area to the memory area, merging the data of data table A stored in disk area a2 with the data of data table B stored in memory area b1 includes the following steps S21-S22.

Step S21: if the determined connection mode is the connection of the disk area to the memory area, the memory hash table corresponding to data table B stored in memory area b1 is probed, based on the hash value of data i₂, to obtain the hash value of data j₂.

If the determined connection mode is the connection of the disk area to the memory area, all the data tables are in the blocking state. Matching data may then be probed from the memory areas of the other data tables based on the data in disk area a2. The hash value of data i₂ may be one of the hash values in the disk hash table corresponding to data table A and is used to represent data i₂. The disk hash table corresponding to data table A includes one or more hash values, each representing one piece of data of data table A stored in disk area a2. The disk hash table may be understood as the disk-resident portion of the target hash table corresponding to data table A. The target hash table includes the hash values of all data in data table A and is a complete hash table.

In one possible implementation, the memory hash table corresponding to data table B stored in memory area b1 may be searched, based on the hash value of data i₂, for a hash value consistent with the hash value of data i₂, and the data corresponding to the found hash value is determined to be data j₂. Data j₂ is the matching item obtained by probing, that is, the data to be connected with data i₂.

Step S22: the hash value of data i₂ and the hash value of data j₂ are merged.

Data of a data table can be represented in the memory area or the disk area by its hash value, and a mapping relation exists between each hash value and the corresponding data. Thus, after the hash value of data j₂ is obtained, the hash value of data i₂ can be merged with the hash value of its matching item (data j₂); based on the mapping between hash values and data, this represents merging data i₂ and data j₂ to obtain one piece of result data {i₂, j₂}. Under the connection of the disk area to the memory area, data table A and data table B may each be a dynamic table or a static table, which is not limited by the present application.

It can be seen that, for a plurality of data tables to be connected, if every data table is blocked, the connection mode can be determined as the connection of the disk area to the memory area, and the data in the disk areas can be processed in this mode. The computer device can read the disk-area data of any data table into the memory area and then detect the memory hash tables of the other data tables stored in their memory areas, so that the data read into the memory area is merged with the matching items found by detection to obtain result data. In this way, even when the data tables are in a blocking state, the connection calculation can continue according to steps S21-S22 and generate result data. The connection calculation of the N data tables therefore proceeds regardless of whether the data tables are blocked and keeps running until all data has been processed, which speeds up the connection calculation and yields the final result sooner.

In addition, while processing proceeds in the connection mode of the disk area to the memory area, a data table in the blocking state (such as data table A or data table B) may change from the blocking state to the non-blocking state as its data continues to be processed. The connection calculation among the N data tables can then use the connection of the memory area to the memory area described in (one) above.
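The switching between connection modes can be sketched as a small selection function. This is only an illustration under stated assumptions: the enum names are hypothetical, the rule that all tables being exhausted selects the connection of the disk area to the disk area is taken from the description of that mode, and the handling of a mix of blocked and exhausted tables is a guess that the application does not spell out.

```python
from enum import Enum

class State(Enum):
    NON_BLOCKED = "non-blocked"
    BLOCKED = "blocked"
    EXHAUSTED = "exhausted"

class Mode(Enum):
    MEMORY_TO_MEMORY = "memory-to-memory"
    DISK_TO_MEMORY = "disk-to-memory"
    DISK_TO_DISK = "disk-to-disk"

def choose_mode(states):
    """Pick a connection mode from the processing states of the N tables."""
    if all(s is State.EXHAUSTED for s in states):
        return Mode.DISK_TO_DISK
    # Assumption: a mix of blocked and exhausted tables is treated like
    # "all inputs blocked", since no table can deliver new data.
    if all(s is not State.NON_BLOCKED for s in states):
        return Mode.DISK_TO_MEMORY
    return Mode.MEMORY_TO_MEMORY

print(choose_mode([State.BLOCKED, State.BLOCKED]))
print(choose_mode([State.BLOCKED, State.NON_BLOCKED]))
```

Re-evaluating `choose_mode` whenever a table changes state reproduces the switch back to the connection of the memory area to the memory area once a blocked table becomes non-blocked.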

The data of data table A stored in disk area a2 includes data i₃. Data i₃ is any data of data table A stored in disk area a2, and may specifically be data of data table A that was refreshed into the disk area. The disk hash table corresponding to data table B is stored in disk area b2 and contains one or more hash values, each of which represents one piece of data of data table B stored in disk area b2. Any hash value in the disk hash table may be a hash value written after the data in memory area b1 has been processed, or a hash value refreshed to disk area b2 together with its data. Merging the data of data table A stored in disk area a2 with the data of data table B stored in disk area b2 includes the following steps S31 to S33.

Step S31: If the determined connection mode is the connection of the disk area to the disk area, select disk area a2 and construct a memory hash table corresponding to data table A on disk area a2.

If the determined connection mode is the connection of the disk area to the disk area, the data of data table A stored in memory area a1 has been fully processed and no new data is being added. The data missed by the connection calculation under one or more of the earlier connection modes (the connection of the memory area to the memory area and the connection of the disk area to the memory area) can then be cleaned up. In one implementation, disk area a2 is selected arbitrarily from disk area a2 and disk area b2, and a memory hash table is then constructed on disk area a2. The constructed memory hash table contains the hash value of data i₃ as well as the hash values of the data written from memory area a1 to disk area a2.

Step S32: Based on the hash value of data i₃, detect the disk hash table corresponding to data table B to obtain the hash value of data j₃.

In one possible implementation, based on the hash value of data i₃, a hash value consistent with the hash value of data i₃ is searched for in the disk hash table corresponding to data table B stored in disk area b2, and the data corresponding to the found hash value is determined as data j₃. Data j₃ is the detected matching item that needs to be connected with data i₃.

Step S33: Merge the hash value of data i₃ with the hash value of data j₃.

Data of a data table can be represented in a memory area or a disk area by a hash value, and a mapping relation exists between each hash value and its corresponding data. Therefore, after the hash value of data j₃ is acquired, the hash value of data i₃ and the hash value of data j₃ can be merged. Based on the mapping between hash values and data, this amounts to merging data i₃ and data j₃ to obtain one piece of result data {i₃, j₃}.

The connection of the disk area to the disk area is similar to a hybrid hash join: the disk area of one data table is selected arbitrarily, a memory hash table is constructed on that disk area, the data in the disk areas of the other data tables is read, the hash values in the disk hash tables are detected, and result data is finally generated. In addition, when the connection calculation on the N data tables is performed according to the connection of the disk area to the disk area, memory can be reallocated for the connection calculation.
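A minimal sketch of the connection of the disk area to the disk area for two tables follows. It reads the description as: build a memory hash table over the selected disk area, then stream the other disk area past it. The function name and row layout are hypothetical, and a real implementation would read the disk areas block by block rather than from Python lists.

```python
def disk_to_disk_join(disk_rows_a, disk_rows_b, key):
    """Build a memory hash table over disk area a2, then probe it with b2."""
    # S31: construct a memory hash table over the selected disk area a2.
    memory_hash_a = {}
    for i3 in disk_rows_a:
        memory_hash_a.setdefault(hash(i3[key]), []).append(i3)

    # S32 and S33: read disk area b2, look up matching hash values, merge.
    results = []
    for j3 in disk_rows_b:
        for i3 in memory_hash_a.get(hash(j3[key]), []):
            if i3[key] == j3[key]:  # guard against hash collisions
                results.append((i3, j3))
    return results

disk_a2 = [{"k": "d", "t": "5-9"}, {"k": "z", "t": "2-11"}]
disk_b2 = [{"k": "d", "t": "3-7"}]
pairs = disk_to_disk_join(disk_a2, disk_b2, "k")
print(pairs)
```

Because the memory hash table is rebuilt from a disk area, the memory it needs can be allocated at this point, which matches the remark that memory can be reallocated for this connection mode.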

It will be appreciated that the connection calculation performed in the connection modes described in (one) to (three) above can serve as an optimization of hash join. In a hash join, the data table with the smaller data volume (the small table, such as data table A in the present application) is scanned first, and a memory hash table is built in the memory area using the connection key (that is, a hash value is calculated from the connection field). The data table with the larger data volume (the large table, such as data table B in the present application) is then scanned; each time a record of the large table is read, the memory hash table is detected once to find the rows that match. When a data table cannot be placed entirely in memory, it can be divided into parts: the part that does not fit in the memory area is written to the disk area, where a corresponding disk hash table exists, and data in the disk area can later be swapped into the memory area for the hash join.

During the connection calculation, to improve calculation efficiency and accuracy, it must be ensured that no repeated result data is generated. Under the connection of the disk area to the memory area and the connection of the disk area to the disk area, a connection calculation may be performed repeatedly and thus produce repeated result data; the repeated connection calculation is redundant calculation, and the repeated result data is redundant data. Thus, in one possible implementation, deduplication detection can also be performed on the connection calculation process, and the connection calculation result is output according to the deduplication detection result.

Deduplication detection on the connection calculation process works in two ways. On the one hand, repeated connection calculations can be detected, in which case the deduplication detection result indicates the repeated connection calculations. Based on this result, the connection calculation for given data is executed only once, which effectively avoids repeating the same connection calculation, so the output connection calculation result contains no redundancy. On the other hand, redundant data produced by repeated connection calculations can be detected, in which case the deduplication detection result indicates the repeated result data. The repeated result data is removed based on this result, retaining only one copy, to obtain the final output of the connection calculation.

In one possible embodiment, the determined connection mode includes the connection of the disk area to the memory area or the connection of the disk area to the disk area. While the connection calculation on the N data tables is performed according to either of these modes, deduplication detection can be performed on the connection calculation process so that repeated connection calculation is avoided. For ease of understanding, the following description takes any two of the N data tables, and any data in them, as an example. Any two data tables to be connected among the N data tables are denoted data table A and data table B, where data table A includes data i and data table B includes data j. Data i is any data in data table A, and data j is any data in data table B.

The deduplication detection of the connection calculation process may include the following steps (1) and (2).

Step (1): in the process of connection calculation, if the matching item to be connected with the data i is detected as the data j, the storage space relation between the data i and the data j is acquired.

The memory space relationship between data i and data j may be used to indicate that data i and data j are stored

Step (2): Based on the indication of the storage space relationship, perform deduplication processing on the connection calculation between data i and data j. The deduplication processing ensures that the connection calculation between data i and data j is performed only once and is not repeated.

Depending on what the storage space relationship indicates, there are the following two cases:

1. If the storage space relationship indicates that data j still exists in memory area b1 when data i is stored in memory area a1, deduplication processing is performed on the connection calculation between data i and data j.

According to the present application, deduplication detection may be based on timestamps: each piece of data in any data table is allocated at least one timestamp. The at least one timestamp includes a start timestamp arrival(T), or includes a start timestamp arrival(T) and an end timestamp departure(T). The start timestamp is the time at which the data enters the memory area, or the time at which the data is written directly to disk, and the end timestamp is the time at which the data leaves the memory area. Optionally, data stored in the memory area of its data table carries a start timestamp arrival(T), whereas data that has been written from the memory area into the disk area is stored in the disk area carrying both a start timestamp arrival(T) and an end timestamp departure(T).

In one implementation, data i carries a start timestamp and data j carries a start timestamp and an end timestamp. The starting time stamp of the data i is the time stamp of the data i entering the memory area, the starting time stamp of the data j is the time stamp of the data j entering the memory area, and the ending time stamp of the data j is the time stamp of the data j leaving the memory area.

Based on this, if the start timestamp carried by data i lies between the start timestamp and the end timestamp carried by data j, the storage space relationship between data i and data j indicates that, when data i is stored in memory area a1, data j still exists in memory area b1.

Specifically, if the start timestamp arrival(T_i) carried by data i lies between the start timestamp arrival(T_j) and the end timestamp departure(T_j) carried by data j (the matching item of data i), that is, arrival(T_i) > arrival(T_j) and arrival(T_i) < departure(T_j), then data j entered memory area b1 of data table B earlier than data i entered memory area a1 of data table A, and data j had not yet left memory area b1 when data i entered memory area a1. Neither data i nor data j had left its memory area, so the hash values of data i and data j both existed in their respective memory hash tables (that is, in the memory-resident portions of their hash tables at the same time). It can therefore be determined that, for data T_i of data table A being scanned and its detected matching item T_j, the connection calculation T_i ⋈ T_j has already been performed, where ⋈ denotes the connection calculation. The connection calculation T_i ⋈ T_j can thus be deduplicated so that it is performed only once.
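The timestamp comparison for this first case can be written as a one-line predicate. The function name below is hypothetical; the condition itself is the one stated above, arrival(T_i) > arrival(T_j) and arrival(T_i) < departure(T_j).

```python
def already_joined_in_memory(arrival_i, arrival_j, departure_j):
    """True if data i entered memory while data j was still memory resident,
    i.e. arrival(T_j) < arrival(T_i) < departure(T_j)."""
    return arrival_j < arrival_i < departure_j

# Data j resident in memory from 2 to 11; a hypothetical data i arriving at 5
# overlaps with it, so their connection calculation was already performed.
print(already_joined_in_memory(5, 2, 11))

# Data j resident from 3 to 7; data i arriving at 18 never overlapped with it.
print(already_joined_in_memory(18, 3, 7))
```

The comparison is strict on both sides, following the inequalities in the text; how ties between timestamps should be treated is not specified there.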

2. If the storage space relationship indicates that the data j still exists in the memory area b1 when the data i is written from the memory area a1 to the disk area a2, the deduplication process is performed on the connection calculation between the data i and the data j.

When the data i is written from the memory area a1 to the disk area a2, the data j still exists in the memory area b1, which indicates that the data i has performed the connection calculation, and the memory hash table stored in the memory area b1 has been detected, thereby determining that the data i and the data j have performed the connection calculation.

The processing partition of each data table stores a processing log, which records a reference timestamp of the latest data written in the corresponding processing partition and a detection timestamp (probe timestamp) at which the memory hash table stored in the corresponding memory area was detected. Optionally, the processing log may be stored in the disk area of the data table, and the detection timestamp is the latest timestamp at which the memory hash table stored in the memory area of the corresponding data table was detected. For example, if a connection calculation is performed at timestamp 19 according to the connection of the disk area to the memory area, the detection timestamp is 19. Based on the processing log, it is possible to track when the memory hash table in the corresponding memory area was detected under the connection of the disk area to the memory area, as well as the timestamp of the latest data in the processing partition of the corresponding data table.

In one implementation, if the reference timestamp recorded in the processing log of the data table a is greater than the start timestamp of the data j, or if the probe timestamp recorded in the processing log of the data table a is between the start timestamp and the end timestamp of the data j, the storage space relationship between the data i and the data j indicates that: when the data i is written from the memory area a1 to the disk area a2, the data j still exists in the memory area b 1.

Specifically, if the reference timestamp recorded in the processing log of data table A is greater than the start timestamp of data j, that is, last(part(T_i)) > arrival(T_j), then the connection calculation between data j and the data in the data block of disk area a2 where data i is located has already been performed. The join calculation need not be repeated, and the computer device can automatically skip the data in the data block where data i is located. If disk areas of further data tables are involved, a similar timestamp comparison can be performed to decide whether to skip the connection calculation with the data in the corresponding disk area.

If the probe timestamp probe(part(T_i)) lies between the start timestamp arrival(T_j) and the end timestamp departure(T_j) of data j, that is, probe(part(T_i)) > arrival(T_j) and probe(part(T_i)) < departure(T_j), then when data i was refreshed to the disk area and used to detect the memory hash tables in the memory areas of the other data tables, data j was still in its memory area and its memory hash table was detected during the connection calculation of the disk area to the memory area. The connection calculation between data i and data j would therefore be repeated, so deduplication processing is applied to it, and the connection calculation between data i and data j is performed once only.
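The two log-based conditions of this second case can be combined into one check, sketched below. The class and function names are hypothetical. Treating a missing end timestamp as "data j has not left memory yet" is an assumption, and the sentinel -1 for a partition that was never used for detection follows the convention of the worked example in this document. The values 20 and 18 in the first call mirror that example (data table S1 and data d with start timestamp 18).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessingLog:
    last: int        # last(part(T_i)): latest timestamp of data in the disk area
    probe: int = -1  # probe(part(T_i)); -1 means never used for detection

def skip_disk_block(log: ProcessingLog, arrival_j: int,
                    departure_j: Optional[int] = None) -> bool:
    """True if joining this disk block with data j would repeat earlier work."""
    if log.last > arrival_j:
        return True
    if log.probe == -1:
        return False
    # Assumption: no end timestamp means data j is still memory resident.
    still_resident = departure_j is None or log.probe < departure_j
    return log.probe > arrival_j and still_resident

print(skip_disk_block(ProcessingLog(last=20), arrival_j=18))
print(skip_disk_block(ProcessingLog(last=10, probe=19), 18, 25))
print(skip_disk_block(ProcessingLog(last=10), 18, 25))
```

The first call skips because last(part(T_i)) exceeds the start timestamp of data j, the second because the probe timestamp falls inside the residence interval of data j, and the third does not skip.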

When the storage space relationship indicates either that data j still exists in memory area b1 when data i is stored in memory area a1, or that data j still exists in memory area b1 when data i is written from memory area a1 to disk area a2, it can be determined that the connection calculation between data i and data j has already been performed.

Therefore, to avoid repeated connection calculation, the present application allocates a timestamp to each piece of data in a data table to record when the data is processed, and uses the processing log to record the timestamps at which the hash tables are detected. Redundant connection calculations can then be found by comparing the corresponding timestamps, which avoids repeated connection calculation and repeated results, saves computing resources, and improves calculation efficiency.

The connection modes introduced above, together with the connection calculation performed on the N data tables according to them, allow both memory overflow and redundant connection calculation to be handled well. The following example illustrates this.

Assume there are three data tables for connection calculation, namely data table S1, data table S2, and data table S3, each of which is a dynamic table. Each data table has a processing partition, and the memory area of each processing partition contains a buffer area of size 2, which can hold two pieces of data. Each piece of data is represented by its value (the field value) in the connection attribute. Each piece of data in a memory area carries a start timestamp arrival(T_i) indicating when the data entered the memory area; for example, x(12) indicates that data x entered the memory area at arrival(T_x) = 12. Each piece of data in a disk area carries a start timestamp arrival(T_i) and an end timestamp departure(T_i), from which the duration that the data resided in the memory area is known. For example, z(2-11) indicates arrival(T_z) = 2 and departure(T_z) = 11, so data z resided in the memory area for a duration of 9. In addition, each disk area includes a processing log containing the probe timestamp probe(part(T_i)) of the most recent memory hash table detection performed for the memory area. If probe(part(T_i)) = -1, the memory hash table in the memory area has not been used for detection. The processing log also contains the timestamp last(part(T_i)), that is, the latest timestamp of the data in the disk area. The data in the disk areas and memory areas of each data table is shown in fig. 3a. The unit of all timestamps in this example is hours (H).

Suppose two pieces of data arrive in data table S3, the first with value f and the second with value b. The first data f is assigned timestamp 15 based on its arrival time. Since the buffer area in the memory area of data table S3 can still hold data, the hash value of data f is received and inserted into the memory hash table of data table S3, and the matching items to be connected with data f are then detected in the memory hash tables of the other data tables. Each data table has a memory hash table used for detection in the connection calculation. Before execution begins, an execution engine in the computer device may define the order in which the memory hash tables are detected, for example starting from the hash table of a smaller data table. If the detection sequence is {S₁, S₂}, temporary result data {b(13), b(8)} may be generated; if the detection sequence is {S₂, S₃}, data f is present in the memory area, and temporary result data {f(14), f(15)} may be generated, as shown in fig. 3b.

Since the buffer area allocated in the memory area of data table S3 has size 2 (it holds only 2 pieces of data), the buffer area is full after data f is written, holding data f and data x, as shown in fig. 3c. If data b then arrives at data table S3, inserting the hash value of data b into the hash table of S3 would cause a memory overflow. To avoid this, data x and data f are written from the memory area of data table S3 to its disk area at timestamp 16, and the hash value of data b is then inserted into the hash table of data table S3 at timestamp 17. Since data table S3 is in the non-blocking state, the connection mode among the data tables is determined as the connection (join) of the memory area to the memory area, and the data b that arrived in the memory areas at different timestamps is merged to obtain result data {b(13), b(8), b(17)}; this procedure is shown in fig. 4a. The newly arrived data b is thus joined with the other hash tables to obtain result data.
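The overflow handling in this step can be replayed with a small sketch. The class `ProcessingPartition` and its fields are hypothetical. Two details are assumptions rather than statements of the application: that x(12) is the data x held in the memory area of S3, and that the insertion happens exactly one time unit after the flush (16, then 17, as in the example).

```python
class ProcessingPartition:
    """Memory area with a bounded buffer plus a disk area (hypothetical model)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.memory = []  # entries: (value, arrival)
        self.disk = []    # entries: (value, arrival, departure)

    def insert(self, value, now):
        """Insert a value; flush the memory area to disk first if it is full.
        Returns the timestamp at which the value entered the memory area."""
        if len(self.memory) >= self.capacity:
            # Stamp every memory-resident entry with its end timestamp.
            self.disk.extend((v, arrival, now) for v, arrival in self.memory)
            self.memory.clear()
            now += 1  # assumption: the insert follows one time unit later
        self.memory.append((value, now))
        return now

s3 = ProcessingPartition(capacity=2)
s3.insert("x", 12)
s3.insert("f", 15)
entered = s3.insert("b", 16)  # buffer full: flush at 16, insert b at 17
print(s3.disk, s3.memory, entered)
```

After the run, x and f sit in the disk area with residence intervals 12-16 and 15-16, and b is the only entry in the memory area, which is the state the later connection modes operate on.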

New data then continues to arrive at data table S3. Assume data d arrives and is allocated a start timestamp based on its arrival time. When data d is written into the memory area of data table S3, data table S3 changes from the non-blocking state to the blocking state, and the other data tables, such as data table S1, have been in the blocking state all along. The inputs of all data tables are now blocked, so the connection mode is determined as the connection of the disk area to the memory area. In this mode, the disk-area data of any one data table may be selected for connection. For example, the data in the disk area of data table S2 may be selected for the join: the memory hash tables of data table S1 and data table S3 in their memory areas are detected, and once matching items are found, the result data of the connection calculation is output. Joining the data d that arrived in the memory areas at different timestamps with the data d in the disk area yields result data {d(10), d(3-7), d(18)}. It will be appreciated that if the data in data table S1 is selected, the result data is {d(5-9), d(3-7), d(18)}.

Suppose that after data d(18) arrives, no data table receives new data and the data in every data table has been processed. Each data table can then be determined to be in a depletion state, the connection mode of the N data tables is determined as the connection of the disk area to the disk area, and the data missed in the data tables can finally be cleaned up in this mode. The corresponding processing is shown in fig. 4c. Connecting the data with the same value d yields two pieces of result data, {d(5-9), d(3-7), d(18)} and {d(10), d(3-7), d(18)}. However, since the result data {d(10), d(3-7), d(18)} was already calculated at the previous stage (under the connection of the disk area to the memory area), the calculation need not be repeated, and only one piece of result data, {d(5-9), d(3-7), d(18)}, is calculated. That the connection calculation for {d(10), d(3-7), d(18)} has already been performed can be determined by comparing data d(18) with the timestamps in the other partitions. Specifically, the processing partition where d(10) resided was a memory area, and at timestamp 19 the connection of the disk area to the memory area was performed to obtain result data {d(10), d(3-7), d(18)}. At timestamp 20, the data in the memory area of data table S1, including d(18), was written into the disk area of data table S1, and the data of data table S2 stored in its memory area was likewise written into the disk area of data table S2. In addition, d(18) was also written to the disk area of data table S3 at timestamp 19.
Because the timestamp last(part(T_i)) = 20 of data table S1 is greater than the start timestamp arrival(T_i) = 18 of d(18), the data in that disk area already underwent join calculation with data d(18) while it resided in the memory area. The join calculation need not be repeated, and that disk area can be skipped automatically during the connection calculation. Data in disk areas that has not yet been join-processed can then be connected. As shown in fig. 4c, at timestamp 21 the data d stored in the different disk areas with different timestamps is connected to obtain result data {d(5-9), d(3-7), d(18)}.

With the connection of the memory area to the memory area, the connection of the disk area to the memory area, and the connection of the disk area to the disk area, the connection mode can be chosen according to the processing states of the data tables, so that memory overflow is handled well. In addition, repeated connection calculation can be avoided during the connection calculation based on timestamps, which improves both the efficiency and the performance of the connection calculation.

In one embodiment, the connection calculation among the N data tables is performed according to the connection calculation logic among the N data tables; the following steps S41 to S43 may also be performed before the connection calculation is performed on the N data tables.

Step S41: Obtain connection indication information about the N data tables.

The connection indication information indicates the connection conditions under which the N data tables are connected. It may be obtained by first acquiring a query statement and then parsing the query statement. A query statement is a statement that specifies the content of a query. Classified by language, a query statement may be an SQL (Structured Query Language) statement, a DQL (Data Query Language) statement, or the like. Classified by source, a query statement may be a statement in a program written in advance by an object, or a statement input in real time. The application is not limited in this regard. A query statement in the present application may also be called a query command and can be used for a connection query, that is, a query that connects a plurality of tables through connection fields and connection conditions. Connecting a plurality of data tables means combining the data of two or more data tables according to a specific rule.

Step S42: Construct a connection tree based on the connection indication information.

The connection tree constrains the decision subspace so that a suitable connection order can be determined. The connection tree includes a root vertex and the vertices under a plurality of branches. If the connection tree includes two branches, it may also be called a binary tree; if it includes more than two branches, it is a multi-way tree. The branches of the connection tree are determined by the connection conditions between the data tables indicated in the connection indication information. The vertices of the data tables connected to the same data table may form one branch, and the connection conditions may be indicated by the association fields between the data tables.

For example, a query statement written in the SQL provided by Flink is: select * from A, B, C, D, E, F where A.key1=B.key and A.key2=C.key and D.key1=E.key and D.key2=F.key. The connection indication information is: where A.key1=B.key and A.key2=C.key and D.key1=E.key and D.key2=F.key.

Then, from A.key1=B.key and A.key2=C.key, it can be seen that data table B and data table C are each connected through a field of data table A. Data table B and data table C are therefore both associated with data table A, and the vertices representing data tables A, B, and C can be regarded as one branch of the connection tree. Similarly, from D.key1=E.key and D.key2=F.key, data table E and data table F are both associated with data table D, and the vertices representing data tables D, E, and F form the other branch. That is, the connection tree generated from the query statement is a binary connection tree with two branches: the first branch is DC1 = {A, B, C} and the second branch is DC2 = {D, E, F}.
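The grouping of tables into branches can be illustrated with a short sketch that works directly on the connection conditions. The function name `branches_from_where` is hypothetical, and the sketch assumes every condition has the simple form `X.field = Y.field` with the shared table on the left, as in the example; it is not a general SQL parser.

```python
import re

def branches_from_where(where_clause):
    """Group tables into branches keyed by the table on the left of each
    equality condition (assumes conditions shaped like X.f = Y.g)."""
    branches = {}
    pattern = r"(\w+)\.\w+\s*=\s*(\w+)\.\w+"
    for left, right in re.findall(pattern, where_clause):
        branch = branches.setdefault(left, [left])
        if right not in branch:
            branch.append(right)
    return list(branches.values())

where = "A.key1=B.key and A.key2=C.key and D.key1=E.key and D.key2=F.key"
print(branches_from_where(where))  # [['A', 'B', 'C'], ['D', 'E', 'F']]
```

The two lists correspond to the branches DC1 and DC2 of the example.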

In one embodiment, the connection tree may be constructed based on the connection conditions indicated by the connection indication information and the size of the data table, such that a tightly connected connection tree may be constructed. For any data table (such as data table a), if the data table to be connected with the any data table includes a plurality of data tables, the data amount of each data table can be obtained, then the connection calculation sequence between each data table and the any data table is determined based on the data amount of each data table, and then the vertexes representing each data table can be connected according to the connection calculation sequence, so as to construct and obtain the connection tree. The data amount of the data table is a numerical value for measuring the size of the data table, and the data amount can be the number of rows and columns of the data in the data table or the size of the storage space occupied by the data. The application is not limited in this regard. For the data table with smaller data volume, the connection calculation can be performed with any data table (such as data table A) preferentially, and then the data table with larger data volume is connected with the calculation result obtained by the connection calculation of any data table.

For example, if the data table to be connected to the data table a includes the data table B and the data table C in the above example, the data amounts of the data table B and the data table C may be compared to determine that the data amount of the data table B is smaller than the data amount of the data table C, so that the data table B with the smaller data amount may be preferentially connected to the data table a for calculation. Then in the connection tree, namely: the vertex representing data table A is connected with the same M1 vertex to represent that the intermediate result table is stored in the M1 vertex after connection calculation is performed on the data table A and the data table B, then the vertex representing data table C is connected with the M1 vertex representing the intermediate result table by the same M3 vertex to represent that the connection calculation is performed on the data table C and the intermediate result table to obtain another intermediate result table. The connection tree constructed based on the above-described manner is as shown in fig. 5 a.
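The size-based ordering within one branch can be sketched as follows. The function name, the tuple encoding of intermediate vertices, and the sample data volumes are all hypothetical; the sketch only shows the idea that smaller tables are connected first and each intermediate result feeds the next connection.

```python
def build_branch(anchor, partners, sizes):
    """Return a nested tuple tree in which the partner tables are connected
    to the anchor table in ascending order of data volume."""
    node = anchor
    for table in sorted(partners, key=lambda t: sizes[t]):
        # Each tuple stands for an intermediate result vertex.
        node = ("join", node, table)
    return node

# Hypothetical data volumes: B is smaller than C, so A is connected with B first.
tree = build_branch("A", ["C", "B"], {"B": 100, "C": 5000})
print(tree)  # ('join', ('join', 'A', 'B'), 'C')
```

The inner tuple plays the role of the intermediate result table of A and B, and the outer tuple the role of the vertex that connects that intermediate result with C.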

A single multi-table connection calculation proceeds from the tail nodes of the connection tree toward the root vertex; when the calculation reaches the root vertex, the calculation result is the final result of the multi-table connection. A tail node is the last vertex in a dependency path, and in the application every tail node is a vertex representing a data table. A connection tree contains only one root vertex, so that the calculation result is unique. If the connected data tables contain a dynamic table, the multi-table connection calculation can be performed according to the connection tree whenever the dynamic table changes; since a multi-table query based on a dynamic table is a continuous query process, the corresponding multi-table calculation is also an iterative process.

S43, analyzing the connection tree, and determining connection calculation logic among the N data tables.

The connection calculation logic comprises the calculation logic of connection-free calculation (connection calculations that do not depend on one another) and the calculation logic of connection-dependent calculation; in the process of performing connection calculation on the N data tables, the connection-free calculations are executed in parallel according to the connection calculation logic, and the connection-dependent calculations are executed serially.

In the present application, the connection tree represents the connection calculation logic, which describes the steps by which connection calculation is performed among the N data tables. The connection calculation logic corresponds to an execution plan, that is, a logical plan by which the final connection calculation result is computed step by step. The connection calculation logic represented by the connection tree may be used directly as the connection calculation logic among the N data tables. Alternatively, in the process of analyzing the connection tree, the connection tree can be optimized to obtain an optimized connection tree, and the connection calculation logic represented by the optimized connection tree is then determined as the connection calculation logic among the N data tables. In this way, connection calculation with higher parallelism can be performed, and the efficiency of connection calculation is improved.

The connection tree constructed above includes a root vertex and multiple branches. Under each branch, a vertex representing the result table of a connection calculation depends on two vertices: a left sub-vertex and a right sub-vertex. Both of these vertices may represent data tables, or one vertex may represent a data table and the other an intermediate result table. Specifically, the optimization of the connection tree may comprise the following steps 1-2.

Step 1, traversing each vertex in the connection tree, and taking the traversed vertex as the current vertex.

As for the traversal logic, from the perspective of the vertices, the traversal starts from the root vertex and proceeds from the left sub-vertex to the right sub-vertex. For example, the connection tree described above is traversed in the following order: vertex M5 → vertex M3 → vertex M1 → vertex A → vertex B → vertex C → vertex M4 → vertex F → vertex M2 → vertex D → vertex E. From the perspective of the branches, the vertices in the left branch of the connection tree are traversed first and the vertices in the right branch are traversed afterwards, and within each branch the left sub-vertex is traversed before the right sub-vertex.
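The traversal order above corresponds to a pre-order traversal (vertex, then left subtree, then right subtree). The following is a minimal sketch under stated assumptions: the `Vertex` class and `preorder` function are hypothetical names, and the shape of the right branch (M4 depending on F and M2, M2 depending on D and E) is inferred from the traversal order rather than stated explicitly in the text.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Vertex:
    name: str
    left: Optional["Vertex"] = None
    right: Optional["Vertex"] = None


def preorder(v: Optional[Vertex]) -> Iterator[str]:
    """Visit the current vertex, then its left sub-vertex, then its right sub-vertex."""
    if v is None:
        return
    yield v.name
    yield from preorder(v.left)
    yield from preorder(v.right)


# Assumed tree shape, reconstructed from the traversal order given in the text.
m1 = Vertex("M1", Vertex("A"), Vertex("B"))
m3 = Vertex("M3", m1, Vertex("C"))
m2 = Vertex("M2", Vertex("D"), Vertex("E"))
m4 = Vertex("M4", Vertex("F"), m2)
root = Vertex("M5", m3, m4)

order = list(preorder(root))
```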

Step 2, if there is an association between the data table represented by the right sub-vertex and the data table represented by the left sub-vertex of the current vertex, all of the sub-vertices may be used to replace the current vertex. An association exists when the two data tables share an association key, for example: the association key in the data table represented by the right sub-vertex is taken from the data table represented by the left sub-vertex, or the association key in the data table represented by the left sub-vertex is taken from the data table represented by the right sub-vertex.

Illustratively, the connection condition when connecting data table C is "a.key2=c.key", that is, the association key of data table C is taken from the key2 column of data table A. The merge requirement is satisfied, so the two vertices A and B can be used instead of the M1 vertex. Here the current vertex is the M1 vertex, the right sub-vertex of the M1 vertex is the B vertex, and the left sub-vertex of the M1 vertex is the A vertex. After the structure of the connection tree is updated in this manner, the updated connection tree shown in (1) of fig. 5b is obtained. The same operation can be performed for the other branch, yielding a multidimensional connection tree.

It will be appreciated that if there is no association between the data table represented by the right sub-vertex and the data table represented by the left sub-vertex of the current vertex, the original connection is maintained. When there is no association between the data tables corresponding to vertices of different branches of the connection tree, the connection calculation can use a Cartesian product operation. For example, as shown in fig. 5a, when merging of the data tables represented by the A vertex and the D vertex is considered, the connection indication information shows that there is no association key between the two data tables, so a Cartesian product operation is performed between data table A and data table D. The merging condition is not satisfied, and the original connections from the A vertex and the D vertex to the root vertex in the connection tree are retained. Finally, the optimized connection tree shown in (2) of fig. 5b is obtained.

After the above processing is performed on the entire connection tree, an optimized connection tree, which is a multidimensional connection tree, is obtained. In the multidimensional connection tree, if a vertex depends on at least three sub-vertices, the intermediate result table is obtained by connection calculation of the data tables represented by those sub-vertices, and the sub-vertices comprise one fact table and at least two dimension tables. This is because the connection indication information describes the connection conditions between data tables in the form: the association key in one data table is taken from a certain column of another data table. Based on this association between the fact table and the dimension tables through association keys, the fact table and the dimension tables can be identified. Illustratively, a vertex depends on n (n > 2) sub-vertices, and the data tables represented by the sub-vertices are (T1, T2, …, Tn), where data table T1 is the fact table and data tables T2 to Tn are the dimension tables in the connection calculation.
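One possible reading of optimization steps 1-2 is sketched below for illustration. The `Node` class, the `flatten` function, and the `links` set are hypothetical. The merge rule used here (an intermediate sub-vertex is replaced by its own sub-vertices when a sibling data table shares an association key with one of the tables beneath it) is an assumption chosen to reproduce the result described for (2) of fig. 5b, not the definitive rule of the application.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


def flatten(node: Node, links: set[frozenset[str]]) -> Node:
    """Lift the children of an intermediate vertex into its parent when allowed."""
    kids = [flatten(c, links) for c in node.children]
    # Sibling vertices that represent plain data tables (no children of their own).
    sibling_tables = [k.name for k in kids if not k.children]
    merged: list[Node] = []
    for k in kids:
        inner = [g.name for g in k.children if not g.children]
        linked = any(frozenset((s, t)) in links
                     for s in sibling_tables for t in inner)
        if k.children and linked:
            merged.extend(k.children)  # replace the vertex by its sub-vertices
        else:
            merged.append(k)           # keep the original connection
    return Node(node.name, merged)


# Assumed association keys: A-B, A-C, F-D, F-E. There is none across the two branches.
links = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("F", "D"), ("F", "E")]}
tree = Node("M5", [
    Node("M3", [Node("M1", [Node("A"), Node("B")]), Node("C")]),
    Node("M4", [Node("F"), Node("M2", [Node("D"), Node("E")])]),
])
optimized = flatten(tree, links)
```

Under these assumptions, M3 ends up depending directly on A, B, and C, M4 on F, D, and E, and the root M5 still depends on M3 and M4 because no association key links the two branches.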

Of course, if a vertex relies on only two sub-vertices, then it is indicated that intermediate results may be obtained by a chain-link or Cartesian product operation (also known as cross-link). Illustratively, the M3 vertex and the M4 vertex as shown in fig. 5b above are obtained by a cartesian product operation.

In one implementation, after the optimized connection tree is obtained, the connection calculation logic represented by the optimized connection tree may be used as the connection calculation logic among the N data tables, and the calculation logic of connection-free calculation and the calculation logic of connection-dependent calculation may be further determined. Connection-free calculation refers to a plurality of connection calculations that have no dependency on one another, and connection-dependent calculation refers to a plurality of connection calculations that have a dependency. A dependency is a dependency on the result of a connection calculation, for example: the connection calculation of intermediate result table M3 and intermediate result table M4 depends on the result of the connection calculation of data tables A, B, and C and on the result of the connection calculation of data tables D, E, and F. Optionally, the connection-free calculations may be passed to the distributed processing engine and executed in parallel on a cluster of nodes provided by the distributed processing engine, while the connection-dependent calculations are executed sequentially.

In one implementation, the N data tables include at least one fact table and a plurality of dimension tables associated with each fact table; the connection-free calculation includes: connection computation between any fact table and each associated dimension table, and connection computation between different fact tables and associated multiple dimension tables.

The connection calculation between a fact table and each of its dimension tables (a local connection) is a connection-free calculation and can be performed in parallel. The connection calculations indicated by different branches of the connection tree are also connection-free calculations, whereas vertices with a parent-child relationship inside a branch correspond to connection-dependent calculations. After the intermediate result tables are obtained by the local connection calculations, a connection calculation (e.g., a Cartesian product operation) may be performed on the intermediate result tables of different branches to obtain the connection calculation result (i.e., the join result). Illustratively, in the connection calculation logic indicated in (2) of fig. 5b, data table A and data table F are fact tables, and data tables B, C, D, and E are dimension tables. The connection calculation between data table A and data table B and the connection calculation between data table A and data table C may be performed in parallel; similarly, the connection calculation between data table F and data table D and the connection calculation between data table F and data table E may be performed in parallel. In addition, the calculation that produces intermediate result table M3 and the calculation that produces intermediate result table M4 may be performed in parallel with each other, whereas the connection calculation that produces the final result table M5 must be executed serially after them.
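For illustration, the split between parallel and serial execution can be sketched as a stage-by-stage schedule: every vertex whose inputs are already available is computed in parallel, and vertices that depend on those results wait for the next stage. The `join` stand-in, the `execute` function, and the `plan` dictionary are hypothetical; a thread pool is used here only as a stand-in for the cluster of a distributed processing engine.

```python
from concurrent.futures import ThreadPoolExecutor


def join(name: str, inputs: list[str]) -> str:
    """Stand-in for a real connection calculation: records what was joined."""
    return f"{name}<-{'+'.join(inputs)}"


def execute(plan: dict[str, list[str]], base_tables: set[str]) -> list[list[str]]:
    """Run the plan in stages and return the vertices computed in each stage."""
    done = set(base_tables)
    pending = dict(plan)
    stages: list[list[str]] = []
    with ThreadPoolExecutor() as pool:
        while pending:
            # Vertices whose inputs are all available do not depend on each other.
            ready = sorted(v for v, deps in pending.items()
                           if all(d in done for d in deps))
            if not ready:
                raise ValueError("plan has a cycle or a missing input")
            futures = [pool.submit(join, v, pending[v]) for v in ready]
            for f in futures:
                f.result()  # wait: the next stage depends on these results
            stages.append(ready)
            done.update(ready)
            for v in ready:
                del pending[v]
    return stages


# Assumed plan matching the optimized tree: M3 and M4 are independent, M5 needs both.
plan = {"M3": ["A", "B", "C"], "M4": ["F", "D", "E"], "M5": ["M3", "M4"]}
stages = execute(plan, {"A", "B", "C", "D", "E", "F"})
```

With this plan, M3 and M4 fall into the first (parallel) stage and M5 into the second (serial) stage.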

Therefore, the scheme can optimize the parallelism of the connections between data tables based on the tree model of the connection tree (or the optimized connection tree), fully considers the associations between tables, parallelizes the connection-free calculations to the greatest extent, and improves the overall performance of the multi-table connection operation. In addition, on the basis of the constructed multidimensional connection tree, local connections can be thoroughly mined and the connection calculations between the fact table and each dimension table can be parallelized, which greatly improves the efficiency of multi-table connection.

In the process of connection calculation between the fact table and the dimension tables, the data in the fact table must be transmitted through network I/O (input/output) multiple times before the join calculation between the fact table and each dimension table can be completed, which affects the performance of the whole connection calculation. Thus, the present application provides a new semi-join algorithm to reduce network I/O overhead. The connection calculation between any fact table and its associated dimension tables uses a hierarchical semi-connection, which means: the fact table is split into a plurality of local fact tables according to its associations with the dimension tables, and connection calculation is then performed on the local fact tables and the corresponding dimension tables to obtain connection result tables. In one embodiment, the calculation of the connections between the fact table and the respective dimension tables may include the following steps (1)-(4).

Step (1), generating a local fact table for each dimension table according to the fact table and each dimension table associated with the fact table.

Each dimension in the fact table may have a dimension table associated with it, and the fact table and the dimension table may be associated by a foreign key field. Therefore, for any dimension table, the foreign key field and the primary key field of the fact table associated with any dimension table can be obtained, and then the local fact table of any dimension table is generated according to the foreign key field and the primary key field of the fact table. It will be appreciated that for each dimension table, a corresponding local fact table can be generated in the manner described above. For example, 3 local fact tables may be generated from 1 fact table and 3 dimension tables, and different local fact tables include the same fact table's primary key field and different foreign key fields to correspond to different dimension tables.

In a specific implementation, the computer device may parse the query statement to obtain an execution plan corresponding to the query statement, and the execution plan may be parsed by a distributed processing engine (such as Flink) to obtain the foreign key field that associates the fact table with the dimension table. Specifically, the local fact table may be generated by using a "select" function provided by the distributed engine to select the parsed foreign key field and the primary key field of the fact table.

Illustratively, the query statement is "select sum(A.fKey) from A, B where A.fKey = B.key". Parsing shows that fact table A is associated with dimension table B through the fKey field, so the generated local fact table T contains two fields: fKey and key (the primary key of the fact table), where fKey is the foreign key field and key is the primary key field of the fact table. The primary key field of the fact table may be a compound key field containing multiple fields, and the foreign key field may be one of the fields in the compound key field.
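As a minimal in-memory sketch of step (1), a local fact table can be seen as a projection of the fact table onto its primary key field and one foreign key field. The function name `local_fact_table`, the row representation as dictionaries, and the sample values are hypothetical; in the application this selection is performed by the distributed engine.

```python
def local_fact_table(fact_rows: list[dict], primary_key: str,
                     foreign_key: str) -> list[dict]:
    """Keep only the primary key field and one foreign key field of each row."""
    return [{primary_key: row[primary_key], foreign_key: row[foreign_key]}
            for row in fact_rows]


# Fact table A with primary key `key`, foreign key `fKey`, and a payload column.
fact_a = [
    {"key": 1, "fKey": 10, "amount": 5.0},
    {"key": 2, "fKey": 20, "amount": 7.5},
]
local_t = local_fact_table(fact_a, "key", "fKey")
```

Because the payload columns are dropped, each local fact table is much narrower than the fact table it is derived from.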

In addition, the distributed engine may provide an execution plan optimization function, namely: the fact table is typically used as the left table in connection calculations, and if the connection calculations between the fact table and a plurality of dimension tables use the same association key, the execution plan optimizer provided by the distributed engine can chain these connection calculations together and run them on one processing node. Thus, the select computation and the aggregate function sum in the example above may be combined and executed on one processing node without generating any network data transmission.

Step (2), generating a connection result table according to the dimension table and the local fact table.

In step (1), a local fact table required for the connection may be generated for each dimension table. The connection calculation is performed between each dimension table and the corresponding local fact table to obtain a connection result table, and the connection calculation is performed between different dimension tables and the corresponding local fact table in parallel. For example, 3 dimension tables (including D1, D2, and D3) and 3 local fact tables (including T1, T2, and T3) may be subjected to connection calculation to obtain 3 connection result tables, and connection calculation between D1 and T1, connection calculation between D2 and T2, and connection calculation between D3 and T3 may be performed in parallel, so that the efficiency of connection calculation may be improved, and the speed of obtaining the final lookup result table may be increased.

Since a local fact table contains only the foreign key field used to connect the designated dimension table and the primary key field of the fact table, the data volume of each local fact table is small. Thus, in one implementation, the local fact table and the dimension table may be transmitted over the network according to the association field (i.e., the foreign key field): data whose association field values are equal is sent through network I/O to the same processing node in the distributed processing engine for connection calculation, which ensures that each piece of data goes through network I/O only once.

Optionally, because the data volumes of the local fact table and the dimension table are small, the network transmission can use hash-based transmission, which avoids copy broadcasting of the local fact table and effectively reduces the transmission overhead and the amount of data sent over the network. Specifically, the hash value of the association field (the foreign key field) in the local fact table is calculated and taken modulo the preset parallelism, and the data is sent to the designated processing node in the distributed processing engine; likewise, the hash value of the corresponding key field in the dimension table is calculated and taken modulo the parallelism, and the data is sent to the designated processing node. Data with the same hash value is thus sent to the same processing node in the distributed processing engine for connection calculation. The parallelism may be configured by default by the distributed processing system or set independently by the object (user) as required, and it limits the number of parallel data transmissions.

Illustratively, when the connection calculation is performed, the hash value of the fKey field in the local fact table T is calculated and taken modulo the parallelism, and the data is sent to the designated processing node; the hash value of the key field in the dimension table D is likewise taken modulo the parallelism, and the data is sent to the designated processing node. Rows whose fKey field in the local fact table T equals the key field in the dimension table D are therefore sent to the same processing node, which ensures that each piece of data in each table goes through network I/O only once. For example, the data with fKey=2 in the local fact table T and the data with key=2 in the dimension table D have the same hash value, and data with the same hash value is sent to the same processing node for connection calculation.
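The hash-and-modulo routing described above can be sketched as follows. The names `node_for` and `partition` are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash for the sketch; the hash function actually used by the distributed processing engine is not specified in the text.

```python
import zlib


def node_for(value, parallelism: int) -> int:
    """Map an association-key value to a processing node index."""
    # crc32 is deterministic across runs, unlike the built-in hash() for strings.
    return zlib.crc32(repr(value).encode("utf-8")) % parallelism


def partition(rows: list[dict], key_field: str, parallelism: int) -> dict[int, list[dict]]:
    """Route every row to exactly one node according to its key field."""
    buckets: dict[int, list[dict]] = {n: [] for n in range(parallelism)}
    for row in rows:
        buckets[node_for(row[key_field], parallelism)].append(row)
    return buckets


# Local fact table T is routed by fKey, dimension table D by key.
t_rows = [{"key": 1, "fKey": 2}, {"key": 2, "fKey": 3}, {"key": 3, "fKey": 2}]
d_rows = [{"key": 2, "name": "x"}, {"key": 3, "name": "y"}]
t_parts = partition(t_rows, "fKey", parallelism=4)
d_parts = partition(d_rows, "key", parallelism=4)
```

Rows of T with fKey=2 and the row of D with key=2 land in the same bucket, so they can be joined locally on that node, and each row is sent exactly once.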

In one embodiment, each connection result table needs to be connected with the fact table. The connection mode used for the connection calculation between each connection result table and the fact table is the same as the connection mode used for the connection calculation of the N data tables, and duplicate detection needs to be performed during the connection calculation before the connection calculation result is output. In the process of obtaining the connection result table of each dimension table, the connection calculation between the dimension table and the local fact table can be carried out in the distributed engine. During this connection calculation, the processing memory may overflow, which can lead to repeated connection calculation and similar problems. Therefore, the connection calculation between the dimension table and the local fact table can be performed in the connection mode determined based on the processing states of the N data tables, thereby avoiding these problems.

Step (3), performing connection calculation on each connection result table and the fact table according to the primary key field of the fact table to obtain an intermediate result table.

The connection result table of any dimension table obtained through the above steps contains the query field of that dimension table (namely, the foreign key field) and the primary key field of the fact table. The fact table and the respective connection result tables may be chain-connected through the primary key of the fact table to obtain an intermediate result table. If there are a plurality of fact tables, there are a plurality of intermediate result tables, each corresponding to one fact table. A distributed processing engine (e.g., the Flink engine) may be invoked to perform the connection calculation.

In one embodiment, because the association keys of all connections in the chain-linked computation of the fact table and the connection result tables are the same, the execution plan optimization function provided by the distributed processing engine may link these connection computations to be executed in one node, thereby reducing network traffic and making the connection computation more efficient.

Step (4), generating a connection calculation result according to the intermediate result table.

If there is only one fact table, the intermediate result table may be used as the connection calculation result. If there are a plurality of fact tables, connection calculation can be performed on the intermediate result tables corresponding to the fact tables to obtain the final connection calculation result. Illustratively, fig. 5c is a schematic diagram of the connection calculation between two dynamic tables. Each dynamic table is a fact table. Dynamic table A is split into k local fact tables according to its associations with k dimension tables; the k local fact tables are then connected with the k dimension tables respectively to obtain k connection result tables, and each connection result table is then connected with the fact table to obtain an intermediate result table. An intermediate result table is obtained for dynamic table B by the same process, and finally a Cartesian product is computed on the 2 intermediate result tables to obtain the final result table. It can be seen that, for each dynamic table, the corresponding local dynamic tables (i.e., local fact tables) can be obtained according to the association keys between the dynamic table and the dimension tables, and the connection calculations between the local fact tables and the dimension tables can then be performed in parallel; maximizing the parallelism of the connection calculation greatly improves the calculation efficiency. For the semi-connection of any dynamic table, see fig. 5d: when k is 4, the 4 dimension tables and the 4 local dynamic tables can be connected in parallel to obtain the corresponding connection result tables, and the connection result tables are then connected to obtain the final connection calculation result.
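Steps (1)-(4) for a single fact table can be sketched end to end on in-memory rows as follows. This is a simplified illustration, not the implementation of the application: the function `hierarchical_semi_join`, the table contents, and the field names are hypothetical, the joins are assumed to be inner equi-joins on unique keys, and a thread pool stands in for the processing nodes of the distributed engine.

```python
from concurrent.futures import ThreadPoolExecutor


def hierarchical_semi_join(fact: list[dict], pk: str,
                           dims: dict[str, tuple[list[dict], str, str]]) -> list[dict]:
    """dims maps a foreign key field to (dimension rows, dimension key, wanted field)."""
    # Step (1): one local fact table (primary key, foreign key) per dimension table.
    local_tables = {fk: [(row[pk], row[fk]) for row in fact] for fk in dims}

    # Step (2): join each local fact table with its dimension table, in parallel.
    def join_dim(fk: str) -> tuple[dict, str]:
        dim_rows, dim_key, wanted = dims[fk]
        index = {d[dim_key]: d[wanted] for d in dim_rows}
        result = {p: index[v] for p, v in local_tables[fk] if v in index}
        return result, wanted  # connection result table keyed by fact primary key

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(join_dim, dims))

    # Step (3): chain-join the result tables with the fact table on the primary key.
    intermediate = []
    for row in fact:
        if all(row[pk] in res for res, _ in results):
            merged = dict(row)
            for res, wanted in results:
                merged[wanted] = res[row[pk]]
            intermediate.append(merged)

    # Step (4): with a single fact table, the intermediate table is the result.
    return intermediate


fact = [
    {"id": 1, "cust": 10, "prod": 100},
    {"id": 2, "cust": 11, "prod": 101},
    {"id": 3, "cust": 99, "prod": 100},  # no matching customer
]
customers = [{"cid": 10, "cname": "Ann"}, {"cid": 11, "cname": "Bo"}]
products = [{"pid": 100, "pname": "pen"}, {"pid": 101, "pname": "ink"}]
result = hierarchical_semi_join(
    fact, "id",
    {"cust": (customers, "cid", "cname"), "prod": (products, "pid", "pname")},
)
```

Only the narrow (primary key, foreign key) pairs take part in the dimension joins; the full fact rows are touched once, in the final join on the primary key.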

In the case that the data table represented by each node in the optimized connection tree includes a plurality of fact tables and at least two dimension tables associated with each fact table, for connection calculation between any fact table and at least two dimension tables, connection calculation can be performed by adopting the above-mentioned hierarchical semi-connection mode, so that more efficient connection calculation can be performed with less network I/O cost. Thus, intermediate result tables equal in number to the fact tables can be obtained, and then the connection calculation (such as Cartesian product) is performed on the plurality of intermediate result tables to obtain a final connection calculation result.

In one possible embodiment, the computer device may invoke the distributed processing engine to perform the connection calculations in parallel. The distributed processing engine is a processing engine that supports distributed parallel computing and can be used to process stream data, where stream data refers to continuously generated data such as transaction data, online shopping data, and video data. The distributed processing engine in the application can be Flink, an open-source distributed stream data processing engine that can execute arbitrary stream data programs in a data-parallel and pipelined manner; it is a real-time computing engine widely applied in the field of large-scale real-time data processing. With the data processing method of this scheme, a better connection order can be found, and the multi-table connection calculation can be parallelized as much as possible by exploiting the characteristics of the distributed cluster. In the process of connecting a plurality of data tables, the dependency relationships of the association keys among the data tables are fully considered, local table connections are identified, independent connection calculations are parallelized, and the connection calculations between the fact table and the dimension tables are parallelized, so that the time of the multi-table connection is reduced. By combining with the distributed processing engine Flink, the scheme can make full use of the performance advantages of Flink's lightweight, thread-based computing model, improve the performance of the distributed processing engine in connection calculation over a plurality of data tables, and accelerate data processing and analysis.

In the chain connection, the fact table and the connection result table optimized by the half-connection algorithm are connected and calculated, the steps (1) to (4) are serial, and the connection between the local fact table and each dimension table in the step (2) can be calculated in parallel, so that the time cost calculation formula of the connection is expressed as follows:

$$Cost^{localFact} = \max(C_1, C_2, \ldots, C_n)$$

$$Cost^{dimJoin} = \max(R_1, R_2, \ldots, R_n)$$

$$Cost = Cost^{localFact} + Cost^{dimJoin} + Cost^{finalJoin}$$

wherein: $C_1, C_2, \ldots, C_n$ represent the time costs of selecting, from the fact table, the local fact tables to be connected with each dimension table; since the select computations are performed in parallel, the time cost of this stage depends on the slowest select computation. $R_1, R_2, \ldots, R_n$ represent the time costs of the connection calculations between each local fact table and its dimension table; these connection calculations are performed in parallel in the cluster, so the time cost of this stage likewise depends on the slowest connection calculation. Finally, the time cost of the whole connection calculation is the sum of the time costs of the three serial stages. As can be seen from fig. 5b, the Flink-based semi-connection optimization algorithm mainly comprises two types of connection operations: connections between local fact tables and dimension tables, and connections between the fact table and the connection result tables. For the sake of computing and storage performance, the connection calculation may use a hash connection (i.e., hash join), and each connection performs a network data transmission operation, so the network I/O cost calculation formula is expressed as:

$$Cost = Cost^{dimJoin} + Cost^{finalJoin}$$

Wherein: L[i] represents the size of the i-th local fact table; D[i] represents the size of the i-th dimension table; R[i] represents the size of the i-th connection result table; F represents the size of the fact table. The connections between the local fact tables and the dimension tables, and the connections between each connection result table and the fact table, are performed on different processing nodes. Therefore, when the connection result tables are generated, a network transmission operation must be performed on the local fact tables and the dimension tables that participate in the connection. According to the principle of distributed computing, the connection result tables obtained from connections between local fact tables and dimension tables distributed on different processing nodes must be retransmitted and reorganized in order to compute the final result table, so a network transmission operation must also be performed on the fact table and each connection result table when the final query result table is generated. The total data volume of the network transmission operations required by the whole connection is the sum of the data volumes transmitted in the two stages.
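The two cost models can be sketched numerically as follows. The time cost follows the formulas given above. For the network I/O cost, the per-stage expressions are not reproduced in the text, so the sketch assumes, based only on the prose, that the first stage transmits every local fact table and every dimension table (sum of L[i] and D[i]) and that the second stage transmits the fact table and every connection result table (F plus the sum of R[i]). The function names and all numbers are hypothetical.

```python
def time_cost(select_costs: list[float], dim_join_costs: list[float],
              final_join_cost: float) -> float:
    """Three serial stages; each parallel stage costs as much as its slowest task."""
    return max(select_costs) + max(dim_join_costs) + final_join_cost


def network_volume(L: list[int], D: list[int], R: list[int], F: int) -> int:
    """Total transmitted data volume under the assumptions stated above."""
    dim_join = sum(L) + sum(D)   # assumed: local fact tables + dimension tables
    final_join = F + sum(R)      # assumed: fact table + connection result tables
    return dim_join + final_join


total_time = time_cost([1.0, 2.5, 1.5], [4.0, 3.0, 3.5], 2.0)
total_bytes = network_volume(L=[10, 20], D=[5, 5], R=[8, 12], F=100)
```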

The whole process implements an execution plan optimization algorithm based on hierarchical semi-connection. When the connection between a fact table and a dimension table is evaluated, network transmission operations are performed only on the association key fields of the fact table, the data of the dimension table, and the connection result tables generated by the connection. Connection calculation is then performed with the primary key field of the fact table as the association condition, and, based on the optimized execution plan, the connection operations of the plurality of connection result tables are chained together and executed on a single processing node. The data volume requiring network transmission in the whole connection process is thereby greatly reduced, the network I/O overhead is effectively reduced, and the connection speed of a plurality of data tables is improved.

In the scheme, the parallelism of the data table connection is optimized based on the tree model, and the data quantity required to be transmitted by a network when the data tables are connected in parallel is reduced to the maximum extent, so that the overall performance of connection calculation is improved. The data tables capable of being connected and calculated in parallel are connected and calculated in parallel as much as possible, and the parallelism of the connection and calculation is improved, so that the connection and calculation can be carried out more efficiently.

The data processing apparatus according to the embodiment of the present application is described below.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus may be provided in a computer device provided in an embodiment of the present application, and the computer device may be a terminal device mentioned in the foregoing method embodiment. The data processing means shown in fig. 6 may be a computer program (comprising program code) running in a computer device, which data processing means may be used to perform some or all of the steps of the method embodiment shown in fig. 2. Referring to fig. 6, the data processing apparatus may include the following units:

an obtaining unit 601, configured to obtain N data tables to be connected, and obtain processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of the corresponding data table and comprises a memory area and a disk area; wherein N is an integer greater than 1;

a processing unit 602, configured to determine a connection manner of the N data tables according to a processing state of the N data tables; the connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area;

the processing unit 602 is further configured to perform connection calculation on the N data tables according to the determined connection manner.

In one embodiment, the processing state includes any one of a blocking state, a non-blocking state, and a depletion state; the blocking state is a state in which newly added data of the data table cannot be written into the corresponding memory area; the non-blocking state is a state in which newly added data of the data table can be written into the corresponding memory area; the depletion state is a state in which no new data is added to the data table and the data of the data table stored in the corresponding memory area has been processed.

In one embodiment, any one of the N data tables is denoted as data table A; the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table; a memory hash table corresponding to the data table A is stored in the memory area a1; the memory hash table contains one or more hash values, and each hash value represents one piece of data of the data table A stored in the memory area a1; the newly added data of the data table A is data i₁; the obtaining unit 601 is specifically configured to:

when the hash value of data i₁ is received, detecting the remaining memory space of the memory area a1;

if the remaining memory space of the memory area a1 is greater than or equal to the storage space required by the hash value of data i₁, determining that the processing state of the data table A is a non-blocking state;

if the remaining memory space of the memory area a1 is smaller than the storage space required by the hash value of data i₁, determining that the processing state of the data table A is a blocking state.
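The space check above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the names `ProcessingState` and `state_on_arrival` are hypothetical, and sizes are assumed to be measured in bytes.

```python
from enum import Enum


class ProcessingState(Enum):
    """The three processing states named in the text."""
    NON_BLOCKING = "non-blocking"
    BLOCKING = "blocking"
    DEPLETION = "depletion"


def state_on_arrival(remaining_bytes: int, hash_entry_bytes: int) -> ProcessingState:
    """Classify data table A when the hash value of new data i1 arrives.

    remaining_bytes:  free space left in memory area a1 (assumed unit: bytes)
    hash_entry_bytes: space needed to store the hash value of data i1
    """
    if remaining_bytes >= hash_entry_bytes:
        # The hash value still fits into memory area a1.
        return ProcessingState.NON_BLOCKING
    # The hash value cannot be written into memory area a1.
    return ProcessingState.BLOCKING
```

The boundary case (remaining space exactly equal to the required space) is treated as non-blocking, matching the "greater than or equal to" wording.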

In one embodiment, any one of the N data tables is denoted as data table A; the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, and a memory hash table corresponding to the data table A is stored in the memory area a1; the obtaining unit 601 is specifically configured to:

acquiring the processing speed of the data of the data table A stored in the memory area a1;

if the processing speed is greater than or equal to a preset speed threshold, determining that the processing state of the data table A is a non-blocking state;

if the processing speed is less than the preset speed threshold, determining that the processing state of the data table A is a blocking state.

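In one embodiment, any one of the N data tables is denoted as data table A; the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, and a memory hash table corresponding to the data table A is stored in the memory area a1; the obtaining unit 601 is specifically configured to:

acquiring the processing progress of the data of the data table A stored in the memory area a1;

if the processing progress indicates that the data of the data table A stored in the memory area a1 has been processed and no new data is added to the data table A, determining that the processing state of the data table A is a depletion state.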
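The speed-based and progress-based checks can be illustrated with a short sketch. The function names, the unit of the speed value and the way progress is represented (a count of unprocessed in-memory items plus a flag for the source) are assumptions made for illustration only.

```python
def state_from_speed(processing_speed: float, speed_threshold: float) -> str:
    """Classify a table by how fast its in-memory data is being processed.

    Both arguments use the same (assumed) unit, e.g. rows per second.
    """
    if processing_speed >= speed_threshold:
        return "non-blocking"
    return "blocking"


def is_depleted(pending_in_memory: int, source_finished: bool) -> bool:
    """Return True when the table is in the depletion state.

    pending_in_memory: number of items in the memory area not yet processed
    source_finished:   True when no new data will be added to the table
    """
    return pending_in_memory == 0 and source_finished
```

Both conditions must hold for depletion: a table whose memory area is momentarily empty but which may still receive data is not depleted.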

In one embodiment, the processing unit 602 is specifically configured to:

if at least one data table in the N data tables is in a non-blocking state, determining that the connection mode of the N data tables is the connection of the memory area to the memory area;

if the N data tables are in a blocking state, determining that the connection mode of the N data tables is the connection of the disk area to the memory area;

if all the N data tables are in the depletion state, the connection mode of the N data tables is determined to be the connection of the disk area to the disk area.
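A minimal sketch of this selection rule follows. The function name and the string labels are hypothetical. The text does not say what happens when blocked and depleted tables are mixed; the sketch assumes such a mix is handled like the all-blocked case, which is an assumption rather than something stated in the application.

```python
from typing import Sequence


def choose_connection_mode(states: Sequence[str]) -> str:
    """Pick a connection mode from the processing states of the N tables.

    Each state is one of "non-blocking", "blocking" or "depletion".
    """
    if len(states) < 2:
        raise ValueError("N must be an integer greater than 1")
    if any(s == "non-blocking" for s in states):
        # At least one table can still accept data in memory.
        return "memory-to-memory"
    if all(s == "depletion" for s in states):
        # Nothing new will arrive and all in-memory data has been processed.
        return "disk-to-disk"
    # Assumption: all-blocked, or a mix of blocked and depleted tables.
    return "disk-to-memory"
```

The order of the checks matters: the non-blocking test runs first, so a single non-blocking table is enough to keep the join in memory.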

In one embodiment, any two data tables to be connected in the N data tables are denoted as data table A and data table B; the data table A corresponds to a processing partition a, and the processing partition a comprises a memory area a1 and a disk area a2; the data table B corresponds to a processing partition b, and the processing partition b comprises a memory area b1 and a disk area b2; the processing unit 602 is specifically configured to:

if the determined connection mode is the connection of the memory area to the memory area, merging the data of the data table A stored in the memory area a1 with the data of the data table B stored in the memory area b1;

if the determined connection mode is the connection of the disk area to the memory area, merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the memory area b1;

if the determined connection mode is the connection of the disk area to the disk area, merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the disk area b2.

In one embodiment, the data table A and the data table B are dynamic tables; a memory hash table corresponding to the data table A is stored in the memory area a1; a memory hash table corresponding to the data table B is stored in the memory area b1; the memory hash table corresponding to the data table B contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the memory area b1; the newly added data of the data table A is data i₁, and the data table A is in a non-blocking state; the processing unit 602 is specifically configured to:

if the determined connection mode is the connection of the memory area to the memory area, when the hash value of data i₁ is received, inserting the hash value of data i₁ into the memory hash table corresponding to the data table A stored in the memory area a1;

probing, based on the hash value of data i₁, the memory hash table corresponding to the data table B stored in the memory area b1 to obtain the hash value of data j₁; data j₁ is the detected matching item that needs to be connected with data i₁;

merging the hash value of data i₁ with the hash value of data j₁.
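The insert-then-probe steps above resemble a symmetric hash join. The sketch below illustrates that pattern under stated assumptions: rows are plain dictionaries, Python's built-in `hash` stands in for the hash function, and the names `MemoryArea` and `on_new_row` are hypothetical.

```python
from collections import defaultdict


class MemoryArea:
    """In-memory hash table of one data table: hash value -> rows."""

    def __init__(self):
        self.hash_table = defaultdict(list)


def on_new_row(row, key, own, other):
    """Handle newly added data of one table in memory-to-memory mode.

    1. Insert the row into the table's own memory hash table.
    2. Probe the other table's memory hash table with the same hash value.
    3. Return the merged (row, match) pairs.
    """
    h = hash(row[key])
    own.hash_table[h].append(row)
    matches = []
    for candidate in other.hash_table.get(h, []):
        # Compare the real key as well, to guard against hash collisions.
        if candidate[key] == row[key]:
            matches.append((row, candidate))
    return matches
```

Because each arriving row is first inserted and then probed against the other side, a pair is produced exactly once, when the later of its two rows arrives.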

In one embodiment, a memory hash table corresponding to the data table B is stored in the memory area b1; the data of the data table A stored in the disk area a2 includes data i₂; the processing unit 602 is specifically configured to:

if the determined connection mode is the connection of the disk area to the memory area, probing, based on the hash value of data i₂, the memory hash table corresponding to the data table B stored in the memory area b1 to obtain the hash value of data j₂; data j₂ is the detected matching item that needs to be connected with data i₂;

merging the hash value of data i₂ with the hash value of data j₂.
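The disk-to-memory probe can be sketched as streaming the spilled rows of data table A and looking each one up in the memory hash table of data table B. The JSON-lines spill format and all function names are assumptions for illustration; the application does not specify an on-disk layout.

```python
import json
from collections import defaultdict


def build_hash_table(rows, key):
    """Build a memory hash table (hash value -> rows) for data table B."""
    table = defaultdict(list)
    for row in rows:
        table[hash(row[key])].append(row)
    return table


def disk_to_memory_join(spilled_lines, memory_table_b, key):
    """Probe B's memory hash table with rows of A read back from disk.

    spilled_lines: any iterable of JSON lines, e.g. an open text file that
                   holds the rows of table A written to disk area a2
                   (assumed format).
    """
    for line in spilled_lines:
        if not line.strip():
            continue  # skip blank lines in the spill file
        row_a = json.loads(line)
        for row_b in memory_table_b.get(hash(row_a[key]), []):
            if row_b[key] == row_a[key]:
                yield row_a, row_b
```

The function is a generator, so rows of A are read one at a time and never need to fit into memory together.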

In one embodiment, the data of the data table A stored in the disk area a2 includes data i₃; a disk hash table corresponding to the data table B is stored in the disk area b2; the disk hash table contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the disk area b2; the processing unit 602 is specifically configured to:

if the determined connection mode is the connection of the disk area to the disk area, selecting the disk area a2, and constructing a memory hash table corresponding to the data table A for the disk area a2; the constructed memory hash table corresponding to the data table A contains the hash value of data i₃;

probing, based on the hash value of data i₃, the disk hash table corresponding to the data table B to obtain the hash value of data j₃; data j₃ is the detected matching item that needs to be connected with data i₃;

merging the hash value of data i₃ with the hash value of data j₃.
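A possible shape of the disk-to-disk case is sketched below: rows of data table A are loaded from disk into a temporary memory hash table, which is then matched against the disk hash table of data table B. Loading A in fixed-size chunks is an assumption added here to keep memory bounded; the text only says that a memory hash table is constructed for the disk area a2. The disk hash table is modelled as a mapping from hash value to rows, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def disk_to_disk_join(disk_rows_a, disk_hash_table_b, key, chunk_size=1000):
    """Join rows of A stored on disk with B's disk hash table.

    disk_hash_table_b: mapping hash value -> rows of B; in practice this
                       would be backed by storage in disk area b2.
    """
    results = []
    for chunk in chunks(disk_rows_a, chunk_size):
        # Construct a memory hash table for this part of disk area a2.
        memory_table_a = defaultdict(list)
        for row in chunk:
            memory_table_a[hash(row[key])].append(row)
        # Probe B's disk hash table with each hash value of A.
        for h, rows_a in memory_table_a.items():
            for row_b in disk_hash_table_b.get(h, []):
                for row_a in rows_a:
                    if row_a[key] == row_b[key]:
                        results.append((row_a, row_b))
    return results
```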

In one embodiment, the processing unit 602 is further configured to: perform de-duplication detection on the connection calculation process, and output a connection calculation result according to the de-duplication detection result.

In one embodiment, the determined connection mode includes: the connection of the disk area to the memory area or the connection of the disk area to the disk area; any two data tables to be connected in the N data tables are denoted as data table A and data table B, wherein the data table A comprises data i, and the data table B comprises data j; the processing unit 602 is configured to:

in the process of connection calculation, if the matching item to be connected with the data i is detected to be the data j, acquiring a storage space relationship between the data i and the data j;

if the storage space relationship indicates that the data j still exists in the memory area b1 when the data i is stored in the memory area a1, performing de-duplication processing on the connection calculation between the data i and the data j;

if the storage space relationship indicates that the data j still exists in the memory area b1 when the data i is written from the memory area a1 to the disk area a2, performing de-duplication processing on the connection calculation between the data i and the data j.

In one embodiment, the data i carries a start timestamp, and the data j carries a start timestamp and an end timestamp; the start timestamp is the timestamp at which the corresponding data enters the memory area, and the end timestamp is the timestamp at which the corresponding data leaves the memory area; if the start timestamp carried by the data i is between the start timestamp and the end timestamp carried by the data j, the storage space relationship between the data i and the data j indicates that: when the data i is stored in the memory area a1, the data j also exists in the memory area b1;

a processing log is stored in the processing partition of each data table, and is used for recording a reference timestamp of the latest data write in the corresponding processing partition and a probe timestamp at which the memory hash table stored in the corresponding memory area is probed; if the reference timestamp recorded in the processing log of the data table A is greater than the start timestamp of the data j, or if the probe timestamp recorded in the processing log of the data table A is between the start timestamp and the end timestamp of the data j, the storage space relationship between the data i and the data j indicates that: when the data i is written from the memory area a1 to the disk area a2, the data j still exists in the memory area b1.
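The two timestamp conditions can be written as simple interval checks. The sketch below is illustrative only: the function names are hypothetical, timestamps are assumed to be comparable numbers, and "between" is assumed to include both endpoints, which the text does not specify.

```python
def coexisted_in_memory(i_start: float, j_start: float, j_end: float) -> bool:
    """True if data j was in memory area b1 when data i entered memory area a1."""
    return j_start <= i_start <= j_end


def j_in_memory_at_flush(reference_ts: float, probe_ts: float,
                         j_start: float, j_end: float) -> bool:
    """True if data j was still in memory area b1 when data i was written to disk.

    reference_ts: timestamp of the latest write recorded in table A's log
    probe_ts:     timestamp of the probe recorded in table A's log
    """
    return reference_ts > j_start or (j_start <= probe_ts <= j_end)


def needs_dedup(i_start: float, reference_ts: float, probe_ts: float,
                j_start: float, j_end: float) -> bool:
    """A pair (i, j) is de-duplicated when either condition holds."""
    return (coexisted_in_memory(i_start, j_start, j_end)
            or j_in_memory_at_flush(reference_ts, probe_ts, j_start, j_end))
```

In either case the pair was already joined while data j was in memory, so producing it again in a disk-based pass would duplicate the result.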

In one embodiment, the connection calculation between the N data tables is performed according to the connection calculation logic between the N data tables, and the obtaining unit 601 is configured to: acquiring connection indication information about N data tables, and constructing a connection tree based on the connection indication information;

the processing unit 602 is configured to: analyze the connection tree and determine the connection calculation logic among the N data tables; the connection calculation logic comprises calculation logic of connection-free calculation and calculation logic of connection-dependent calculation;

in the process of performing connection calculation on the N data tables, execute the connection-free calculation in parallel and the connection-dependent calculation serially according to the connection calculation logic.
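One way to picture this scheduling is to group the joins of the connection tree by height: joins at the same height do not depend on each other and can run in parallel, while successive heights run one after another. The sketch below assumes that reading of "connection-free" and "connection-dependent"; `JoinNode`, `schedule` and `run` are hypothetical names, and the actual join is passed in as a callable.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(eq=False)
class JoinNode:
    """A join in the connection tree; a leaf is a table name (str)."""
    left: Union["JoinNode", str]
    right: Union["JoinNode", str]


def schedule(root) -> List[List[JoinNode]]:
    """Group joins by height; each group only depends on earlier groups."""
    levels = {}

    def visit(node) -> int:
        if isinstance(node, str):
            return 0
        h = 1 + max(visit(node.left), visit(node.right))
        levels.setdefault(h, []).append(node)
        return h

    visit(root)
    return [levels[h] for h in sorted(levels)]


def run(root, join_fn: Callable, max_workers: int = 4):
    """Run independent joins in parallel and dependent joins level by level."""
    results = {}

    def resolve(x):
        return x if isinstance(x, str) else results[id(x)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for level in schedule(root):  # levels are executed serially
            futures = {
                id(n): pool.submit(join_fn, resolve(n.left), resolve(n.right))
                for n in level  # joins within one level run in parallel
            }
            for node_id, future in futures.items():
                results[node_id] = future.result()
    return resolve(root)
```

For example, with `root = JoinNode(JoinNode("F1", "D1"), JoinNode("F2", "D2"))`, the two inner joins form the first level and run concurrently; the outer join waits for both.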

In one embodiment, the N data tables include at least one fact table and a plurality of dimension tables associated with each fact table; the connection-free calculation includes: connection computation between any fact table and each associated dimension table, and connection computation between different fact tables and associated multiple dimension tables.

In one embodiment, the connection calculation between any fact table and the associated dimension tables employs a hierarchical semi-connection, which refers to: splitting the fact table into a plurality of local fact tables according to the association with the dimension tables, and then performing connection calculation on the local fact tables and the corresponding dimension tables to obtain connection result tables;

the connection mode adopted by the connection calculation of each connection result table and any fact table is the same as the connection mode adopted by the connection calculation of the N data tables, and de-duplication detection is needed in the connection calculation process so as to output the connection calculation result.
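The hierarchical semi-connection can be illustrated with a small in-memory sketch. It assumes that each local fact table is the projection of the fact table onto one foreign-key column (plus the row position), and that the final step keeps only fact rows matched in every dimension; both points are interpretations, and the function name, the `_dim` suffix and the dictionary layout are hypothetical. De-duplication and the memory/disk connection modes are left out.

```python
def hierarchical_semi_join(fact_rows, dimensions):
    """Join a fact table with its dimension tables in two stages.

    fact_rows:  list of dict rows of the fact table
    dimensions: {foreign_key_column: {key_value: dimension_row}}
    """
    # Stage 1: one local fact table per dimension, joined with that dimension.
    result_tables = {}
    for fk, dim_table in dimensions.items():
        local_fact = [(pos, row[fk]) for pos, row in enumerate(fact_rows)]
        result_tables[fk] = {
            pos: dim_table[k] for pos, k in local_fact if k in dim_table
        }

    # Stage 2: join each connection result table back with the fact table.
    joined = []
    for pos, row in enumerate(fact_rows):
        if all(pos in table for table in result_tables.values()):
            merged = dict(row)
            for fk, table in result_tables.items():
                merged[fk + "_dim"] = table[pos]
            joined.append(merged)
    return joined
```

The stage 1 joins for different dimensions do not depend on each other, which is what allows them to be executed in parallel.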

It may be understood that the functions of each functional module of the data processing apparatus described in the embodiments of the present application may be implemented according to the methods in the method embodiments; for the specific implementation process, refer to the relevant description of the method embodiments, which is not repeated here. The beneficial effects shared with the method are likewise not described again.

The following description is provided with respect to a computer device according to an embodiment of the present application.

The embodiment of the application also provides a structural schematic diagram of the computer equipment, and the structural schematic diagram of the computer equipment can be seen in fig. 7; the computer device may include: a processor 701, an input device 702, an output device 703 and a memory 704. The processor 701, the input device 702, the output device 703, and the memory 704 are connected by buses. The memory 704 is used for storing a computer program comprising program instructions, and the processor 701 is used for executing the program instructions stored in the memory 704.

In one embodiment, the computer device may be a computer device in a system as shown in FIG. 1; in this embodiment, the processor 701 performs the following operations by executing executable program code in the memory 704:

In one embodiment, any one of the N data tables is denoted as data table A; the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table; a memory hash table corresponding to the data table A is stored in the memory area a1; the memory hash table contains one or more hash values, and each hash value represents one piece of data of the data table A stored in the memory area a1; the newly added data of the data table A is data i₁; the processor 701 is specifically configured to:

In one embodiment, any one of the N data tables is denoted as data table A; the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, and a memory hash table corresponding to the data table A is stored in the memory area a1; the processor 701 is specifically configured to:

In one embodiment, the processor 701 is specifically configured to:

In one embodiment, any two data tables to be connected in the N data tables are denoted as data table A and data table B; the data table A corresponds to a processing partition a, and the processing partition a comprises a memory area a1 and a disk area a2; the data table B corresponds to a processing partition b, and the processing partition b comprises a memory area b1 and a disk area b2; the processor 701 is specifically configured to:

In one embodiment, the data table A and the data table B are dynamic tables; a memory hash table corresponding to the data table A is stored in the memory area a1; a memory hash table corresponding to the data table B is stored in the memory area b1; the memory hash table corresponding to the data table B contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the memory area b1; the newly added data of the data table A is data i₁, and the data table A is in a non-blocking state; the processor 701 is specifically configured to:

merging the hash value of data i₁ with the hash value of data j₁.

In one embodiment, a memory hash table corresponding to the data table B is stored in the memory area b1; the data of the data table A stored in the disk area a2 includes data i₂; the processor 701 is specifically configured to:

merging the hash value of data i₂ with the hash value of data j₂.

In one embodiment, the data of the data table A stored in the disk area a2 includes data i₃; a disk hash table corresponding to the data table B is stored in the disk area b2; the disk hash table contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the disk area b2; the processor 701 is specifically configured to:

merging the hash value of data i₃ with the hash value of data j₃.

In one embodiment, the processor 701 is further configured to: perform de-duplication detection on the connection calculation process, and output a connection calculation result according to the de-duplication detection result.

In one embodiment, the determined connection mode includes: the connection of the disk area to the memory area or the connection of the disk area to the disk area; any two data tables to be connected in the N data tables are denoted as data table A and data table B, wherein the data table A comprises data i, and the data table B comprises data j; the processor 701 is configured to:

In one embodiment, the connection computation between the N data tables is performed according to the connection computation logic between the N data tables, and the processor 701 is configured to:

acquiring connection indication information about N data tables, and constructing a connection tree based on the connection indication information;

analyzing the connection tree, and determining connection calculation logic among N data tables; the connection computing logic comprises computing logic of connection-free computing and computing logic of connection-dependent computing;

It should be understood that the computer device described in the embodiment of the present application can carry out the data processing method described in the foregoing corresponding embodiments, and can also carry out the functions of the data processing apparatus described in the embodiment corresponding to fig. 2 above; the details are not repeated here. The beneficial effects shared with the method are likewise not described again.

Furthermore, it should be noted that an exemplary embodiment of the present application also provides a computer-readable storage medium storing a computer program of the foregoing data processing method; when the computer program is executed by a processor, the data processing method described in the embodiments of the present application is performed. That is, when one or more processors load and execute the computer program, the data processing method described in the embodiments can be implemented; the details, and the beneficial effects shared with the method, are not repeated here.

The computer-readable storage medium may be an internal storage unit of the data processing apparatus provided in any one of the foregoing embodiments or of the computer device, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

In one aspect of the application, a computer program product or computer program is provided. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program to cause the computer device to perform a method of data processing provided in an aspect of an embodiment of the present application.

In one aspect of the application, another computer program product or computer program is provided, the computer program product comprising a computer program which, when being executed by a processor, implements the steps of the data processing method provided by the embodiments of the application.

The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.

The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.

The above disclosure describes only some embodiments of the present application and does not limit the scope of the claims; equivalent changes made according to the claims of the present application still fall within the scope of the present application.

Claims

1. A method of data processing, comprising:

acquiring N data tables to be connected, and acquiring the processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of a corresponding data table, and comprises a memory area and a disk area; wherein N is an integer greater than 1;

2. The method of claim 1, wherein the processing state comprises any one of a blocking state, a non-blocking state and a depletion state;

wherein the blocking state is a state in which newly added data of the data table cannot be written into the corresponding memory area;

the non-blocking state is a state in which newly added data of the data table can be written into a corresponding memory area;

the depletion state is a state in which no new data is added to the data table and the data of the data table stored in the corresponding memory area has been processed.

3. The method of claim 2, wherein any one of the N data tables is denoted as data table A, the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, the memory area a1 stores a memory hash table corresponding to the data table A, the memory hash table corresponding to the data table A contains one or more hash values, and each hash value represents one piece of data of the data table A stored in the memory area a1; the newly added data of the data table A is data i₁; the obtaining the processing states of the N data tables includes:

when the hash value of the data i₁ is received, detecting the remaining memory space of the memory area a1;

if the remaining memory space of the memory area a1 is greater than or equal to the storage space required by the hash value of the data i₁, determining that the processing state of the data table A is a non-blocking state;

if the remaining memory space of the memory area a1 is smaller than the storage space required by the hash value of the data i₁, determining that the processing state of the data table A is a blocking state.

4. The method of claim 2, wherein any one of the N data tables is denoted as data table A, the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, and a memory hash table corresponding to the data table A is stored in the memory area a1; the obtaining the processing states of the N data tables includes:

and if the processing speed is less than the preset speed threshold, determining that the processing state of the data table A is a blocking state.

5. The method of claim 2, wherein any one of the N data tables is denoted as data table A, the data table A corresponds to a processing partition a, and the processing partition a includes a memory area a1 and a disk area a2; the data table A is a dynamic table, and a memory hash table corresponding to the data table A is stored in the memory area a1; the obtaining the processing states of the N data tables includes:

and if the processing progress indicates that the data of the data table A stored in the memory area a1 are processed and no new data is added in the data table A, determining that the processing state of the data table A is a depletion state.

6. The method of claim 2, wherein determining the connection mode of the N data tables according to the processing states of the N data tables comprises:

if at least one data table in the N data tables is in a non-blocking state, determining that the connection mode of the N data tables is the connection of a memory area to a memory area;

and if all the N data tables are in the depletion state, determining that the connection mode of the N data tables is the connection of the disk area to the disk area.

7. The method according to any one of claims 1, 2 and 6, wherein any two data tables to be connected among the N data tables are denoted as data table A and data table B; the data table A corresponds to a processing partition a, wherein the processing partition a comprises a memory area a1 and a disk area a2; the data table B corresponds to a processing partition b, and the processing partition b comprises a memory area b1 and a disk area b2; the performing connection calculation on the N data tables according to the determined connection mode comprises:

if the determined connection mode is the connection of the disk area to the memory area, merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the memory area b1;

if the determined connection mode is the connection of the disk area to the disk area, merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the disk area b2.

8. The method of claim 7, wherein the data table A and the data table B are dynamic tables; the memory area a1 stores a memory hash table corresponding to the data table A; the memory area b1 stores a memory hash table corresponding to the data table B; the memory hash table corresponding to the data table B contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the memory area b1; the newly added data of the data table A is data i₁, and the data table A is in a non-blocking state;

if the determined connection mode is the connection of the memory area to the memory area, the merging the data of the data table A stored in the memory area a1 with the data of the data table B stored in the memory area b1 includes:

if the determined connection mode is the connection of the memory area to the memory area, when the hash value of the data i₁ is received, inserting the hash value of the data i₁ into the memory hash table corresponding to the data table A stored in the memory area a1;

probing, based on the hash value of the data i₁, the memory hash table corresponding to the data table B stored in the memory area b1 to obtain the hash value of data j₁; the data j₁ is the detected matching item that needs to be connected with the data i₁;

merging the hash value of the data i₁ with the hash value of the data j₁.

9. The method of claim 7, wherein the memory area b1 stores a memory hash table corresponding to the data table B; the data of the data table A stored in the disk area a2 includes data i₂; if the determined connection mode is the connection of the disk area to the memory area, the merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the memory area b1 includes:

if the determined connection mode is the connection of the disk area to the memory area, probing, based on the hash value of the data i₂, the memory hash table corresponding to the data table B stored in the memory area b1 to obtain the hash value of data j₂; the data j₂ is the detected matching item that needs to be connected with the data i₂;

merging the hash value of the data i₂ with the hash value of the data j₂.

10. The method of claim 7, wherein the data of the data table A stored in the disk area a2 includes data i₃; the disk area b2 stores a disk hash table corresponding to the data table B, the disk hash table contains one or more hash values, and each hash value represents one piece of data of the data table B stored in the disk area b2; if the determined connection mode is the connection of the disk area to the disk area, the merging the data of the data table A stored in the disk area a2 with the data of the data table B stored in the disk area b2 includes:

probing, based on the hash value of the data i₃, the disk hash table corresponding to the data table B to obtain the hash value of data j₃; the data j₃ is the detected matching item that needs to be connected with the data i₃;

merging the hash value of the data i₃ with the hash value of the data j₃.

11. The method of claim 7, wherein the method further comprises: performing de-duplication detection on the connection calculation process, and outputting the connection calculation result according to the de-duplication detection result.

12. The method of claim 11, wherein the determined connection mode comprises: the connection of the disk area to the memory area or the connection of the disk area to the disk area; any two data tables to be connected in the N data tables are represented as a data table A and a data table B, wherein the data table A comprises data i, and the data table B comprises data j;

the performing de-duplication detection on the connection calculation process includes:

if the storage space relation indicates that the data j still exists in the memory area b1 when the data i is stored in the memory area a1, performing deduplication processing on connection calculation between the data i and the data j;

If the storage space relationship indicates that the data j still exists in the memory area b1 when the data i is written from the memory area a1 to the disk area a2, performing deduplication processing on connection calculation between the data i and the data j.

13. The method of claim 12, wherein the data i carries a start timestamp, and the data j carries a start timestamp and an end timestamp; the start timestamp is the timestamp at which the corresponding data enters the memory area, and the end timestamp is the timestamp at which the corresponding data leaves the memory area; if the start timestamp carried by the data i is between the start timestamp and the end timestamp carried by the data j, the storage space relationship between the data i and the data j indicates that: when the data i is stored in the memory area a1, the data j also exists in the memory area b1;

a processing log is stored in the processing partition of each data table, and the processing log is used for recording a reference timestamp of the latest data write in the corresponding processing partition and a probe timestamp at which the memory hash table stored in the corresponding memory area is probed; if the reference timestamp recorded in the processing log of the data table A is greater than the start timestamp of the data j, or if the probe timestamp recorded in the processing log of the data table A is between the start timestamp and the end timestamp of the data j, the storage space relationship between the data i and the data j indicates that: when the data i is written from the memory area a1 to the disk area a2, the data j still exists in the memory area b1.

14. The method of claim 1, wherein the connection computation between the N data tables is performed in accordance with connection computation logic between the N data tables, the method further comprising:

acquiring connection indication information about the N data tables, and constructing a connection tree based on the connection indication information;

analyzing the connection tree, and determining connection calculation logic among the N data tables; the connection computing logic comprises computing logic of connection-free computing and computing logic of connection-dependent computing;

and in the process of carrying out connection calculation on the N data tables, executing the connection-free calculations in parallel and executing the connection-dependent calculations serially, according to the connection calculation logic.
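The scheduling idea in claim 14 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Join` node type, the `classify` and `execute` helpers, and the toy `hash_join` are all hypothetical. The sketch also assumes that a join whose two inputs are both base tables counts as connection-free, and that any join consuming another join's result counts as connection-dependent. Threads are used only to show the scheduling shape.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Join:
    """Hypothetical connection-tree node; a leaf is just a table name (str)."""
    left: Union[str, "Join"]
    right: Union[str, "Join"]
    key: str


def hash_join(left, right, key):
    """Toy in-memory equi-join over lists of dicts."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def classify(node, independent, dependent):
    """Post-order walk, so every dependent join is listed after its inputs."""
    if isinstance(node, str):
        return
    classify(node.left, independent, dependent)
    classify(node.right, independent, dependent)
    if isinstance(node.left, str) and isinstance(node.right, str):
        independent.append(node)   # assumed connection-free
    else:
        dependent.append(node)     # assumed connection-dependent


def execute(tree, tables):
    independent, dependent = [], []
    classify(tree, independent, dependent)
    results = {}

    def rows(node):
        return tables[node] if isinstance(node, str) else results[node]

    # Connection-free joins run in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [(n, pool.submit(hash_join, rows(n.left), rows(n.right), n.key))
                   for n in independent]
        for node, future in futures:
            results[node] = future.result()

    # Connection-dependent joins run serially, in dependency order.
    for node in dependent:
        results[node] = hash_join(rows(node.left), rows(node.right), node.key)
    return rows(tree)
```

A tree such as `Join(Join("orders", "customers", "cid"), Join("items", "products", "pid"), "oid")` yields two connection-free joins that run concurrently, followed by one dependent join over their results.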

15. The method of claim 14, wherein the N data tables include at least one fact table and a plurality of dimension tables associated with each fact table;

the connection-free calculations include: the connection calculation between any fact table and each of its associated dimension tables, and the connection calculations between different fact tables and their respective associated dimension tables.

16. The method of claim 14, wherein the connection calculation between any fact table and its associated dimension tables employs a hierarchical semi-connection, the hierarchical semi-connection being: splitting the fact table into a plurality of local fact tables according to its associations with the dimension tables, and then carrying out connection calculation on each local fact table and the corresponding dimension table to obtain a connection result table;

the connection mode adopted for the connection calculation between each connection result table and any fact table is the same as the connection mode adopted for the connection calculation on the N data tables, and deduplication detection is performed during the connection calculation so as to output the connection calculation result.
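One possible reading of the hierarchical semi-connection in claim 16 is sketched below in Python. The claim does not fix these details, so the following are assumptions made for illustration: each local fact table is a projection of the fact table onto a row identifier and one foreign key, each foreign key value is unique in its dimension table, and deduplication is keyed on the row identifier. All function names are hypothetical.

```python
def split_fact(fact, id_col, fk_cols):
    """One local fact table per dimension: project onto (row id, foreign key)."""
    return {fk: [{id_col: r[id_col], fk: r[fk]} for r in fact] for fk in fk_cols}


def join_local(local, dim, fk):
    """Join a local fact table with its dimension table on the foreign key."""
    index = {d[fk]: d for d in dim}  # assumes fk values are unique in dim
    return [{**row, **index[row[fk]]} for row in local if row[fk] in index]


def hierarchical_semi_join(fact, dims, id_col):
    """dims maps each foreign-key column to the rows of its dimension table."""
    local_tables = split_fact(fact, id_col, list(dims))
    # Connection result tables, indexed by row id for the final join.
    by_id = {
        fk: {r[id_col]: r for r in join_local(local_tables[fk], dims[fk], fk)}
        for fk in dims
    }
    out, seen = [], set()
    for row in fact:
        rid = row[id_col]
        if rid in seen:  # deduplication check before emitting a result
            continue
        parts = [by_id[fk].get(rid) for fk in dims]
        if all(p is not None for p in parts):
            merged = dict(row)
            for p in parts:
                merged.update(p)
            out.append(merged)
            seen.add(rid)
    return out
```

In this sketch the per-dimension joins are independent of one another, so they could be dispatched in parallel in the manner of claim 14. Only the final merge against the fact table needs the deduplication check.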

17. A data processing apparatus, comprising:

the acquisition unit is used for acquiring N data tables to be connected and acquiring the processing states of the N data tables; each data table in the N data tables corresponds to a respective processing partition; the processing partition is used for storing data of a corresponding data table, and comprises a memory area and a disk area; wherein N is an integer greater than 1;

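the processing unit is used for determining the connection mode of the N data tables according to the processing states of the N data tables; the connection mode comprises at least one of the following: the connection of the memory area to the memory area, the connection of the disk area to the memory area and the connection of the disk area to the disk area;

and the processing unit is also used for carrying out connection calculation on the N data tables according to the determined connection mode.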

18. A computer device, comprising:

a processor adapted to execute a computer program;

a computer-readable storage medium having stored therein a computer program which, when executed by the processor, performs the data processing method according to any one of claims 1-16.

19. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, performs the data processing method according to any one of claims 1-16.

20. A computer program product, characterized in that the computer program product comprises a computer program or computer instructions which are executed by a processor to implement the data processing method according to any of claims 1-16.