CA3142143A1 - Method and apparatus for correlating data tables based on KV database

Info

Publication number: CA3142143A1
Application number: CA3142143A
Authority: CA
Inventors: Hu Peng; Qian Sun; Bin Shi; Shijin Gao
Original assignee: 10353744 Canada Ltd
Current assignee: 10353744 Canada Ltd
Priority date: 2020-12-16
Filing date: 2021-12-14
Publication date: 2022-06-16
Also published as: CN112487111A

Abstract

A method and an apparatus for correlating data tables based on a KV database, relating to the technical field of big data, and capable of effectively solving long-tail problems and the high consumption of computing capacity that arise during correlation of data tables. The method includes: according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes; if there is no data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data of the values to update the compute nodes; if there is no data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data of the values to update the compute nodes and the in-memory database. The apparatus implements the disclosed method.

Description

METHOD AND APPARATUS FOR CORRELATING DATA TABLES BASED ON KV DATABASE
BACKGROUND OF THE INVENTION
Technical Field

[0001] The present invention relates to the technical field of big data, and more particularly to a method and an apparatus for correlating data tables based on a KV database.
Description of Related Art

[0002] In applications of data warehouses, correlation among data tables is a common operation. In a distributed computing environment, the existing approaches to correlation among data tables tend to have the following problems:
1. Uneven distribution of numerical indicator data in a fact table causes long-tail problems; and
2. An excessively large dimension table leads to high consumption of loading I/O and computing capacity.
SUMMARY OF THE INVENTION

[0003] The objective of the present invention is to provide a method and an apparatus for correlating data tables based on a KV database, capable of effectively solving long-tail problems and problems of high resource consumption that arise during correlation of data tables.

[0004] To achieve the foregoing objective, in a first aspect, the present invention provides a method for correlating data tables based on a KV database, comprising:

[0005] according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;

[0006] if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and

Date recue / Date received 2021-12-14

[0007] if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.
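By way of a non-limiting illustration, the three retrieval steps above may be sketched in Python as follows. This is a minimal sketch only: plain dictionaries stand in for the local compute node, the in-memory database and the KV database, and the names `local_cache`, `memory_db`, `kv_db` and `lookup` are hypothetical and do not appear in the disclosure. The handling of a key that is absent from every tier is likewise an assumption.

```python
from typing import Dict, Optional

# Hypothetical stand-ins: plain dicts play the roles of the three tiers.
local_cache: Dict[str, str] = {}                        # local compute node
memory_db: Dict[str, str] = {}                          # in-memory database
kv_db: Dict[str, str] = {"c1": "Alice", "c2": "Bob"}    # KV database


def lookup(key: str) -> Optional[str]:
    """Return the dimension value for key, updating faster tiers on a miss."""
    # Step 1: try the local compute node first.
    if key in local_cache:
        return local_cache[key]
    # Step 2: fall back to the in-memory database and update the compute node.
    if key in memory_db:
        local_cache[key] = memory_db[key]
        return local_cache[key]
    # Step 3: fall back to the KV database and update both faster tiers.
    if key in kv_db:
        value = kv_db[key]
        memory_db[key] = value
        local_cache[key] = value
        return value
    # Assumption: a key unknown to every tier yields None.
    return None


print(lookup("c1"))
```

After the first call for a given key, the value is held in both `local_cache` and `memory_db`, so a repeated request for that key is served without touching `kv_db`.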

[0008] Preferably, before the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table, the method further comprises:

[0009] periodically loading incremental data in a dimension table to the KV database.
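As a hedged illustration of the periodic incremental loading, the following Python sketch copies only changed dimension rows into the KV database. The disclosure does not state how incremental data are identified; the use of an `updated_at` tick and a watermark is an assumption made for this example, and `load_increment` is a hypothetical name. Scheduling is omitted, and two successive calls simulate two loading periods.

```python
from typing import Dict, List, Tuple

# Hypothetical dimension row: (key, value, updated_at as an integer tick).
DimensionRow = Tuple[str, str, int]


def load_increment(rows: List[DimensionRow], kv_db: Dict[str, str], watermark: int) -> int:
    """Copy rows changed after `watermark` into kv_db; return the new watermark."""
    new_watermark = watermark
    for key, value, updated_at in rows:
        if updated_at > watermark:
            kv_db[key] = value  # upsert only the changed row
            new_watermark = max(new_watermark, updated_at)
    return new_watermark


kv_db: Dict[str, str] = {}
dimension_table: List[DimensionRow] = [("c1", "Alice", 1), ("c2", "Bob", 2)]

# First period: both rows are newer than the initial watermark.
mark = load_increment(dimension_table, kv_db, watermark=0)

# Second period: only the newly added row is loaded.
dimension_table.append(("c3", "Carol", 3))
mark = load_increment(dimension_table, kv_db, watermark=mark)
```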

[0010] More preferably, the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table comprises:

[0011] given that the local compute node includes a fact table storing area and a local buffering area, according to a data-table-correlating request SQL, reading the key value fields in the fact table from the fact table storing area, and then retrieving the data of the values from the local buffering area.

[0012] Exemplarily, the local buffering area is a cache.

[0013] Preferably, the step of using the data of the values to update the compute nodes comprises:

[0014] using the data of the values to update the cache.
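The arrangement of a fact table storing area and a local buffering area may be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the disclosed implementation: the class `LocalComputeNode`, the field name `customer_id`, the bounded capacity and the least-recently-used eviction are all hypothetical choices that the disclosure does not specify.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class LocalComputeNode:
    """Hypothetical node with a fact table storing area and a bounded cache."""

    def __init__(self, fact_rows: List[Dict[str, str]], capacity: int = 2) -> None:
        self.fact_rows = fact_rows    # fact table storing area
        self.cache = OrderedDict()    # local buffering area (a cache)
        self.capacity = capacity      # assumption: the cache is bounded

    def keys(self, field: str) -> List[str]:
        """Read the key value field from every fact row."""
        return [row[field] for row in self.fact_rows]

    def get(self, key: str) -> Optional[str]:
        """Return the cached value, or None on a miss."""
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)   # mark as most recently used
        return self.cache[key]

    def update(self, key: str, value: str) -> None:
        """Store a value fetched from a slower tier, evicting the oldest entry."""
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)


node = LocalComputeNode(
    [{"customer_id": "c1"}, {"customer_id": "c2"}, {"customer_id": "c1"}]
)
node.update("c1", "Alice")
hits = [node.get(k) for k in node.keys("customer_id")]
```

In this sketch a `None` entry in `hits` marks a key that would next be requested from the in-memory database.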

[0015] As compared to the prior art, the method for correlating data tables based on a KV database of the present invention has the following beneficial effects:

[0016] In the disclosed method, when a data-table-correlating request SQL is executed, data of corresponding values may be first retrieved from local compute nodes according to key value fields in the fact table. If data of the corresponding values are found in the local compute nodes, the findings are returned directly. If there is no such data found in the local compute nodes, an attempt to retrieve the data of the corresponding values from the in-memory database is made. If data of the corresponding values are found in the in-memory database, the findings are returned directly and used to update the compute nodes. If there is no such data found in the in-memory database, a further attempt is made with the KV database, and the retrieved data of the values are used to update the compute nodes and the in-memory database.

[0017] It is thus clear that the present invention is designed for business scenarios where a fact table and a dimension table are correlated, so as to minimize data distribution (i.e., shuffling) and allow the fact table and the dimension table to be jointly computed at the map end without the risk of data skew. In addition, incremental dimension data may be loaded into the KV database periodically, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed, thereby eliminating the need to always load and compute the full-volume dimension table and preventing unnecessarily large capacity consumption.

[0018] In a second aspect, the present invention provides an apparatus for correlating data tables based on a KV database, which is applicable to the method for correlating data tables based on a KV database of the foregoing technical scheme. The apparatus comprises:

[0019] a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;

[0020] a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and

[0021] a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.

[0022] Preferably, given that the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit, according to a data-table-correlating request SQL, reads the key value fields in the fact table from the fact table storing area, and then retrieves the data of the values from the local buffering area.

[0023] Exemplarily, the local buffering area is a cache.

[0024] Preferably, the step of using the data of the values to update the compute nodes comprises:

[0025] using the data of the values to update the cache.

[0026] As compared to the prior art, the disclosed apparatus for correlating data tables based on a KV database provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.

[0027] In a third aspect, the present invention provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for correlating data tables based on a KV database as described previously.

[0028] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
BRIEF DESCRIPTION OF THE DRAWINGS

[0029] The accompanying drawings are provided herein for better understanding of the present invention and form a part of this disclosure. The illustrative embodiments and their descriptions are for explaining the present invention and by no means form any improper limitation to the present invention, wherein:

[0030] FIG. 1 is a flowchart of a method for correlating data tables based on a KV database according to one embodiment of the present invention;

[0031] FIG. 2 is a schematic drawing illustrating interactive logic for retrieving data of values according to the embodiment of the present invention; and

[0032] FIG. 3 is another flowchart of the method according to the embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION

[0033] To make the foregoing objectives, features, and advantages of the present invention clearer and more understandable, the following description will be directed to some embodiments as depicted in the accompanying drawings to detail the technical schemes disclosed in these embodiments. It is, however, to be understood that the embodiments referred to herein are only a part of all possible embodiments and thus not exhaustive. All other embodiments that people of ordinary skill in the art can conceive without creative labor based on the embodiments of the present invention shall be embraced in the scope of the present invention.
Embodiment 1

[0034] Referring to FIG. 1 to FIG. 3, the present embodiment provides a method for correlating data tables based on a KV database. The method comprises:

[0035] according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes; if there is no data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data of the values to update the compute nodes; if there is no data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data of the values to update the compute nodes and the in-memory database.

[0036] In the method of the present invention, when a data-table-correlating request SQL is executed, data of corresponding values may be first retrieved from local compute nodes according to key value fields in the fact table. If data of the corresponding values are found in the local compute nodes, the findings are returned directly. If there is no such data found in the local compute nodes, an attempt to retrieve the data of the corresponding values from the in-memory database is made. If data of the corresponding values are found in the in-memory database, the findings are returned directly and used to update the compute nodes. If there is no such data found in the in-memory database, a further attempt is made with the KV database, and the retrieved data of the values are used to update the compute nodes and the in-memory database.

[0037] It is thus clear that the present invention is designed for business scenarios where a fact table and a dimension table are correlated, so as to minimize data distribution (i.e., shuffling) and allow the fact table and the dimension table to be jointly computed at the map end without the risk of data skew. In addition, incremental dimension data may be loaded into the KV database periodically, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed, thereby eliminating the need to always load and compute the full-volume dimension table and preventing unnecessarily large capacity consumption.

[0038] In the embodiment described above, before the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table, the method further comprises:

[0039] periodically loading incremental data in a dimension table to the KV database.

[0040] In the embodiment described above, the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table comprises:

[0041] given that the local compute node includes a fact table storing area and a local buffering area, according to a data-table-correlating request SQL, reading the key value fields in the fact table from the fact table storing area, and then retrieving the data of the values from the local buffering area. Exemplarily, the local buffering area is a cache.

[0042] In the embodiment described above, the step of using the data of the values to update the compute nodes comprises:

[0043] using the data of the values to update the cache.

[0044] Referring to FIG. 3, in specific implementations, data tables are correlated according to a data-table-correlating request SQL. Based on the input key value fields, data of corresponding values are retrieved from local compute nodes. If the data of the values are retrieved, they are returned. Otherwise, a retrieval attempt is made with the in-memory database. If the data of the values are retrieved, the data of the values from the in-memory database are used to update the local compute nodes, and the corresponding data of the values are returned. If the attempt fails, a further attempt is made with the KV database for retrieving the data of the values, the retrieved results are used to update the in-memory database and the compute nodes, and the corresponding data of the values are returned.

[0045] It is thus clear that the embodiment as described above is suitable for business scenarios where a fact table and a dimension table are to be correlated. It uses the KV database, the cache compute ability, and distributed user-defined function (UDF) compute ability to convert SQL correlation computation into function computation, thereby solving the following problems:
1. The common join is converted to a function computation, and this helps minimize the shuffle process. Since the fact table and the dimension table are correlated and computed at the map end, problems about data skew can be prevented;
2. The incremental dimension data are periodically loaded into, for example, the KV database, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed. This eliminates the need to always load and compute the full-volume dimension table and prevents unnecessarily large capacity consumption, thereby reducing resource consumption of the platform and enhancing compute efficiency.
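The conversion of a join into a function computation may be illustrated by the following Python sketch, which is offered under stated assumptions only. The helper names `make_join_udf` and `map_partition`, the field names, and the use of a plain dictionary in place of the tiered lookup are hypothetical. Each fact row is enriched where it already resides, so no row is redistributed by key even when one key (here `c1`) dominates the data.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

FactRow = Dict[str, str]
JoinedRow = Dict[str, Optional[str]]


def make_join_udf(
    lookup: Callable[[str], Optional[str]], key_field: str, out_field: str
) -> Callable[[FactRow], JoinedRow]:
    """Build a row-level function that attaches the dimension value to a fact row."""
    def udf(row: FactRow) -> JoinedRow:
        return {**row, out_field: lookup(row[key_field])}
    return udf


def map_partition(rows: Iterable[FactRow], udf: Callable[[FactRow], JoinedRow]) -> Iterator[JoinedRow]:
    """Apply the function to each row of one partition; rows are never moved by key."""
    return (udf(row) for row in rows)


# Stand-in for the tiered lookup (local node, in-memory database, KV database).
dimension = {"c1": "Alice", "c2": "Bob"}

# Two partitions of a fact table in which key "c1" is heavily skewed.
partitions: List[List[FactRow]] = [
    [{"order": "o1", "customer_id": "c1"}, {"order": "o2", "customer_id": "c1"}],
    [{"order": "o3", "customer_id": "c1"}, {"order": "o4", "customer_id": "c2"}],
]

udf = make_join_udf(dimension.get, "customer_id", "customer_name")
joined = [list(map_partition(p, udf)) for p in partitions]
```

Every partition keeps its original row count after enrichment, which is the property that avoids concentrating the skewed key on a single node in this sketch.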

[0046] To sum up, the embodiments described above have the following beneficial effects:
1. The foregoing logic is packaged using SQL, so the application threshold is lowered and development efficiency is improved;
2. For data skew related to large fact tables, the conventional SQL optimization approach involves optimizing the distribution of keys, so the operation is complicated. With the disclosed method, data distribution of key values is no longer a concern, and compute resources can be directly distributed according to data sizes of the set nodes, so as to achieve uniform distribution of data compute resources and efficient use of big-data compute resources;
3. For a large dimension table, only the dimension data to be used are loaded, so as to reduce compute overheads for acquiring dimension data and improve compute performance; and
4. In the prior art, dimension hotspot data used in individual sessions of SQL execution are not shared, and data loading/destroying has to be repeated every time data are used. In practical business applications, dimension hotspot data for different business compute scenarios are rather similar. By introducing the use of an in-memory database, the current hotspot dimension data can be cached, and this allows SQL statements to share hotspot data, thereby improving processing performance for big data SQL statements.
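The sharing of hotspot dimension data described in item 4 may be pictured with the following Python sketch. It is a simplified illustration, not the disclosed implementation: `CountingKV` and `Session` are hypothetical names, a shared dictionary stands in for the in-memory database, and the read counter exists only to make the effect observable.

```python
from typing import Dict


class CountingKV:
    """Hypothetical KV database that counts how often it is read."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = data
        self.reads = 0

    def get(self, key: str) -> str:
        self.reads += 1
        return self.data[key]  # a missing key raises KeyError in this sketch


class Session:
    """One SQL execution session with its own private local cache."""

    def __init__(self, memory_db: Dict[str, str], kv_db: CountingKV) -> None:
        self.local: Dict[str, str] = {}
        self.memory_db = memory_db  # shared between sessions
        self.kv_db = kv_db

    def lookup(self, key: str) -> str:
        if key not in self.local:
            if key not in self.memory_db:
                # Only the first session to need the key reads the KV database.
                self.memory_db[key] = self.kv_db.get(key)
            self.local[key] = self.memory_db[key]
        return self.local[key]


kv = CountingKV({"hot": "popular-dimension-row"})
shared_memory_db: Dict[str, str] = {}
first = Session(shared_memory_db, kv)
second = Session(shared_memory_db, kv)

first.lookup("hot")   # misses both caches, reads the KV database once
second.lookup("hot")  # served from the shared in-memory database
print(kv.reads)
```

Although the two sessions do not share a local cache, the second session finds the hotspot row in the shared in-memory database, so the KV database is read only once.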

Embodiment 2

[0047] The present embodiment provides an apparatus for correlating data tables based on a KV database. The apparatus comprises:

[0048] a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;

[0049] a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and

[0050] a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.
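One possible way to picture how the three units cooperate is the following Python sketch. The decomposition into classes, the class and method names, and the wrapper `CorrelatingApparatus` are assumptions made for illustration; the disclosure describes the units functionally and does not prescribe this structure.

```python
from typing import Dict, Optional


class CorrelatingAndRetrievingUnit:
    """Looks the key up in the local compute node."""

    def __init__(self, local: Dict[str, str]) -> None:
        self.local = local

    def retrieve(self, key: str) -> Optional[str]:
        return self.local.get(key)


class FirstProcessingUnit:
    """On a local miss, reads the in-memory database and updates the node."""

    def __init__(self, local: Dict[str, str], memory_db: Dict[str, str]) -> None:
        self.local = local
        self.memory_db = memory_db

    def process(self, key: str) -> Optional[str]:
        value = self.memory_db.get(key)
        if value is not None:
            self.local[key] = value
        return value


class SecondProcessingUnit:
    """On an in-memory miss, reads the KV database and updates both faster tiers."""

    def __init__(self, local: Dict[str, str], memory_db: Dict[str, str], kv_db: Dict[str, str]) -> None:
        self.local = local
        self.memory_db = memory_db
        self.kv_db = kv_db

    def process(self, key: str) -> Optional[str]:
        value = self.kv_db.get(key)
        if value is not None:
            self.memory_db[key] = value
            self.local[key] = value
        return value


class CorrelatingApparatus:
    """Hypothetical wrapper that chains the three units in order."""

    def __init__(self, kv_db: Dict[str, str]) -> None:
        self.local: Dict[str, str] = {}
        self.memory_db: Dict[str, str] = {}
        self.retriever = CorrelatingAndRetrievingUnit(self.local)
        self.first = FirstProcessingUnit(self.local, self.memory_db)
        self.second = SecondProcessingUnit(self.local, self.memory_db, kv_db)

    def correlate(self, key: str) -> Optional[str]:
        value = self.retriever.retrieve(key)
        if value is None:
            value = self.first.process(key)
        if value is None:
            value = self.second.process(key)
        return value


apparatus = CorrelatingApparatus({"c1": "Alice"})
result = apparatus.correlate("c1")
```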

[0051] Preferably, given that the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit, according to a data-table-correlating request SQL, reads the key value fields in the fact table from the fact table storing area, and then retrieves the data of the values from the local buffering area.

[0052] Exemplarily, the local buffering area is a cache.

[0053] Preferably, the step of using the data of the values to update the compute nodes comprises:

[0054] using the data of the values to update the cache.

[0055] As compared to the prior art, the disclosed apparatus for correlating data tables based on a KV database provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
Embodiment 3

[0056] The present embodiment provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for correlating data tables based on a KV database as described previously.

[0057] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.

[0058] As will be appreciated by people of ordinary skill in the art, implementation of all or a part of the steps of the method of the present invention as described previously may be realized by having a program instruct related hardware components. The program may be stored in a computer-readable storage medium and, when executed, performs the individual steps of the methods described in the foregoing embodiments. The storage medium may be a ROM/RAM, a magnetic disk, an optical disk, a memory card or the like.

[0059] The present invention has been described with reference to the preferred embodiments and it is understood that the embodiments are not intended to limit the scope of the present invention. Moreover, as the contents disclosed herein should be readily understood and can be implemented by a person skilled in the art, all equivalent changes or modifications which do not depart from the concept of the present invention should be encompassed by the appended claims. Hence, the scope of the present invention shall only be defined by the appended claims.

Claims

What is claimed is:

1. A method for correlating data tables based on a KV database, the method comprising:
according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and updating the retrieved data into the compute nodes; and
if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and updating the retrieved data into the compute nodes and the in-memory database.

2. The method of claim 1, wherein before the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes, the method further comprises:
periodically loading incremental data in a dimension table into the KV database.

3. The method of claim 1 or 2, wherein the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes comprises:
given that the local compute node includes a fact table storing area and a local buffering area, reading the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieving the data of the values from the local buffering area.

4. The method of claim 3, wherein the local buffering area is a cache.

5. The method of claim 4, wherein the step of updating the data of the values into the compute nodes comprises:
updating the data of the values into the cache.

6. An apparatus for correlating data tables based on a KV database, comprising:
a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and updating the retrieved data into the compute nodes; and
a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and updating the retrieved data into the compute nodes and the in-memory database.

7. The apparatus of claim 6, wherein the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit reads the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieves the data of the values from the local buffering area.

8. The apparatus of claim 7, wherein the local buffering area is a cache.

9. The apparatus of claim 7, wherein the step of updating the data of the values into the compute nodes comprises:
updating the data of the values into the cache.

10. A computer-readable storage medium storing therein a computer program, wherein the computer program, when executed by a processor, performs a method as described in any of claims 1 through 5.