CA3142143A1 - Method and apparatus for correlating data tables based on kv database - Google Patents

Publication number
CA3142143A1
Authority
CA
Canada
Prior art keywords
data
values
compute nodes
retrieving
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3142143A
Other languages
French (fr)
Inventor
Hu Peng
Qian Sun
Bin Shi
Shijin Gao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A method and an apparatus for correlating data tables based on a KV database, relating to the technical field of big data, and capable of effectively solving the long-tail problems and the high consumption of computing capacity that arise during correlation of data tables. The method includes: according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes; if there is no data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data of the values to update the compute nodes; if there is no data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data of the values to update the compute nodes and the in-memory database. The apparatus implements the disclosed method.

Description

METHOD AND APPARATUS FOR CORRELATING DATA TABLES BASED ON KV DATABASE
BACKGROUND OF THE INVENTION
Technical Field
[0001] The present invention relates to the technical field of big data, and more particularly to a method and an apparatus for correlating data tables based on a KV database.
Description of Related Art
[0002] In data warehouse applications, correlation among data tables is a routine operation. In a distributed computing environment, the existing approaches to correlation among data tables tend to have the following problems:
1. Uneven distribution of numerical indicator data in a fact table causes long-tail problems; and
2. An excessively large dimension table leads to high consumption of loading I/O and computing capacity.
SUMMARY OF THE INVENTION
[0003] The objective of the present invention is to provide a method and an apparatus for correlating data tables based on a KV database that can effectively solve the long-tail problems and the high resource consumption that arise during correlation of data tables.
[0004] To achieve the foregoing objective, in a first aspect of the present invention, a method for correlating data tables based on a KV database is provided, comprising:
[0005] according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
[0006] if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and
[0007] if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.
[0008] Preferably, before the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table, the method further comprises:
[0009] periodically loading incremental data in a dimension table into the KV database.
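The periodic incremental load in [0009] can be illustrated with a minimal sketch. The disclosure does not say how incremental rows are identified, so the sketch assumes each dimension row carries a change timestamp and that a watermark from the previous run is kept; the names `DimRow`, `updated_at`, and `load_increment` are hypothetical, and a plain dictionary stands in for the KV database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class DimRow:
    key: str          # dimension key, matched against the fact table's key value field
    value: dict       # dimension attributes stored as the KV value
    updated_at: int   # hypothetical change timestamp used to detect increments


def load_increment(rows: Iterable[DimRow], kv_store: Dict[str, dict],
                   watermark: int) -> Tuple[int, int]:
    """Upsert rows changed after `watermark`; return (new watermark, rows loaded)."""
    loaded = 0
    new_watermark = watermark
    for row in rows:
        if row.updated_at > watermark:
            kv_store[row.key] = row.value
            loaded += 1
            new_watermark = max(new_watermark, row.updated_at)
    return new_watermark, loaded


dimension_table = [DimRow("p1", {"name": "pen"}, 10), DimRow("p2", {"name": "ink"}, 20)]
kv_store: Dict[str, dict] = {}

# First scheduled run: everything is newer than the initial watermark.
watermark, first_run = load_increment(dimension_table, kv_store, watermark=0)

# A new dimension row arrives; the next run loads only that row.
dimension_table.append(DimRow("p3", {"name": "pad"}, 30))
watermark, second_run = load_increment(dimension_table, kv_store, watermark)
```

In a real deployment the function would presumably be invoked by a scheduler at a fixed interval; the scheduling mechanism is outside what the text specifies.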
[0010] More preferably, the step of retrieving data of corresponding values from local compute nodes according to key value fields in a fact table comprises:
[0011] given that the local compute node includes a fact table storing area and a local buffering area, reading the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieving the data of the values from the local buffering area.
[0012] Exemplarily, the local buffering area is a cache.
[0013] Preferably, the step of using the data of the values to update the compute nodes comprises:
[0014] using the data of the values to update the cache.
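A local buffering area implemented as a cache, as in [0012] to [0014], might look like the following sketch. The disclosure does not name an eviction policy or a capacity, so the least-recently-used policy and the `capacity` parameter are assumptions, and `LocalBuffer` is a hypothetical name.

```python
from collections import OrderedDict
from typing import Optional


class LocalBuffer:
    """Bounded cache on a compute node; LRU eviction is an assumed policy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, dict]" = OrderedDict()

    def get(self, key: str) -> Optional[dict]:
        """Return the cached value, or None on a miss."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def update(self, key: str, value: dict) -> None:
        """Store a value fetched from a slower tier, evicting the oldest entry if full."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the least recently used entry


buffer = LocalBuffer(capacity=2)
buffer.update("a", {"v": 1})
buffer.update("b", {"v": 2})
buffer.get("a")               # touching "a" makes "b" the eviction candidate
buffer.update("c", {"v": 3})  # exceeds capacity, so "b" is evicted
```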
[0015] As compared to the prior art, the method for correlating data tables based on a KV database of the present invention has the following beneficial effects:
[0016] In the disclosed method, when a data-table-correlating request SQL is executed, data of corresponding values may first be retrieved from local compute nodes according to key value fields in the fact table. If data of the corresponding values are found in the local compute nodes, the findings are returned directly. If there is no such data found in the local compute nodes, an attempt to retrieve the data of the corresponding values from the in-memory database is made. If data of the corresponding values are found in the in-memory database, the findings are returned directly and used to update the compute nodes. If there is no such data found in the in-memory database, a further attempt is made with the KV database, and the retrieved data of the values are used to update the compute nodes and the in-memory database.
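The retrieval order described in [0016] can be sketched as a read-through lookup with back-filling. This is an illustration under stated assumptions, not the applicant's implementation: plain dictionaries stand in for the local cache, the in-memory database, and the KV database, and the names `TieredLookup` and `served_by` are hypothetical.

```python
from typing import Dict, List, Optional


class TieredLookup:
    """One compute node's view: local cache, then shared in-memory DB, then KV DB."""

    def __init__(self, in_memory_db: Dict[str, dict], kv_db: Dict[str, dict]) -> None:
        self.local_cache: Dict[str, dict] = {}  # private to this compute node
        self.in_memory_db = in_memory_db        # shared between compute nodes
        self.kv_db = kv_db                      # holds the loaded dimension data
        self.served_by: List[str] = []          # trace of which tier answered

    def get(self, key: str) -> Optional[dict]:
        if key in self.local_cache:
            self.served_by.append("local")
            return self.local_cache[key]
        if key in self.in_memory_db:
            value = self.in_memory_db[key]
            self.local_cache[key] = value       # update the compute node
            self.served_by.append("in_memory")
            return value
        value = self.kv_db.get(key)
        if value is None:
            self.served_by.append("miss")
            return None
        self.local_cache[key] = value           # update the compute node
        self.in_memory_db[key] = value          # update the in-memory database
        self.served_by.append("kv")
        return value


shared_in_memory: Dict[str, dict] = {}
kv_database = {"c1": {"region": "east"}}

node_a = TieredLookup(shared_in_memory, kv_database)
node_b = TieredLookup(shared_in_memory, kv_database)

node_a.get("c1")  # falls through to the KV database and back-fills both caches
node_a.get("c1")  # now answered from node_a's local cache
node_b.get("c1")  # node_b's local cache is cold, so the in-memory database answers
```

Handling of a key that is absent from all three tiers is not addressed in the text; returning `None` here is an assumption.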
[0017] It is thus clear that the present invention is designed for business scenarios in which a fact table and a dimension table are correlated. It minimizes data redistribution (i.e., shuffling) and allows the fact table and the dimension table to be jointly computed at the map end without the risk of data skew. In addition, incremental dimension data may be loaded into the KV database periodically, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed, thereby eliminating the need to always load and compute the full-volume dimension table and preventing unnecessarily large capacity consumption.
[0018] In a second aspect, the present invention provides an apparatus for correlating data tables based on a KV database, which is applicable to the method for correlating data tables based on a KV database of the foregoing technical scheme. The apparatus comprises:
[0019] a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
[0020] a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and
[0021] a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.
[0022] Preferably, given that the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit reads the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieves the data of the values from the local buffering area.
[0023] Exemplarily, the local buffering area is a cache.
[0024] Preferably, the step of using the data of the values to update the compute nodes comprises:
[0025] using the data of the values to update the cache.

Date recue / Date received 2021-12-14
[0026] As compared to the prior art, the disclosed apparatus for correlating data tables based on a KV database provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
[0027] In a third aspect, the present invention provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for correlating data tables based on a KV database as described previously.
[0028] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] The accompanying drawings are provided herein for better understanding of the present invention and form a part of this disclosure. The illustrative embodiments and their descriptions are for explaining the present invention and by no means form any improper limitation to the present invention, wherein:
[0030] FIG. 1 is a flowchart of a method for correlating data tables based on a KV database according to one embodiment of the present invention;
[0031] FIG. 2 is a schematic drawing illustrating interactive logic for retrieving data of values according to the embodiment of the present invention; and
[0032] FIG. 3 is another flowchart of the method according to the embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0033] To make the foregoing objectives, features, and advantages of the present invention clearer and more understandable, the following description will be directed to some embodiments as depicted in the accompanying drawings to detail the technical schemes disclosed in these embodiments. It is, however, to be understood that the embodiments referred to herein are only some of the possible embodiments and are not exhaustive. All other embodiments that people of ordinary skill in the art can conceive without creative labor, based on the embodiments of the present invention, shall be embraced in the scope of the present invention.
Embodiment 1
[0034] Referring to FIG. 1-FIG. 3, the present embodiment provides a method for correlating data tables based on a KV database. The method comprises:
[0035] according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes; if there is no data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data of the values to update the compute nodes; if there is no data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data of the values to update the compute nodes and the in-memory database.
[0036] In the method of the present invention, when a data-table-correlating request SQL is executed, data of corresponding values may first be retrieved from local compute nodes according to key value fields in the fact table. If data of the corresponding values are found in the local compute nodes, the findings are returned directly. If there is no such data found in the local compute nodes, an attempt to retrieve the data of the corresponding values from the in-memory database is made. If data of the corresponding values are found in the in-memory database, the findings are returned directly and used to update the compute nodes. If there is no such data found in the in-memory database, a further attempt is made with the KV database, and the retrieved data of the values are used to update the compute nodes and the in-memory database.
[0037] It is thus clear that the present invention is designed for business scenarios in which a fact table and a dimension table are correlated. It minimizes data redistribution (i.e., shuffling) and allows the fact table and the dimension table to be jointly computed at the map end without the risk of data skew. In addition, incremental dimension data may be loaded into the KV database periodically, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed, thereby eliminating the need to always load and compute the full-volume dimension table and preventing unnecessarily large capacity consumption.
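The link between shuffling and data skew in [0037] can be made concrete with a small comparison. The sketch below is an illustration only: it assumes a shuffle join routes each fact row to a partition by hashing its join key (CRC32 is used here purely for reproducibility), and contrasts that with splitting the same rows evenly when no key-based routing is needed. The function names and the sample key distribution are hypothetical.

```python
import zlib
from typing import List


def shuffle_partition_sizes(keys: List[str], partitions: int) -> List[int]:
    """Rows per partition when every row is routed by a hash of its join key."""
    sizes = [0] * partitions
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % partitions] += 1
    return sizes


def even_split_sizes(keys: List[str], partitions: int) -> List[int]:
    """Rows per partition when rows are split by count, ignoring key values."""
    base, extra = divmod(len(keys), partitions)
    return [base + (1 if i < extra else 0) for i in range(partitions)]


# A skewed fact table: one hot key accounts for 90% of the rows.
fact_keys = ["hot"] * 9_000 + [f"k{i}" for i in range(1_000)]

shuffled = shuffle_partition_sizes(fact_keys, partitions=4)
even = even_split_sizes(fact_keys, partitions=4)

print("largest partition with key-based shuffle:", max(shuffled))
print("largest partition with even split:", max(even))
```

Because every row with the hot key hashes to the same partition, one task receives at least 9,000 of the 10,000 rows under key-based routing, which is the long-tail behavior the text describes; the even split caps each task at 2,500 rows.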
[0038] In the embodiment described above, before the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes, the method further comprises:
[0039] periodically loading incremental data in a dimension table to the KV database.
[0040] In the embodiment described above, the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes comprises:
[0041] given that the local compute node includes a fact table storing area and a local buffering area, according to a data-table-correlating request SQL, reading the key value fields in the fact table from the fact table storing area, and then retrieving the data of the values from a local buffering area. Exemplarily, the local buffering area is a cache.
[0042] In the embodiment described above, the step of using the data of the values to update the compute nodes comprises:
[0043] using the data of the values to update the cache.
[0044] Referring to FIG. 3, in specific implementations, data tables are correlated according to a data-table-correlating request SQL. Based on the input key value fields, data of corresponding values are retrieved from local compute nodes. If the data of the values are retrieved, they are returned. Otherwise, a retrieval attempt is made on the in-memory database. If the data of the values are retrieved, the data of the values from the in-memory database are used to update the local compute nodes, and the corresponding data of the values are returned. If the attempt fails, a further attempt is made with the KV database for retrieving the data of the values, the retrieved results are used to update the in-memory database and the compute node, and the corresponding data of the values are returned.
[0045] It is thus clear that the embodiment as described above is suitable for business scenarios where a fact table and a dimension table are to be correlated. It uses the KV database, the cache compute ability, and distributed UDF compute ability to convert SQL correlation computation into function computation, thereby solving the following problems:
1. The common join is converted to a function computation, and this helps minimize the shuffle process. Since the fact table and the dimension table are correlated and computed at the map end, problems about data skew can be prevented;
2. The incremental dimension data are periodically loaded into, for example, the KV database, so that when correlation of the fact table is performed, dimension data may be acquired from the KV database according to the dimensions actually needed. This eliminates the need to always load and compute the full-volume dimension table and prevents unnecessarily large capacity consumption, thereby reducing resource consumption of the platform and enhancing compute efficiency.
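Converting a join into a function computation, as described in [0045], can be sketched as a per-row lookup applied at the map end. The sketch makes several assumptions that the text does not state: fact rows are dictionaries, a single-level cache in front of a dictionary stands in for the full retrieval path, and a fact row with no matching dimension data is passed through unchanged (left-join behavior). The names `make_dim_lookup` and `map_side_join` are hypothetical and do not refer to any particular engine's UDF API.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Lookup = Callable[[str], Optional[dict]]


def make_dim_lookup(kv_store: Dict[str, dict]) -> Lookup:
    """Build the lookup function that plays the role of the UDF."""
    cache: Dict[str, Optional[dict]] = {}  # simplified stand-in for the cache tiers

    def lookup(key: str) -> Optional[dict]:
        if key not in cache:
            cache[key] = kv_store.get(key)
        return cache[key]

    return lookup


def map_side_join(fact_rows: Iterable[dict], key_field: str,
                  lookup: Lookup) -> Iterator[dict]:
    """Enrich each fact row in place of a shuffle join; no rows change partition."""
    for row in fact_rows:
        dim = lookup(row[key_field]) or {}  # assumed left-join semantics on a miss
        yield {**row, **dim}


kv_store = {"p1": {"product_name": "pen"}}
fact_partition: List[dict] = [
    {"order_id": 1, "product_id": "p1"},
    {"order_id": 2, "product_id": "p9"},  # no dimension data for this key
]

joined = list(map_side_join(fact_partition, "product_id", make_dim_lookup(kv_store)))
```

Each partition of the fact table can run `map_side_join` independently, which is why only the dimension keys that actually occur in the fact data are ever fetched.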
[0046] To sum up, the embodiments described above have the following beneficial effects:
1. The foregoing logic is packaged using SQL, so the application threshold is lowered and development efficiency is improved;
2. For data skew related to large fact tables, the conventional SQL optimization approach involves optimizing the distribution of keys, so the operation is complicated. With the disclosed method, the data distribution of key values is no longer a concern, and compute resources can be distributed directly according to the data sizes of the set nodes, so as to achieve uniform distribution of data compute resources and efficient use of big-data compute resources;
3. For a large dimension table, only the dimension data to be used are loaded, so as to reduce compute overheads for acquiring dimension data and improve compute performance; and
4. In the prior art, dimension hotspot data used in individual sessions of SQL execution are not shared, and data loading/destroying has to be repeated every time data are used. In practical business applications, dimension hotspot data for different business compute scenes are rather similar. By introducing the use of an in-memory database, the current hotspot dimension data can be cached, and this allows SQLs to share hotspot data, thereby improving processing performance for big data SQLs.
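The benefit in item 4, sharing hotspot dimension data between SQL sessions, can be illustrated by counting reads against the KV database with and without a shared in-memory store. This is a toy model under stated assumptions: a dictionary stands in for the in-memory database, each call to `run_session` represents one SQL execution whose private cache is discarded afterwards, and `CountingKV` and `run_session` are hypothetical names.

```python
from typing import Dict, List, Optional


class CountingKV:
    """Dictionary-backed stand-in for the KV database that counts reads."""

    def __init__(self, data: Dict[str, dict]) -> None:
        self._data = dict(data)
        self.reads = 0

    def get(self, key: str) -> Optional[dict]:
        self.reads += 1
        return self._data.get(key)


def run_session(keys: List[str], hot_store: Dict[str, Optional[dict]],
                kv: CountingKV) -> List[Optional[dict]]:
    """One SQL session: private cache first, then the hot store, then the KV database."""
    session_cache: Dict[str, Optional[dict]] = {}  # destroyed when the session ends
    results = []
    for key in keys:
        if key not in session_cache:
            if key not in hot_store:
                hot_store[key] = kv.get(key)
            session_cache[key] = hot_store[key]
        results.append(session_cache[key])
    return results


dimension_data = {"d1": {"tier": "gold"}, "d2": {"tier": "silver"}}

# Sessions share one hot store, as with the in-memory database.
shared_kv = CountingKV(dimension_data)
shared_hot: Dict[str, Optional[dict]] = {}
run_session(["d1", "d2", "d1"], shared_hot, shared_kv)
run_session(["d1", "d2"], shared_hot, shared_kv)

# Sessions each get a fresh hot store, modelling the unshared prior-art case.
isolated_kv = CountingKV(dimension_data)
run_session(["d1", "d2", "d1"], {}, isolated_kv)
run_session(["d1", "d2"], {}, isolated_kv)

print("KV reads with sharing:", shared_kv.reads)
print("KV reads without sharing:", isolated_kv.reads)
```

With sharing, the second session finds both hot keys already cached, so the KV database is read twice in total; without sharing, each session reloads them and the count doubles to four.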

Embodiment 2
[0047] The present embodiment provides an apparatus for correlating data tables based on a KV database. The apparatus comprises:
[0048] a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
[0049] a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and using the retrieved data to update the compute nodes; and
[0050] a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and using the retrieved data to update the compute nodes and the in-memory database.
[0051] Preferably, given that the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit according to a data-table-correlating request SQL, reads the key value fields in the fact table from the fact table storing area, and then retrieves the data of the values from a local buffering area.
[0052] Exemplarily, the local buffering area is a cache.
[0053] Preferably, the step of using the data of the values to update the compute nodes comprises:
[0054] using the data of the values to update the cache.
[0055] As compared to the prior art, the disclosed apparatus for correlating data tables based on a KV database provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
Embodiment 3
[0056] The present embodiment provides a computer-readable storage medium, storing thereon a computer program. When the computer program is executed by a processor, it implements the steps of the method for correlating data tables based on a KV database as described previously.
[0057] As compared to the prior art, the disclosed computer-readable storage medium provides beneficial effects that are similar to those provided by the disclosed method for correlating data tables based on a KV database as enumerated above, and thus no repetitions are made herein.
[0058] As will be appreciated by people of ordinary skill in the art, all or a part of the steps of the method of the present invention as described previously may be implemented by having a program instruct related hardware components. The program may be stored in a computer-readable storage medium and, when executed, performs the individual steps of the methods described in the foregoing embodiments. The storage medium may be a ROM/RAM, a magnetic disk, an optical disk, a memory card or the like.
[0059] The present invention has been described with reference to the preferred embodiments and it is understood that the embodiments are not intended to limit the scope of the present invention. Moreover, as the contents disclosed herein should be readily understood and can be implemented by a person skilled in the art, all equivalent changes or modifications which do not depart from the concept of the present invention should be encompassed by the appended claims. Hence, the scope of the present invention shall only be defined by the appended claims.


Claims (10)

What is claimed is:
1. A method for correlating data tables based on a KV database, the method comprising:
according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and updating the retrieved data into the compute nodes; and
if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and updating the retrieved data into the compute nodes and the in-memory database.
2. The method of claim 1, wherein before the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes, the method further comprises:
periodically loading incremental data in a dimension table into the KV database.
3. The method of claim 1 or 2, wherein the step of according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes comprises:
where the local compute node includes a fact table storing area and a local buffering area, reading the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieving the data of the values from the local buffering area.
4. The method of claim 3, wherein the local buffering area is a cache.
5. The method of claim 4, wherein the step of updating the data of the values into the compute nodes comprises:
updating the data of the values into the cache.
6. An apparatus for correlating data tables based on a KV database, comprising:
a correlating and retrieving unit, for according to key value fields in a fact table, retrieving data of corresponding values from local compute nodes;
a first processing unit, for if there is no such data of the values in the local compute nodes, retrieving the data of the values from an in-memory database instead, and updating the retrieved data into the compute nodes; and
a second processing unit, for if there is no such data of the values in the in-memory database, retrieving the data of the values from the KV database instead, and updating the retrieved data into the compute nodes and the in-memory database.
7. The apparatus of claim 6, wherein the local compute node includes a fact table storing area and a local buffering area, the correlating and retrieving unit reads the key value fields in the fact table from the fact table storing area according to a data-table-correlating request SQL, and then retrieves the data of the values from the local buffering area.
8. The apparatus of claim 7, wherein the local buffering area is a cache.
9. The apparatus of claim 7, wherein the step of updating the data of the values into the compute nodes comprises:
updating the data of the values into the cache.

10. A computer-readable storage medium storing therein a computer program, wherein the computer program, when executed by a processor, performs a method as described in any of claims 1 through 5.

CA3142143A 2020-12-16 2021-12-14 Method and apparatus for correlating data tables based on kv database Pending CA3142143A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011487204.4A CN112487111A (en) 2020-12-16 2020-12-16 Data table association method and device based on KV database
CN202011487204.4 2020-12-16

Publications (1)

Publication Number Publication Date
CA3142143A1 (en) 2022-06-16

Family

ID=74917278

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3142143A Pending CA3142143A1 (en) 2020-12-16 2021-12-14 Method and apparatus for correlating data tables based on kv database

Country Status (2)

Country Link
CN (1) CN112487111A (en)
CA (1) CA3142143A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114374B (en) * 2022-06-27 2023-03-31 腾讯科技(深圳)有限公司 Transaction execution method and device, computing equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449038B2 (en) * 2012-11-26 2016-09-20 Amazon Technologies, Inc. Streaming restore of a database from a backup system
US9880936B2 (en) * 2014-10-21 2018-01-30 Sybase, Inc. Distributed cache framework
CN107231395A (en) * 2016-03-25 2017-10-03 阿里巴巴集团控股有限公司 Date storage method, device and system
CN109388654A (en) * 2017-08-04 2019-02-26 北京京东尚科信息技术有限公司 A kind of method and apparatus for inquiring tables of data
CN110471914B (en) * 2019-06-27 2022-07-12 苏宁云计算有限公司 Dimension association method and system in real-time data processing

Also Published As

Publication number Publication date
CN112487111A (en) 2021-03-12

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
