CA3191210A1 - Data synchronization method and device, computer equipment and storage medium - Google Patents

Data synchronization method and device, computer equipment and storage medium

Info

Publication number
CA3191210A1
Authority
CA
Canada
Prior art keywords
data
join
temporary
data amount
fact
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3191210A
Other languages
French (fr)
Inventor
Rui XIA
Xiaoqing ZHAI
Jinzhong Wang
Sheng Yang
Qian Sun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd
Publication of CA3191210A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data synchronization method, apparatus, computer device, and storage medium. The method comprises: receiving user data synchronization request information, the request information including field information of data analyzed online by the user; obtaining a fact table and a dimension table corresponding to the field information; left joining the fact table and the dimension table to obtain a join table; optimizing the join table; partitioning the data in the optimized join table according to data billing time; saving the partitioned data to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data; and writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization. The synchronization method performs distributed writing of data according to the master time specified by the user, achieving concurrent data synchronization and improving the accuracy of data synchronization.

Description

DATA SYNCHRONIZATION METHOD AND DEVICE, COMPUTER EQUIPMENT AND STORAGE MEDIUM
Technical Field
[0001] The present disclosure relates to the field of big data analysis technology, and in particular to a data synchronization method, apparatus, computer device, and storage medium.
Background
[0002] Online Analytical Processing (OLAP) is a technology for rapid analysis of shared multi-dimensional information. It uses multi-dimensional database technology to enable users to observe data from different angles. OLAP is mainly used to support complex analysis operations, with a focus on decision support for managers: it meets analysts' requirements for fast, flexible, complex queries over large amounts of data, and presents query results in an intuitive, easy-to-understand form to assist decision-making.
[0003] At present, data is usually synchronized from the data warehouse to the OLAP platform in full coverage mode: the data set in the data warehouse is synchronized to the OLAP platform according to the partition time. However, the partition time of the data warehouse is the data processing time, not the master time dimension; the master time dimension is the billing time of the data records. As a result, when a data synchronization operation is performed, the data cannot be synchronized to the OLAP platform according to the specified master time, which reduces the accuracy of data synchronization. In addition, the data records of one partition in the data warehouse can only be written to one partition in the OLAP platform, so concurrent data synchronization cannot be achieved.
Summary of the Invention
[0004] Based on this, it is necessary to provide a data synchronization method, apparatus, computer device, and storage medium to address the above-mentioned technical problem. The method can perform distributed writing of data according to the master time specified by the user, achieving concurrent data synchronization and improving the accuracy of data synchronization.
Date Reçue/Date Received 2023-02-27
[0005] In a first aspect, a data synchronization method is provided, the method comprising:
[0006] Receiving user data synchronization request information, the request information including field information of data analyzed online by the user;
[0007] Obtaining a fact table and a dimension table corresponding to the field information, and left joining the fact table and the dimension table to obtain a join table corresponding to the field information;
[0008] Performing a mapjoin operation on skewed data in the join table to obtain an optimized join table;
[0009] Classifying data in the optimized join table according to data billing time, and saving the classified data to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data;
[0010] Writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
[0011] In a possible implementation, writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization comprises:
[0012] Counting the data amount of the temporary table and of the optimized join table to obtain the data amount of the temporary table and the data amount of the join table;
[0013] When the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0014] In a possible implementation, the method also comprises:
[0015] When the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table;
[0016] When the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
[0017] In a possible implementation, the method also comprises:
[0018] When the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
[0019] In a possible implementation, the method also comprises:
[0020] When the data amount of the temporary table is inconsistent with the data amount of the join table, obtaining meta information and execution information of the temporary table;
[0021] Recording second status information of data in the temporary table as pre-submission status;
[0022] Saving the meta information, the execution information and the second status information to a relational database management system MySQL.

[0023] In a possible implementation, the method also comprises:
[0024] Obtaining a to-be-synchronized first fact table and a reverse table within a preset time;
[0025] Multiplying measurement data in the first fact table by a preset value to obtain a prepared fact table;
[0026] Adding the measurement data in the prepared fact table and in the reverse table to obtain a fact reverse table;
[0027] Joining the fact reverse table for data synchronization.
[0028] In a second aspect, a data synchronization apparatus is provided, the apparatus comprising:
[0029] A receiving module configured to receive user data synchronization request information, the request information including field information of data analyzed online by the user;
[0030] A joining module configured to obtain a fact table and a dimension table corresponding to the field information, and to left join the fact table and the dimension table to obtain a join table corresponding to the field information;
[0031] An optimizing module configured to perform a mapjoin operation on skewed data in the join table to obtain an optimized join table;
[0032] A partitioning module configured to classify data in the optimized join table according to data billing time and save the classified data to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data;
[0033] A synchronizing module configured to write the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
[0034] In a possible implementation, the synchronizing module is specifically used for:
[0035] Counting the data amount of the temporary table and of the optimized join table to obtain the data amount of the temporary table and the data amount of the join table;
[0036] When the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0037] In a third aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data synchronization method of the first aspect or of any implementation of the first aspect.
[0038] In a fourth aspect, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, implements the data synchronization method of the first aspect or of any implementation of the first aspect.
[0039] With the above-mentioned data synchronization method, apparatus, computer device, and storage medium, user data synchronization request information is received, the request information including field information of data analyzed online by the user; a fact table and a dimension table corresponding to the field information are obtained and left joined to obtain a join table corresponding to the field information; a mapjoin operation is performed on skewed data in the join table to obtain an optimized join table; data in the optimized join table is classified according to data billing time, and the classified data is saved to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data; and the partitioned data is written into a temporary table of a column-oriented database management unit ClickHouse for data synchronization. The method can thus perform distributed writing of data according to the master time specified by the user, achieving concurrent data synchronization and improving the accuracy of data synchronization.
Description of the Drawings
[0040] Figure 1 is an application environment diagram of the data synchronization method in an embodiment;
[0041] Figure 2 is a process diagram of the data synchronization method in an embodiment;
[0042] Figure 3 is a structural diagram of the data synchronization apparatus in an embodiment;
[0043] Figure 4 is an internal structural diagram of a computer device in an embodiment.
Detailed Description
[0044] To make the purposes, technical solutions, and advantages of the present application clearer, the present application is further explained in detail below with reference to the drawings and particular embodiments. It shall be understood that the specific embodiments described here are only used to explain the present application, not to limit its scope.
[0045] The data synchronization method provided by the present application can be applied to the data synchronization system shown in Figure 1. The system includes a data warehouse module 110, an OLAP joining module 120, and an OLAP engine module 130, wherein the OLAP joining module 120 includes an OLAP-HIVE cluster (a data warehouse tool for online analysis and processing), and the OLAP engine module 130 includes the database management unit ClickHouse. The data warehouse module 110 is configured to synchronize the fact table and the dimension table to the OLAP-HIVE cluster; the OLAP-HIVE cluster is configured to join the fact table and the dimension table and write the joined data to ClickHouse; and ClickHouse is configured to synchronize the joined data.
[0046] In some embodiments, as shown in Figure 2, a data synchronization method is provided, the method comprises following steps:
[0047] S210, receiving user data synchronization request information, the request information includes field information of data analyzed by user online.
[0048] When a user performs data analysis through the OLAP platform, the user inputs a query statement through the OLAP platform interface. The interactive interface of the OLAP platform receives the query statement input by the user, generates request information according to the query statement, and sends the request information to the data synchronization system. The data synchronization system receives the request information, wherein the request information includes field information, and the field information is the keywords with which the user obtains data for online analysis.
[0049] S220, obtaining a fact table and a dimension table corresponding to the field information, left joining the fact table and the dimension table to obtain a join table corresponding to the field information.
[0050] The fact table is the central table in the data warehouse structure. It contains numeric measurement values and the keys that link the fact table to the dimension tables, and it holds data describing specific events within the service.
[0051] The dimension table can be seen as a window through which the user analyzes data. It contains features of the fact records in the fact table: some features provide descriptive information, and some specify how to summarize the fact table data to provide useful information for the analyst. The dimension table also contains hierarchies of attributes to help summarize data.

[0052] The fact table and the dimension table corresponding to the field information are obtained from the data warehouse module and joined through the OLAP-HIVE cluster; in other words, a left join is performed on the fact table and the dimension table. One of the tables is taken as the left table and the other as the right table. All rows of the left table appear in the join table, together with the data of the right table that satisfies the field information conditions; where a row of the left table has no corresponding data in the right table, the right-table fields are null. By joining the fact table and the dimension table, the data used by the user for online analysis is brought together in one table, so that the user can analyze the data more conveniently and intuitively.
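The left join described in [0052] can be illustrated with a minimal Python sketch. The table contents, the column names (`order_id`, `store_id`, `amount`, `region`) and the function `left_join` are hypothetical and chosen only for illustration; in the described system the join is performed by the OLAP-HIVE cluster, not by application code.

```python
from typing import Any

Row = dict[str, Any]


def left_join(fact_rows: list[Row], dim_rows: list[Row], key: str) -> list[Row]:
    """Keep every fact row; fill dimension columns with None when there is no match."""
    # Index the dimension table by join key (assumes one dimension row per key).
    dim_by_key = {row[key]: row for row in dim_rows}
    # Columns contributed by the dimension table, excluding the join key itself.
    dim_columns = sorted({col for row in dim_rows for col in row if col != key})

    joined = []
    for fact in fact_rows:
        match = dim_by_key.get(fact[key])
        out = dict(fact)
        for col in dim_columns:
            out[col] = match.get(col) if match is not None else None
        joined.append(out)
    return joined


# Hypothetical sample data: the fact table is the left table.
fact = [
    {"order_id": 1, "store_id": "s1", "amount": 30.0},
    {"order_id": 2, "store_id": "s9", "amount": 12.5},
]
dim = [{"store_id": "s1", "region": "East"}]
joined = left_join(fact, dim, "store_id")
```

Both fact rows survive the join; the second row has no matching dimension row, so its `region` field is null (`None`).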
[0053] S230, performing a mapjoin operation on skewed data in the join table to obtain an optimized join table.
[0054] During the join of the fact table and the dimension table, if the data amounts corresponding to the individual dimensions differ greatly, with a particularly large amount of data corresponding to one or several dimensions, data skew occurs. Data skew extends the data synchronization time, so the join table needs to be optimized by performing a mapjoin operation on it.
[0055] The skewed data in the join table is divided into a large table and a small table; the small table is loaded into memory, the large table is scanned sequentially, and the join operation is performed directly on the map side to obtain the optimized join table. Because data synchronization is performed on the optimized join table, the impact of skewed data is greatly reduced, the time for data synchronization is shortened, and the data synchronization speed is improved.
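The map-side join of [0055] can be sketched as follows, assuming the small table fits in memory. The names `build_hash_table` and `map_side_join`, the splits, and the sample rows are hypothetical; in Hive the equivalent behaviour is provided by the engine's map join, not by user code. The point of the sketch is that each map task processes its own split of the large table and probes an in-memory copy of the small table, so rows are never routed by join key and a heavily skewed key does not pile up on a single worker.

```python
from typing import Any, Iterable, Iterator

Row = dict[str, Any]


def build_hash_table(small_table: Iterable[Row], key: str) -> dict[Any, Row]:
    """Load the small table into memory, indexed by the join key."""
    return {row[key]: row for row in small_table}


def map_side_join(split: Iterable[Row], hash_table: dict[Any, Row], key: str) -> Iterator[Row]:
    """One map task: scan a split of the large table and probe the in-memory table."""
    for row in split:
        match = hash_table.get(row[key], {})
        # Unmatched rows keep only their own columns; large-table values win on name clashes.
        yield {**match, **row}


small = [{"store_id": "s1", "region": "East"}, {"store_id": "s2", "region": "West"}]
# The key "s1" is skewed, but its rows are spread over splits instead of being grouped by key.
large_splits = [
    [{"store_id": "s1", "amount": 1.0}, {"store_id": "s1", "amount": 2.0}],
    [{"store_id": "s1", "amount": 3.0}, {"store_id": "s2", "amount": 4.0}],
]
table = build_hash_table(small, "store_id")
result = [row for split in large_splits for row in map_side_join(split, table, "store_id")]
```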
[0056] S240, classifying data in the optimized join table according to data billing time, and saving the classified data to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data.
[0057] Data with the same billing time in the optimized join table is classified into one category. The partitions in the Hadoop Distributed File System (HDFS) cluster are divided by billing date, and the classified data is saved to the corresponding path in the HDFS cluster and added to the Hive partition. For example, the data with the billing date of 2021.12.25 is classified into one category and the data with the billing date of 2021.12.26 into another; the data with the billing date of 2021.12.25 is saved to the 2021.12.25 partition of the HDFS cluster, and the data with the billing date of 2021.12.26 is saved to the 2021.12.26 partition of the HDFS cluster.
[0058] The classified data is saved to the corresponding partitions of the HDFS cluster under a global lock to ensure concurrent data synchronization.
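The classification by billing date in [0057] amounts to grouping rows by date and mapping each group to its own partition path. The sketch below assumes a hypothetical base path and a `billing_date` column; the `column=value` directory layout follows the usual Hive partition convention, and no actual HDFS write is performed.

```python
from collections import defaultdict
from datetime import date


def partition_by_billing_date(rows, base_path="/warehouse/join_table"):
    """Group rows by billing date and map each group to a partition path."""
    partitions = defaultdict(list)
    for row in rows:
        day = row["billing_date"]
        # Hive-style partition directory: <base>/billing_date=YYYY-MM-DD
        path = f"{base_path}/billing_date={day.isoformat()}"
        partitions[path].append(row)
    return dict(partitions)


# Hypothetical rows of the optimized join table.
rows = [
    {"order_id": 1, "billing_date": date(2021, 12, 25)},
    {"order_id": 2, "billing_date": date(2021, 12, 26)},
    {"order_id": 3, "billing_date": date(2021, 12, 25)},
]
partitions = partition_by_billing_date(rows)
```

Because each billing date lands in its own partition, the partitions can afterwards be written to the target independently of one another.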
[0059] S250, writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
[0060] According to the billing time of the partitioned data, writing all the partitioned data into the position of ClickHouse temporary table corresponding to the billing time, then synchronizing to the master table to complete the data synchronization.
[0061] In the embodiment of the present application, user data synchronization request information is received, the request information including field information of data analyzed online by the user; a fact table and a dimension table corresponding to the field information are obtained and left joined to obtain a join table corresponding to the field information; a mapjoin operation is performed on skewed data in the join table to obtain an optimized join table; data in the optimized join table is classified according to data billing time, and the classified data is saved to the corresponding partitions of a distributed file system HDFS cluster to obtain partitioned data; and the partitioned data is written into a temporary table of a column-oriented database management unit ClickHouse for data synchronization. The method can thus perform distributed writing of data according to the master time specified by the user, achieving concurrent data synchronization and improving the accuracy of data synchronization.

[0062] In some embodiments, writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization, comprising:
[0063] Counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table;
[0064] When the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0065] The partitioned data is written into the ClickHouse temporary table, and the data amounts of the temporary table and of the optimized join table are counted. When the data amount of the temporary table is consistent with the data amount of the optimized join table, this indicates that the HDFS cluster has stored the data accurately. If the synchronization task is successful, the data of the temporary table is synchronized to the master table through the attach-partition-from method for presentation to the user.
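The attach-partition-from step in [0065] presumably refers to ClickHouse's `ALTER TABLE ... ATTACH PARTITION ... FROM ...` statement, which copies a partition from one table into another table of the same structure. The sketch below only builds the statements after the row-count check passes; the table names, the function `attach_statements`, and the assumption that both tables are partitioned by a date column (so the partition expression is a quoted date) are illustrative and not taken from the source.

```python
def attach_statements(master_table, temp_table, partitions, temp_rows, join_rows):
    """Return ATTACH PARTITION statements only when the row counts agree."""
    if temp_rows != join_rows:
        # Counts differ: the temporary table is not promoted.
        return []
    return [
        f"ALTER TABLE {master_table} ATTACH PARTITION '{p}' FROM {temp_table}"
        for p in partitions
    ]


# Hypothetical table names and counts.
stmts = attach_statements(
    "sales_master", "sales_tmp", ["2021-12-25", "2021-12-26"],
    temp_rows=1000, join_rows=1000,
)
skipped = attach_statements(
    "sales_master", "sales_tmp", ["2021-12-25"], temp_rows=990, join_rows=1000,
)
```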
[0066] In some embodiments, the method also comprises:
[0067] When the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table;
[0068] When the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
[0069] When the data amount of the temporary table is inconsistent with the data amount of the join table, this indicates that the partitioned data in the HDFS cluster has not been completely written into the ClickHouse temporary table and the synchronization task has failed. In this case, the data in the temporary table cannot be synchronized to the master table and presented to the user, and the data synchronization needs to be performed again. The partitioned data is rewritten into the temporary table of ClickHouse to obtain a new temporary table (in other words, the temporary table is rewritten), and the data amount of the rewritten temporary table is again compared with the data amount of the join table. If they are still inconsistent, the partitioned data continues to be rewritten into the ClickHouse temporary table until the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table. The partitioned data of the rewritten temporary table is then written to the master table of ClickHouse to complete data synchronization and ensure its accuracy.
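The rewrite-until-consistent behaviour of [0069] can be sketched as a small retry loop. The callables `write_temp` and `promote` are hypothetical stand-ins for the ClickHouse write and the master-table promotion. The `max_attempts` bound is an added assumption: the source describes retrying until the counts match and does not state a limit.

```python
from typing import Callable


def sync_with_retry(
    write_temp: Callable[[], int],
    expected_rows: int,
    promote: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Rewrite the temporary table until its row count matches, then promote it.

    Returns the number of attempts used. max_attempts is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        written = write_temp()  # (re)writes the temporary table, returns its row count
        if written == expected_rows:
            promote()  # write the temporary table's data into the master table
            return attempt
    raise RuntimeError(f"row counts still inconsistent after {max_attempts} attempts")


# Simulated writer: the first write is incomplete, the second is complete.
counts = iter([90, 100])
promoted = []
attempts = sync_with_retry(lambda: next(counts), 100, lambda: promoted.append("master"))
```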
[0070] In some embodiments, the method also comprises:
[0071] When the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
[0072] A transaction is the unit by which a database maintains data consistency: it transitions the database from one consistent state to a new consistent state. In short, a set of processing steps is called a transaction if either all or none of them are executed. Since data synchronization across a plurality of nodes cannot be guaranteed to succeed on every node, a corresponding distributed transaction is deployed during the data synchronization process to ensure the integrity and reliability of the synchronized data.
[0073] The first status information is the status information of the data in the master table that has been successfully synchronized, the submission status indicates that the transaction is ended and all steps of data synchronization are completed. After the synchronization task is successful and the data of the temporary table is written to the master table, recording the status of data in the master table as submission status.

[0074] In some embodiments, the method also comprises:
[0075] When the data amount of the temporary table is inconsistent with the data amount of the join table, obtaining meta information and execution information of the temporary table;
[0076] Recording second status information of data in the temporary table as pre-submission status;
[0077] Saving the meta information, the execution information and the second status information to a relational database management system MySQL.
[0078] The meta information is the service description information of the data, and the execution information includes data synchronization failure information and data synchronization success information. When the data amount of the temporary table is inconsistent with the data amount of the join table, the data synchronization failure information and the meta information of the temporary table are obtained. The second status information is the status information of data in the master table identified by synchronization; the pre-submission status indicates that the transaction has ended, that all data synchronization steps have failed, and that the ClickHouse data synchronization needs to be performed again. When the synchronization task fails, the status of the data in the temporary table is recorded as pre-submission, and the partitioned data is rewritten into the temporary table. Meanwhile, the meta information, the execution information, and the second status information are saved in the relational database management system MySQL, updating the data records in MySQL.
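The status bookkeeping of [0078] can be sketched as a single-table record store. To keep the example self-contained, Python's built-in `sqlite3` module is used as a stand-in for MySQL; the table name `sync_status`, its columns, and the sample values are hypothetical, and the source does not describe the actual schema.

```python
import sqlite3

PRE_SUBMISSION = "pre-submission"
SUBMISSION = "submission"


def open_status_store() -> sqlite3.Connection:
    """Create an in-memory store (SQLite standing in for MySQL in this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sync_status ("
        " temp_table TEXT PRIMARY KEY,"
        " meta_info TEXT, execution_info TEXT, status TEXT)"
    )
    return conn


def record_status(conn, temp_table, meta_info, execution_info, status):
    """Insert or update the record kept for one temporary table."""
    conn.execute(
        "INSERT OR REPLACE INTO sync_status VALUES (?, ?, ?, ?)",
        (temp_table, meta_info, execution_info, status),
    )
    conn.commit()


conn = open_status_store()
# Failed synchronization: row counts did not match, so the status is pre-submission.
record_status(conn, "sales_tmp", "daily sales by store", "row count mismatch", PRE_SUBMISSION)
status = conn.execute(
    "SELECT status FROM sync_status WHERE temp_table = ?", ("sales_tmp",)
).fetchone()[0]
```

A later successful run would call `record_status` again with `SUBMISSION`, replacing the earlier record rather than adding a second one.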
[0079] After all the data in the temporary table has been written to the master table, whether the data amount of the temporary table is consistent with the data amount of the master table is verified. If not, the ClickHouse data is re-synchronized; if so, the execution information of ClickHouse is restored and the global lock is released.
[0080] When the ClickHouse data synchronization process cannot guarantee the transaction, all the partitioned data will be overwritten in the master table of ClickHouse.
[0081] In some embodiments, the method also comprises:
[0082] Obtaining a to-be-synchronized first fact table and a reverse table within a preset time;
[0083] Multiplying measurement data in the first fact table with a preset value to obtain a prepared fact table;
[0084] Adding measurement data in the prepared fact table and the reverse table to obtain a fact reverse table;
[0085] Joining the fact reverse table for data synchronization.
[0086] In some cases, the data of the current day needs to be synchronized; this is called reverse supplement. The reverse data of the current day is valid only for the current day and does not affect the scheduling data of the next day.
[0087] The preset time is the day specified by the user, and the preset value is -1. The fact table, the dimension table, and the reverse table of the current day are obtained; the measurement data in the fact table is multiplied by -1 to obtain the prepared fact table; the measurement data in the prepared fact table and in the reverse table is added to obtain the fact reverse table; and the fact reverse table is joined and the joined fact reverse table is synchronized to the master table of ClickHouse for presentation in reports.
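The arithmetic of the reverse supplement in [0087] can be illustrated as below. The source does not say how the measurement data of the two tables is "added"; this sketch assumes a per-key sum, so the result is the net correction for each key. The column names and the functions `negate_measures` and `add_measures` are hypothetical.

```python
def negate_measures(fact_rows, measure, preset_value=-1):
    """Multiply the measurement column by the preset value (-1 in the source)."""
    return [{**row, measure: row[measure] * preset_value} for row in fact_rows]


def add_measures(prepared_rows, reverse_rows, key, measure):
    """Sum the measurement column per key across both tables (assumed interpretation)."""
    totals = {}
    for row in prepared_rows + reverse_rows:
        totals[row[key]] = totals.get(row[key], 0) + row[measure]
    return [{key: k, measure: v} for k, v in totals.items()]


# Hypothetical data for the current day.
fact = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 40}]
reverse = [{"order_id": 1, "amount": 120}, {"order_id": 2, "amount": 40}]

prepared = negate_measures(fact, "amount")
fact_reverse = add_measures(prepared, reverse, "order_id", "amount")
```

Under this interpretation, order 1 nets to 20 (its reverse value of 120 minus the already synchronized 100) and order 2 nets to 0.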
[0088] It should be noted that although the steps of the process diagram in Figure 2 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order in which these steps must be performed, and they can be performed in other orders. In addition, at least some of the steps in Figure 2 can include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but can be executed at different times, and they are not necessarily executed in sequence but can be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
[0089] In some embodiments, as shown in Figure 3, a data synchronization apparatus is provided, the apparatus comprises: receiving module 310, joining module 320, optimizing module 330, partitioning module 340 and synchronizing module 350, wherein:
[0090] A receiving module 310 configured to receive user data synchronization request information, the request information includes field information of data analyzed by user online;
[0091] A joining module 320 configured to obtain a fact table and a dimension table corresponding to the field information, left join the fact table and the dimension table to obtain a join table corresponding to the field information;
[0092] An optimizing module 330 configured to perform a mapjoin operation on skewed data in the join table to obtain an optimized join table;
[0093] A partitioning module 340 configured to classify according to data billing time in the optimized join table and save the classified data to corresponding partition of a distributed file system HDFS cluster to obtain partitioned data;
[0094] A synchronizing module 350 configured to write the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
[0095] In the embodiments of the present application, the method can perform distributed writing of data according to master time specified by user to achieve concurrency of data synchronization and improve the accuracy of data synchronization.

[0096] In some embodiments, the synchronizing module is specifically used for:
[0097] Counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table;
[0098] When the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0099] In some embodiments, the apparatus also includes a rewriting module configured to perform the following:
[0100] When the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table;
[0101] When the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
[0102] In some embodiments, the apparatus also includes: a recording module configured to,
[0103] When the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
[0104] In some embodiments, the apparatus also includes:
[0105] An obtaining module 380 configured to obtain meta information and execution information of the temporary table when the data amount of the temporary table is inconsistent with the data amount of the join table;
[0106] A recording module 370 configured to record second status information of data in the temporary table as pre-submission status;
[0107] A storing module 390 configured to save the meta information, the execution information and the second status information to a relational database management system MySQL.
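The status bookkeeping performed by the obtaining, recording and storing modules can be sketched as below. Python's built-in `sqlite3` is used here only as a stand-in for MySQL so that the example is self-contained; the table name `sync_status`, its columns and the JSON encoding of the meta and execution information are hypothetical.

```python
import json
import sqlite3


def record_pre_submission(conn, table_name, meta, execution):
    """Persist meta information, execution information and a pre-submission status."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_status ("
        "table_name TEXT, status TEXT, meta TEXT, execution TEXT)"
    )
    conn.execute(
        "INSERT INTO sync_status (table_name, status, meta, execution) "
        "VALUES (?, ?, ?, ?)",
        (table_name, "pre-submission", json.dumps(meta), json.dumps(execution)),
    )
    conn.commit()
```

A later run could read these rows back to decide which temporary tables still need to be re-written before they are submitted to the master table.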
[0108] In some embodiments, the apparatus also includes:
[0109] An obtaining module 380 configured to obtain a to-be-synchronized first fact table and a reverse table within a preset time;
[0110] A multiplying module 3100 configured to multiply measurement data in the first fact table with a preset value to obtain a prepared fact table;
[0111] An adding module configured to add measurement data in the prepared fact table and the reverse table to obtain a fact reverse table;
[0112] A synchronizing module configured to join the fact reverse table for data synchronization.
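One possible reading of the multiplying and adding modules is sketched below. It assumes that the measurement data are summed per key and, in the usage example, that the preset value is -1 so that the previously synchronized fact values are offset by the reverse table. The function name, the column names and the choice of -1 are illustrative assumptions and are not stated in this passage.

```python
def build_fact_reverse(fact_rows, reverse_rows, key, measure, preset_value):
    """Scale the fact measures by preset_value, then add the reverse-table measures."""
    totals = {}
    for row in fact_rows:
        # Prepared fact table: measurement data multiplied by the preset value.
        totals[row[key]] = totals.get(row[key], 0) + row[measure] * preset_value
    for row in reverse_rows:
        # Fact reverse table: prepared measures plus reverse-table measures.
        totals[row[key]] = totals.get(row[key], 0) + row[measure]
    return [{key: k, measure: v} for k, v in totals.items()]


fact = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50}]
reverse = [{"order_id": 1, "amount": 120}]
fact_reverse = build_fact_reverse(fact, reverse, "order_id", "amount", preset_value=-1)
```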
[0113] For specific limitations of the data synchronization apparatus, refer to the limitations of the data synchronization method described above, which are not repeated here. Each module of the above data synchronization apparatus can be implemented fully or partly by software, hardware, or a combination thereof. The above modules can be embedded in or independent of the processor in the computer device, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
[0114] In some embodiments, a computer device is provided. The computer device can be a server whose internal structure is shown in Figure 4. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is configured to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to implement a data synchronization method.
[0115] Those skilled in the art can understand that the structure shown in Figure 4 is only a partial structural diagram related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device can include more or fewer components than shown in the figure, can combine some components, or can have a different arrangement of components.
[0116] In some embodiments, a computer device is provided, including a memory, a processor and a computer program stored in the memory and run on the processor, wherein the processor implements the following steps when executing the computer program:
[0117] Receiving user data synchronization request information, the request information includes field information of data analyzed by user online;
[0118] Obtaining a fact table and a dimension table corresponding to the field information, left joining the fact table and the dimension table to obtain a join table corresponding to the field information;
[0119] Performing a mapjoin operation to skewed data in the join table to obtain an optimized join table;
[0120] Classifying according to data billing time in the optimized join table, saving the classified data to corresponding partition of a distributed file system HDFS cluster to obtain partitioned data;
[0121] Writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
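The classifying step above can be sketched as grouping rows by the calendar day of their billing time and mapping each group to a partition directory. The field name `billing_time`, the ISO 8601 timestamp format, the daily granularity and the `dt=` path layout are assumptions for illustration; the rows are only grouped in memory and nothing is written to HDFS here.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_billing_time(rows, time_field="billing_time",
                              base_path="/warehouse/join_table"):
    """Group rows by billing day and return a mapping of partition path to rows."""
    partitions = defaultdict(list)
    for row in rows:
        # Derive the partition value (one partition per day) from the billing time.
        day = datetime.fromisoformat(row[time_field]).date().isoformat()
        partitions[f"{base_path}/dt={day}"].append(row)
    return dict(partitions)
```

Because each partition holds the rows of one billing day, the partitions can then be written to the temporary table independently of one another.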
[0122] In some embodiments, the processor performs the following steps when executing the computer program: writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization, comprising: counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table; when the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0123] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table; when the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
[0124] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
[0125] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is inconsistent with the data amount of the join table, obtaining meta information and execution information of the temporary table; recording second status information of data in the temporary table as pre-submission status; saving the meta information, the execution information and the second status information to a relational database management system MySQL.
[0126] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: obtaining a to-be-synchronized first fact table and a reverse table within a preset time; multiplying measurement data in the first fact table with a preset value to obtain a prepared fact table; adding measurement data in the prepared fact table and the reverse table to obtain a fact reverse table; joining the fact reverse table for data synchronization.
[0127] In an embodiment, a computer readable storage medium is provided, the medium storing a computer program, and a processor performs the following steps when executing the computer program:
[0128] Receiving user data synchronization request information, the request information includes field information of data analyzed by user online;
[0129] Obtaining a fact table and a dimension table corresponding to the field information, left joining the fact table and the dimension table to obtain a join table corresponding to the field information;

[0130] Performing a mapjoin operation to skewed data in the join table to obtain an optimized join table;
[0131] Classifying according to data billing time in the optimized join table, saving the classified data to corresponding partition of a distributed file system HDFS cluster to obtain partitioned data;
[0132] Writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
[0133] In some embodiments, the processor performs the following steps when executing the computer program: writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization, comprising: counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table; when the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
[0134] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table; when the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
[0135] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
[0136] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: when the data amount of the temporary table is inconsistent with the data amount of the join table, obtaining meta information and execution information of the temporary table; recording second status information of data in the temporary table as pre-submission status; saving the meta information, the execution information and the second status information to a relational database management system MySQL.
[0137] In some embodiments, the processor performs the following steps when executing the computer program: the method also comprises: obtaining a to-be-synchronized first fact table and a reverse table within a preset time; multiplying measurement data in the first fact table with a preset value to obtain a prepared fact table; adding measurement data in the prepared fact table and the reverse table to obtain a fact reverse table; joining the fact reverse table for data synchronization.
[0138] Those skilled in the art can understand that all or part of the procedures in the above-mentioned methods can be performed by related hardware under the instruction of a computer program. The computer program can be stored in a non-volatile computer readable storage medium and, when executed, can include the procedures of the embodiments of the above-mentioned methods. Any reference to the memory, the storage, the database, or other media used in the embodiments provided in the current application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0139] The technical features of the above-mentioned embodiments can be combined arbitrarily. For conciseness, not all possible combinations of the technical features in the above-mentioned embodiments are described. However, as long as there is no conflict in a combination of these technical features, the combination shall be considered within the scope of this description.
[0140] The above-mentioned embodiments are only several embodiments of this disclosure. Their description is relatively specific and detailed, but it shall not be understood as limiting the scope of the invention patent. Evidently, those of ordinary skill in the art can make various modifications and variations to the disclosure without departing from the spirit and scope of the disclosure. Therefore, the appended claims are intended to be construed as encompassing the described embodiments and all the modifications and variations coming into the scope of the disclosure.


Claims (10)

Claims:
1. A data synchronization method, comprising:
receiving user data synchronization request information, the request information includes field information of data analyzed by user online;
obtaining a fact table and a dimension table corresponding to the field information, left joining the fact table and the dimension table to obtain a join table corresponding to the field information;
performing a mapjoin operation to skewed data in the join table to obtain an optimized join table;
classifying according to data billing time in the optimized join table, saving the classified data to corresponding partition of a distributed file system HDFS cluster to obtain partitioned data; and
writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
2. The method according to claim 1, wherein, writing the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization, comprising:
counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table; and
when the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.

3. The method according to claim 2, wherein, the method also comprises:
when the data amount of the temporary table is inconsistent with the data amount of the join table, re-writing the partitioned data into the temporary table of ClickHouse to obtain a rewritten temporary table; and
when the data amount in the rewritten temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the rewritten temporary table into a master table of ClickHouse to complete data synchronization.
4. The method according to claim 2, wherein, the method also comprises:
when the data amount of the temporary table is consistent with the data amount of the optimized join table, after writing the partitioned data in the temporary table into a master table of ClickHouse, recording first status information of data in the master table as submission status.
5. The method according to claim 3, wherein, the method also comprises:
when the data amount of the temporary table is inconsistent with the data amount of the join table, obtaining meta information and execution information of the temporary table;
recording second status information of data in the temporary table as pre-submission status; and
saving the meta information, the execution information and the second status information to a relational database management system MySQL.
6. The method according to claim 1, wherein, the method also comprises:
obtaining a to-be-synchronized first fact table and a reverse table within a preset time;
multiplying measurement data in the first fact table with a preset value to obtain a prepared fact table;
adding measurement data in the prepared fact table and the reverse table to obtain a fact reverse table; and
joining the fact reverse table for data synchronization.
7. A data synchronization apparatus, wherein, the apparatus comprises:
a receiving module configured to receive user data synchronization request information, the request information includes field information of data analyzed by user online;
a joining module configured to obtain a fact table and a dimension table corresponding to the field information, left join the fact table and the dimension table to obtain a join table corresponding to the field information;
an optimizing module configured to perform a mapjoin operation to skewed data in the join table to obtain an optimized join table;
a partitioning module configured to classify according to data billing time in the optimized join table and save the classified data to corresponding partition of a distributed file system HDFS cluster to obtain partitioned data; and
a synchronizing module configured to write the partitioned data into a temporary table of a column-oriented database management unit ClickHouse for data synchronization.
8. The apparatus according to claim 7, wherein, the synchronizing module is specifically used for:
counting data amount of the temporary table and the optimized join table to obtain the data amount of the temporary table and the data amount of the join table; and
when the data amount of the temporary table is consistent with the data amount of the optimized join table, writing the partitioned data in the temporary table into a master table of ClickHouse to complete data synchronization.
9. A computer device, including a memory, a processor and a computer program stored in the memory and run on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.

CA3191210A 2022-02-25 2023-02-27 Data syncronization method and device, computer equipment and storage medium Pending CA3191210A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210178534.8 2022-02-25
CN202210178534.8A CN114579567A (en) 2022-02-25 2022-02-25 Data synchronization method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CA3191210A1 true CA3191210A1 (en) 2023-08-25

Family

ID=81775250

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3191210A Pending CA3191210A1 (en) 2022-02-25 2023-02-27 Data syncronization method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114579567A (en)
CA (1) CA3191210A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718787B (en) * 2023-01-09 2023-05-05 百融至信(北京)科技有限公司 Data table data synchronization method, query method, electronic device and storage medium

Also Published As

Publication number Publication date
CN114579567A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
US11468062B2 (en) Order-independent multi-record hash generation and data filtering
US10554771B2 (en) Parallelized replay of captured database workload
US10262002B2 (en) Consistent execution of partial queries in hybrid DBMS
US7844570B2 (en) Database generation systems and methods
CN110209650B (en) Data normalization and migration method and device, computer equipment and storage medium
US8078579B2 (en) Data source currency tracking and currency based execution
US11868330B2 (en) Method for indexing data in storage engine and related apparatus
CN110188114B (en) Data operation optimization method, device, system, equipment and storage medium
CN110489092B (en) Method for solving read data delay problem under database read-write separation architecture
US20180300147A1 (en) Database Operating Method and Apparatus
CN115145943B (en) Method, system, equipment and storage medium for rapidly comparing metadata of multiple data sources
RU2711348C1 (en) Method and system for processing requests in a distributed database
CA3191210A1 (en) Data syncronization method and device, computer equipment and storage medium
Margara et al. A model and survey of distributed data-intensive systems
CN111522881B (en) Service data processing method, device, server and storage medium
CN115329011A (en) Data model construction method, data query method, data model construction device and data query device, and storage medium
US9009098B1 (en) Methods and apparatus for creating a centralized data store
Mazumdar et al. The Data Lakehouse: Data Warehousing and More
EP3783502A1 (en) System for persisting application program data objects
CN116755699A (en) Compiling processing method, compiling processing device, electronic equipment and storage medium
CN112765126B (en) Database transaction management method, device, computer equipment and storage medium
CN116303822A (en) Data warehouse management method, device, computer equipment and storage medium
CN115098503A (en) Null value data processing method and device, computer equipment and storage medium
CN109710698A (en) A kind of data assemblage method, device, electronic equipment and medium
JP2023546818A (en) Transaction processing method, device, electronic device, and computer program for database system

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20230919
