CN111680017A - Data synchronization method and device - Google Patents

Data synchronization method and device

Info

Publication number
CN111680017A
CN111680017A
Authority
CN
China
Prior art keywords
event
synchronization
file system
operation event
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010615069.0A
Other languages
Chinese (zh)
Inventor
蒋超
谢健
邸帅
卢道和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010615069.0A priority Critical patent/CN111680017A/en
Publication of CN111680017A publication Critical patent/CN111680017A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/182: Distributed file systems
    • G06F 16/119: Details of migration of file systems
    • G06F 16/178: Techniques for file synchronisation in file systems
    • G06F 16/214: Database migration support
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/283: Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention relates to the field of financial technology (Fintech) and discloses a data synchronization method and apparatus. The method comprises: monitoring first operation events on the metadata of a source data warehouse; determining second operation events from the first operation events according to a preset rule, and generating the synchronization tasks corresponding to the second operation events; and executing the synchronization tasks to synchronize the source file system with the destination file system. By starting from the cause of the source-data change, the invention achieves more accurate and efficient data synchronization: changes to the source data are captured, stored, and aggregated accurately and promptly, then synchronized to the destination file system. This avoids the time-consuming recursive traversal of data paths; monitoring and intercepting source-side data changes improves the performance and efficiency of incremental synchronization and greatly reduces data-comparison overhead.

Description

Data synchronization method and device
Technical Field
The invention relates to the technical field of financial technology (Fintech), in particular to a data synchronization method and device.
Background
With the development of computer technology, more and more technologies (such as distributed architecture, cloud computing, and big data) are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Big data technology is no exception; the security and real-time requirements of the finance and payments industries place higher demands on it.
During cluster data migration, the data and metadata of a source cluster are generally first synchronized in full to a destination cluster, so that the two clusters are basically consistent; newly added data on the source cluster is then periodically synchronized to the destination cluster through incremental synchronization, keeping the clusters at both ends dynamically consistent.
The main role of full synchronization is to greatly reduce the data difference between the source and destination clusters, creating the conditions for incremental synchronization to keep the two clusters essentially identical.
In existing cross-cluster data synchronization schemes, incremental synchronization indirectly judges the difference between the metadata of the source and destination clusters by comparing their data paths, and synchronizes metadata according to that difference. Specifically, incremental synchronization generally performs task configuration and data comparison at the database level (or table level and partition level), and the comparison uses path data such as file size, directory count, and file count. The comparison generally yields one of the following cases:
(1) if the data are inconsistent and the destination cluster holds more data than the source cluster, the redundant destination data are deleted;
(2) if the data are inconsistent and the destination cluster holds less data than the source cluster, the missing data are copied from the source cluster;
(3) if the data at both ends are completely consistent, the comparison succeeds and no processing is performed.
This scheme, on the one hand, must traverse and compare every data path in the database tables, which is inefficient; on the other hand, using path comparison as an indirect basis for judging the metadata difference between the destination and source clusters carries a high probability of misjudgment, increasing the cost of the destination cluster repeatedly synchronizing data and metadata.
Disclosure of Invention
The present application provides a data synchronization method and apparatus to address the problem of how to synchronize data accurately and efficiently.
In a first aspect, an embodiment of the present application provides a data synchronization method, applicable to a distributed file system having a data warehouse; the method comprises the following steps:
monitoring first operation events on the metadata of a source data warehouse;
determining second operation events from the first operation events according to a preset rule, wherein a second operation event is an operation event associated with a change to the source data stored in a source file system;
generating the synchronization tasks corresponding to the second operation events;
and executing the synchronization tasks, thereby synchronizing the source file system and the destination file system.
Unlike the prior art, which indirectly judges the metadata difference between the destination file system and the source file system by comparing data paths, this scheme directly monitors the first operation events on the metadata of the source data warehouse, i.e., it starts from the cause of the source-data change, achieving more accurate and efficient data synchronization. Source-data changes are captured, stored, and aggregated accurately and promptly, then synchronized to the destination file system, avoiding the time-consuming recursive traversal of data paths; monitoring and intercepting source-side changes improves the performance and efficiency of incremental synchronization and greatly reduces data-comparison overhead.
Optionally, each synchronization task corresponds to an operation event;
executing the synchronization tasks to achieve synchronization of the source file system and the destination file system comprises:
if a synchronization task targets the metadata of the source data warehouse, synchronizing that metadata to the destination data warehouse by executing the task;
and if a synchronization task targets source data, after executing the task, obtaining the source data corresponding to the task from the source file system and sending it to the destination file system.
According to this scheme, synchronization proceeds differently depending on whether a task targets the metadata of the source data warehouse or the source data itself: metadata tasks are executed directly to synchronize the metadata of the destination data warehouse, while data tasks additionally fetch the corresponding source data from the source file system and send it to the destination file system, improving data synchronization efficiency.
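The two execution paths just described can be sketched minimally as follows; the function name and the dict-based stand-ins for the file systems and the destination metastore are hypothetical illustrations, not part of the patent:

```python
def execute_sync_task(task, source_fs, dest_fs, dest_metastore):
    """Dispatch one synchronization task by what it targets (illustrative)."""
    if task["target"] == "metadata":
        # Metadata task: executing it synchronizes the metadata change
        # directly into the destination data warehouse.
        dest_metastore[task["object"]] = task["change"]
    else:
        # Data task: after execution, fetch the corresponding source data
        # from the source file system and send it to the destination.
        dest_fs[task["path"]] = source_fs[task["path"]]

source_fs = {"/warehouse/t1/p=1": b"rows"}
dest_fs, dest_metastore = {}, {}
execute_sync_task({"target": "metadata", "object": "t1",
                   "change": "ADD_PARTITION p=1"},
                  source_fs, dest_fs, dest_metastore)
execute_sync_task({"target": "data", "path": "/warehouse/t1/p=1"},
                  source_fs, dest_fs, dest_metastore)
```

The point of the split is that metadata tasks never touch file data, so a metadata-only change costs no data transfer at all.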
Optionally, generating each synchronization task corresponding to the second operation event includes:
determining a third operation event aiming at the same operation object from the second operation events;
processing each operation object according to the operation time and the operation type in the third operation event of the operation object to obtain a fourth operation event of the operation object;
and generating each synchronous task corresponding to the fourth operation event of each operation object.
According to the scheme, the classification and combination of the operation events are realized by processing according to the operation time and the operation type of the operation object, so that the number of the executed synchronous tasks is reduced, and the risk and hidden danger of event loss in the data stream transfer process are avoided.
Optionally, processing according to the operation time and the operation type in the third operation events of an operation object to obtain the fourth operation events of the operation object comprises:
for a deletion-type operation event, deleting the third operation events whose operation time precedes the deletion-type event; or
for a creation-type or modification-type operation event, merging the third operation events through parameter aggregation; or
for an insertion-type operation event: if the insert attribute is overwrite, deleting the third operation events whose operation time precedes the latest insertion event; if the insert attribute is non-overwrite, merging the third operation events.
According to this scheme, the number of synchronization tasks is reduced by removing invalidated operation events (those superseded by a deletion), merging operation events by parameter aggregation (for creation or modification types), and handling insertion-type events according to whether they are overwrite writes, which further reduces the frequency of metadata and data operations. The number of executed synchronization tasks is thus minimized, and the risk and hidden danger of event loss during data transfer are avoided.
Optionally, determining a second operation event from the first operation event includes:
determining a second operation event containing set metadata operation primitives from the first operation event, wherein the set metadata operation primitives at least comprise one of the following items:
creating a partition, modifying a partition, deleting a partition, creating a new table, modifying a table, deleting a table, and inserting data.
According to the scheme, the operation events are normalized through the metadata operation primitives, so that the subsequent synchronization task is more efficient.
Optionally, monitoring first operation events on the metadata of the source data warehouse comprises:
batching the monitored operation events on the metadata of the source data warehouse, thereby obtaining the first operation events of each batch.
According to the scheme, the operation events are divided into batches, and the efficiency of data synchronization is improved.
Optionally, after executing each synchronization task to synchronize the source file system and the destination file system, the method further includes:
and determining whether the source file system and the destination file system are synchronized or not in a path comparison mode.
According to the scheme, after data synchronization, the data are checked in a path comparison mode, and the synchronization accuracy is improved.
In a second aspect, an embodiment of the present application provides a data synchronization apparatus, comprising:
a monitoring module, configured to monitor first operation events on the metadata of a source data warehouse; and
a processing module, configured to: determine second operation events from the first operation events according to a preset rule, wherein a second operation event is an operation event associated with a change to the source data stored in a source file system;
generate the synchronization tasks corresponding to the second operation events;
and execute the synchronization tasks, thereby synchronizing the source file system and the destination file system.
Optionally, each synchronization task corresponds to an operation event;
the processing module is specifically configured to:
if a synchronization task targets the metadata of the source data warehouse, synchronize that metadata to the destination data warehouse by executing the task;
and if a synchronization task targets source data, after executing the task, obtain the source data corresponding to the task from the source file system and send it to the destination file system.
Optionally, the processing module specifically includes:
determining a third operation event aiming at the same operation object from the second operation events;
processing each operation object according to the operation time and the operation type in the third operation event of the operation object to obtain a fourth operation event of the operation object;
and generating each synchronous task corresponding to the fourth operation event of each operation object.
Optionally, the processing module is specifically configured to:
for a deletion-type operation event, delete the third operation events whose operation time precedes the deletion-type event; or
for a creation-type or modification-type operation event, merge the third operation events through parameter aggregation; or
for an insertion-type operation event: if the insert attribute is overwrite, delete the third operation events whose operation time precedes the latest insertion event; if the insert attribute is non-overwrite, merge the third operation events.
Optionally, the processing module specifically includes:
determining a second operation event containing set metadata operation primitives from the first operation event, wherein the set metadata operation primitives at least comprise one of the following items:
creating a partition, modifying a partition, deleting a partition, creating a new table, modifying a table, deleting a table, and inserting data.
Optionally, the monitoring module is specifically configured to:
batch the monitored operation events on the metadata of the source data warehouse, thereby obtaining the first operation events of each batch.
Optionally, after executing the synchronization tasks to synchronize the source file system and the destination file system, the processing module is further configured to:
and determining whether the source file system and the destination file system are synchronized or not in a path comparison mode.
Correspondingly, an embodiment of the present invention further provides a computing device, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing the data synchronization method according to the obtained program.
Accordingly, embodiments of the present invention also provide a computer-readable non-volatile storage medium, which includes computer-readable instructions, and when the computer-readable instructions are read and executed by a computer, the computer-readable instructions cause the computer to perform the above-mentioned data synchronization method.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of data synchronization according to an embodiment of the present invention;
fig. 2 is a system framework of a method for data synchronization according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for data synchronization according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating a method for data synchronization according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a data synchronization apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, some terms in the present application are explained so as to be understood by those skilled in the art.
Transport: a set of accurate, stable and efficient data synchronization tools based on a MapReduce computing framework supports TB-level and PB-level data synchronization of an offline data platform.
HMS (Hive Metastore): the HMS provides the services through which clients access all Hive metadata (Metadata), such as tables and partitions.
Event listening (Event Listener): HMS provides an Event subscription based notification mechanism, called Event Listener, for subscribing to and processing Event notification messages.
Full synchronization: copying historical data on a certain period, so that the stock data of the destination cluster stays consistent with that of the source cluster.
Incremental synchronization: copying the data newly added after the last update time or checkpoint (Checkpoint), so that the destination cluster stays consistent with the newly added data of the source cluster.
Before describing embodiments of the present invention, the prior art and the basic principles are explained as follows to better understand the present invention.
It should be noted that the present invention is mainly based on a Hadoop Distributed File System (HDFS).
HDFS is designed as a distributed file system suited to running on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but its differences from them are also clear: HDFS is a highly fault-tolerant system suitable for deployment on inexpensive machines, provides high-throughput data access, and is well suited to applications with large-scale data sets, enabling streaming reads of file system data.
Because querying files on HDFS directly requires writing a good deal of code by hand, Hive was created to solve this problem: based on a unified query-analysis layer, it queries, counts, and analyzes the data on HDFS through SQL statements.
Further, as shown in fig. 1, the relationship between Hive and Hadoop (MapReduce, HDFS) can be clearly seen.
Specifically, Hive is the top layer, i.e., the client or job-submission layer; MapReduce is the middle layer, i.e., the computation layer; and HDFS is the bottom layer, i.e., the storage layer. Hive stores its metadata in a database; this metadata includes table names, table partitions and their attributes, the directories where table data reside, and so on.
When a user issues a SQL statement, Hive converts its processing into MapReduce jobs, submits the tasks to Hadoop, and records the metadata in the database.
Based on this, in the prior-art data synchronization scheme based on HDFS data comparison, the difference between the Hive-table metadata of the source file system and that of the destination file system is indirectly judged during incremental synchronization by comparing the HDFS data paths corresponding to the Hive tables at both ends, and metadata is synchronized according to that difference.
Furthermore, during data migration, the source data (HDFS data) and metadata of the source file system are generally first synchronized in full to the destination file system, so that the two are basically consistent; newly added source data is then periodically synchronized to the destination file system through incremental synchronization, keeping the data at both ends dynamically consistent.
The main role of full synchronization is to greatly reduce the data difference between the source and destination file systems, creating the conditions for incremental synchronization to keep both ends essentially identical. Incremental synchronization generally performs task configuration and data comparison at the database level (or table level and partition level), and the comparison uses the HDFS path data corresponding to the Hive tables, including file size, directory count, file count, and the like.
However, in the prior art, on the one hand, every start of a library-level task configured for incremental synchronization must traverse and compare all HDFS data paths in the library's tables, which is inefficient; on the other hand, using HDFS path comparison as an indirect basis for judging the Hive-table metadata difference between the destination and source file systems carries a high probability of misjudgment, increasing the cost of the destination file system repeatedly synchronizing data and metadata.
Based on this, the embodiment of the present invention provides a method for data synchronization, and the method for data synchronization provided by the embodiment of the present invention may be applied to a system architecture as shown in fig. 2, where the system architecture includes a source file system 100, a destination file system 200, and a synchronization server 300.
The synchronization server 300 is configured to listen for first operation events on the metadata of a source data warehouse, determine second operation events from the first operation events, generate the synchronization tasks corresponding to the second operation events, and execute those tasks, thereby synchronizing the source file system 100 and the destination file system 200.
It should be noted that fig. 2 is only an example of the system architecture according to the embodiment of the present application, and the present application is not limited to this specifically. In practice, the synchronization server 300 may be a part of the source file system 100, implemented by various functions provided by the source file system 100 itself.
Based on the above illustrated system architecture, fig. 3 is a schematic flow chart corresponding to a data synchronization method provided in an embodiment of the present invention, as shown in fig. 3, the method includes:
at step 301, a first operational event for metadata of a source data repository is monitored.
Step 302, determining a second operation event from the first operation event according to a preset rule.
It should be noted that the second operation event refers to an operation event that is associated with a change in source data stored in the source file system.
Step 303, generating each synchronization task corresponding to the second operation event.
Step 304, each synchronization task is executed to achieve synchronization of the source and destination file systems.
Unlike the prior art, which indirectly judges the metadata difference between the destination file system and the source file system by comparing data paths, this scheme directly monitors the first operation events on the metadata of the source data warehouse, i.e., it starts from the cause of the source-data change, achieving more accurate and efficient data synchronization. Source-data changes are captured, stored, and aggregated accurately and promptly, then synchronized to the destination file system, avoiding the time-consuming recursive traversal of data paths; monitoring and intercepting source-side changes improves the performance and efficiency of incremental synchronization and greatly reduces data-comparison overhead.
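Steps 301 to 304 can be sketched as a small pipeline; all names and the callback-based structure below are illustrative assumptions, not the patent's implementation:

```python
def synchronize(first_events, is_source_data_change, make_tasks, run_task):
    """Illustrative end-to-end flow of steps 301-304."""
    # Step 302: keep only the operation events associated with a change
    # to the source data (the "second operation events").
    second_events = [e for e in first_events if is_source_data_change(e)]
    # Step 303: generate the synchronization tasks for those events.
    tasks = make_tasks(second_events)
    # Step 304: execute each synchronization task.
    for task in tasks:
        run_task(task)
    return tasks

executed = []
tasks = synchronize(
    ["INSERT t1", "GRANT role"],             # monitored first operation events
    lambda e: not e.startswith("GRANT"),     # preset rule (illustrative)
    lambda evs: [("sync", e) for e in evs],  # one task per event in this sketch
    executed.append,                         # stand-in for task execution
)
```

In the real scheme `make_tasks` would also perform the merging described below in step 303, producing fewer tasks than events.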
Specifically, in step 301, the monitored operation events for the metadata of the source data warehouse are batched, so as to obtain the first operation events of each batch.
It should be noted that the first operation event of each batch corresponds to a batch number.
In one possible embodiment, the operation events are batched by time division; for example, the operation events arriving in each 5-minute window form one batch.
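The time-window batching in this embodiment can be sketched as follows; the function name and the `(timestamp, event)` representation are hypothetical:

```python
from collections import defaultdict

def batch_events(timestamped_events, window_seconds=300):
    """Group listened metadata operation events into batches by a fixed
    time window (5 minutes here); the window index serves as the batch
    number that each batch corresponds to."""
    batches = defaultdict(list)
    for ts, event in timestamped_events:
        batches[int(ts // window_seconds)].append(event)
    return dict(batches)

batches = batch_events([(10, "CREATE_TABLE t"), (120, "INSERT t"),
                        (400, "ALTER_TABLE t")])
```

Events at 10 s and 120 s fall in window 0; the event at 400 s starts window 1.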
In step 302, for example, a preset rule is first enabled to determine the second operation events; optionally, the preset rule may be configured in the HMS for Hive.
It should be noted that the operation events include common operations such as creating a table (CREATE TABLE), dropping a table (DROP TABLE), creating a database (CREATE DATABASE), and dropping a database (DROP DATABASE).
Furthermore, source-side HMS metadata changes can be monitored and intercepted by listening to HMS operation events, which greatly reduces data-comparison overhead and improves the performance and efficiency of incremental synchronization. However, because the HMS is the carrier of the data warehouse's metadata, the daily operation of a big-data platform generates a very large number of HMS operations, statistically reaching millions to tens of millions. This gives rise to the following drawbacks:
On the one hand, replaying operation events at such a scale generates a large number of synchronization tasks; on the other hand, many HMS operation events are redundant or invalid, and at the same time the large volume of events risks being lost during data transfer, a risk that is hidden, hard to discover and monitor, and a serious hazard to daily production.
For this reason, the second operation events are determined from the first operation events in step 302, i.e., the first operation events are preprocessed, thereby overcoming the drawbacks above.
Specifically, a second operation event including a set metadata operation primitive is determined from the first operation event, where the set metadata operation primitive includes at least one of the following items:
creating a partition, modifying a partition, deleting a partition, creating a new table, modifying a table, deleting a table, and inserting data.
For example, 7 metadata operation primitives are defined in an event playback (Event PlayBack) module, corresponding one-to-one to real HMS operation events. The 7 basic metadata operation primitives are defined as shown in Table 1.
As can be seen, the 7 metadata operation primitives are ADD_PARTITION, ALTER_PARTITION, DROP_PARTITION, CREATE_TABLE, INSERT, DROP_TABLE, and ALTER_TABLE.
Note that an INSERT event inserts data into a table or a partition.
Table 1 (the table contents are provided as images in the original publication)
It should be noted that the above 7 metadata operation primitives are only an example, and this embodiment is not specifically limited to this.
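As a hedged sketch, the 7 primitives and the filtering of first operation events against them might be represented like this; the enum values and event dicts are illustrative, not the patent's actual encoding:

```python
from enum import Enum

class MetaOp(Enum):
    """The 7 metadata operation primitives listed above, each mapping
    one-to-one to a real HMS operation event."""
    ADD_PARTITION = "add partition"
    ALTER_PARTITION = "alter partition"
    DROP_PARTITION = "drop partition"
    CREATE_TABLE = "create table"
    INSERT = "insert data"
    DROP_TABLE = "drop table"
    ALTER_TABLE = "alter table"

PRIMITIVE_NAMES = {op.name for op in MetaOp}

def select_second_events(first_events):
    # Keep only events whose type matches one of the set primitives,
    # i.e. events associated with a change to the source data.
    return [e for e in first_events if e["type"] in PRIMITIVE_NAMES]

kept = select_second_events([{"type": "ADD_PARTITION"},
                             {"type": "GRANT_ROLE"}])
```

Events outside the primitive set (here a hypothetical `GRANT_ROLE`) do not change source data and are dropped before task generation.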
Further, step 303 proceeds as shown in fig. 4 and specifically includes the following steps:
in step 401, a third operation event for the same operation object is determined from the second operation events.
In step 402, for each operation object, processing is performed according to the operation time and the operation type in the third operation events of the operation object to obtain a fourth operation event of the operation object.
In step 403, each synchronization task corresponding to the fourth operation event of each operation object is generated.
Specifically, the operation events are preprocessed by merging and sorting according to table name, partition name and operation event list: the table name and partition name serve as the key-value index, and each operation event list is sorted internally by time information and kept in order.
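The merge-and-sort preprocessing described above can be sketched as follows (a minimal illustration, assuming each operation event is a dict with hypothetical `table`, `partition`, `type` and `time` keys; the partition name is `None` for non-partitioned tables):

```python
from collections import defaultdict


def group_and_sort(events):
    """Index operation events by (table name, partition name) and keep
    each per-object event list ordered by operation time, as described
    for the merge-and-sort preprocessing."""
    index = defaultdict(list)
    for ev in events:
        # The (table, partition) pair is the key-value index.
        index[(ev["table"], ev.get("partition"))].append(ev)
    for key in index:
        # Each operation event list stays internally sorted by time.
        index[key].sort(key=lambda ev: ev["time"])
    return index
```

Each bucket of the returned index then corresponds to one operation object of step 401.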
It should be noted that, in step 401, the same operation object may be the same table or the same partition, which is not specifically limited in this application.
Specifically, in step 402, for a deletion-type operation event, the third operation events whose operation time is before the deletion-type operation event are deleted.
For example, deletion-type operation events generally fall into the following two cases:
1. Partitioned table: for a deletion-type, i.e. DROP-type, event, a search is performed according to the table name index and the partition name index. If an identical event is found, e.g. two identical DROP_PARTITION A events, the events are merged, i.e. one of the two DROP_PARTITION A events is deleted and only one is retained.
Further, if no identical event is found, all events that have the same table name and partition name as the event and are positioned before the DROP event are deleted, and the DROP event is then written. For example, if the operation event stream is, in turn, ALTER_PARTITION A, CREATE_PARTITION B, DROP_PARTITION A, then when DROP_PARTITION A is detected, ALTER_PARTITION A, which has the same partition name and precedes the DROP_PARTITION A operation event, is found and deleted.
2. Non-partitioned table: for a DROP-type event, a search is performed according to the table name index. If identical events are found, e.g. three identical DROP_TABLE C events, the events are merged, i.e. two of the three DROP_TABLE C events are deleted and only one is retained.
If no identical event is found, all events that have the same table name as the event and are positioned before the DROP event are deleted, and the DROP event is then written. For example, if the operation event stream is, in turn, ALTER_TABLE C, CREATE_TABLE D, DROP_TABLE C, then when DROP_TABLE C is detected, ALTER_TABLE C, which has the same table name and precedes the DROP_TABLE C operation event, is found and deleted.
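The DROP-event handling of both cases can be sketched as follows (a minimal illustration over a time-ordered event stream; the function name and dict keys are hypothetical, and matching on `(table, partition)` covers both the partitioned and non-partitioned cases, since the partition name is `None` for the latter):

```python
def apply_drop(stream, drop_event):
    """Merge a DROP-type event into a time-ordered operation event stream:
    every earlier event on the same table/partition is deleted (duplicate
    DROPs included), then the DROP event is written."""
    key = (drop_event["table"], drop_event.get("partition"))
    # Delete all earlier events with the same table name and partition name.
    kept = [ev for ev in stream
            if (ev["table"], ev.get("partition")) != key]
    kept.append(drop_event)  # write the DROP event
    return kept
```

Applied to the example above, the stream ALTER_PARTITION A, CREATE_PARTITION B, DROP_PARTITION A collapses to CREATE_PARTITION B, DROP_PARTITION A.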
Specifically, for newly-created or modified types of operation events, the third operation events are merged by parameter aggregation.
For example, for CREATE- and ALTER-type events, events of the same type are merged by parameter aggregation according to table name and partition name.
It should be noted that the table types include partitioned tables and non-partitioned tables.
Specifically, a HiveMetastoreEventFactory class is implemented in the parameter aggregation module to deduplicate and combine the parameters of the various CREATE and ALTER operations, finally forming several large operation events.
For example, suppose the operation event stream contains, in turn, a first ALTER_TABLE A that modifies parameter a into b, a second that modifies b into c, a third that modifies c into d, and a fourth that also modifies c into d. The third and fourth events are duplicates, so one of them is deleted first; the remaining parameter changes are then merged, and the final operation event modifies parameter a into d.
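The deduplicate-then-merge behaviour of the parameter aggregation described above can be sketched as follows (an illustrative simplification in which each ALTER event on one object carries only an `old`/`new` parameter pair; the real HiveMetastoreEventFactory aggregates richer parameters):

```python
def merge_alters(alters):
    """Aggregate the parameter changes of a time-ordered list of ALTER
    events on one object into a single large operation event: adjacent
    duplicate events are deleted first, then the chain of changes is
    collapsed to (first old value -> last new value)."""
    deduped = []
    for ev in alters:
        if not deduped or ev != deduped[-1]:  # drop repeated identical events
            deduped.append(ev)
    # Merge the remaining changes into one event.
    return {"old": deduped[0]["old"], "new": deduped[-1]["new"]}
```

With the example stream a→b, b→c, c→d, c→d, the duplicate c→d is dropped and the merged event modifies a into d.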
Further, for an insertion-type operation event, if the insert attribute is overwrite, the third operation events whose operation time is before the latest insert event are deleted; if the insert attribute is non-overwrite, the third operation events are merged.
For example, INSERT-type events are merged according to table name and partition name. INSERT-type events are divided into partitioned-table and non-partitioned-table cases, and into overwrite writes and non-overwrite writes.
Specifically, for overwrite writes, the INSERT operation events are merged according to table name and partition name. For example, the operation event stream contains, in turn, a first INSERT_TABLE A with insert parameter a, a second INSERT_TABLE A with insert parameter b, and a third INSERT_TABLE A with insert parameter c; after merging, a single INSERT operation event INSERT_TABLE A with insert parameter c remains.
Further, for non-overwrite writes, duplicate table names or partition names among the INSERT events are removed and the parameters are recorded, eventually yielding several large operation events.
For example, the operation event stream contains, in turn, a first INSERT_TABLE E with insert parameter e, a second INSERT_TABLE E with insert parameter f, and a third INSERT_TABLE F with insert parameter g; after merging, the operation event INSERT_TABLE E carries insert parameters e and f, and the operation event INSERT_TABLE F carries insert parameter g.
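The two INSERT merging modes can be sketched as follows (a minimal illustration; events are dicts with hypothetical `table`, `partition` and `params` keys, and the overwrite case assumes all events in the list target the same object):

```python
def merge_inserts(inserts, overwrite):
    """Merge a time-ordered list of INSERT events. With overwrite
    semantics only the latest event matters, so earlier ones are
    deleted; with non-overwrite (append) semantics, duplicate
    table/partition names are removed and the parameters of all
    events on each object are recorded in one large event."""
    if overwrite:
        return [inserts[-1]]  # keep only the latest insert event
    merged = {}
    for ev in inserts:
        key = (ev["table"], ev.get("partition"))
        merged.setdefault(key, {"table": ev["table"],
                                "partition": ev.get("partition"),
                                "params": []})
        merged[key]["params"].extend(ev["params"])
    return list(merged.values())
```

With the non-overwrite example above, INSERT_TABLE E (e), INSERT_TABLE E (f) and INSERT_TABLE F (g) merge into one event for E carrying e and f, and one event for F carrying g.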
According to this scheme, the number of synchronization tasks is reduced by removing invalid operation events (e.g. those made obsolete by deletion-type events), merging operation events of the newly-created or modified types according to their parameters, and combining insertion-type operation events according to whether they are overwrite writes, thereby further reducing the operation frequency on metadata and data. The number of synchronization tasks to execute is thus minimized, and the risks and hidden dangers of event loss during data stream transfer are avoided.
Further, after the above process is completed, an optimized metadata operation primitive sequence is formed, and a corresponding task to be synchronized is then generated for each operation event in the metadata operation primitive sequence.
Further, in step 204, each synchronization task corresponds to one operation event.
Specifically, if the synchronization task is directed at the metadata of the source data warehouse, the metadata of the source data warehouse is synchronized to the destination data warehouse by executing the synchronization task;
and if the synchronization task is directed at the source data, then when the synchronization task is executed, the source data corresponding to the synchronization task is obtained from the source file system and sent to the destination file system.
Following the above, for example, for the DROP-type operation event DROP_TABLE C, since it is directed at the metadata of the source data warehouse, DROP_TABLE C is executed directly at the destination.
For another example, for the INSERT-type operation event INSERT TABLE B, table B is obtained from the source file system and sent to the destination file system for storage.
According to this scheme, the synchronization task is handled differently depending on whether it is directed at the metadata of the source data warehouse or at the source data; metadata tasks synchronize the metadata of the destination data warehouse simply by being executed, which improves the efficiency of data synchronization.
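The two execution paths of a synchronization task can be sketched as follows (a toy illustration in which the file systems are plain dicts and applied metadata operations are collected in a log list; all names and the `kind` field are hypothetical):

```python
def execute_sync_task(task, source_fs, dest_fs, dest_meta_log):
    """Execute one synchronization task (step 204 sketch).
    Metadata-directed tasks (e.g. DROP_TABLE) are replayed directly at
    the destination; data-directed tasks (e.g. INSERT) additionally copy
    the underlying data from the source file system to the destination
    file system before the metadata operation is applied."""
    if task["kind"] == "metadata":
        dest_meta_log.append(task["event"])  # replay metadata op directly
    else:  # data-directed task
        dest_fs[task["path"]] = source_fs[task["path"]]  # copy source data
        dest_meta_log.append(task["event"])
```

A DROP_TABLE C task therefore touches no file data, while an INSERT TABLE B task first moves table B's data across file systems.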
In the embodiment of the application, the metadata operation primitive sequence serves as the input parameter for the job configuration and for starting the synchronization tasks. In one possible implementation, a SyncConf structure is constructed from each synchronization task's input parameters, a synchronization batch number is then generated, the batch number and the SyncConf structure are submitted to a generateTask() function to automatically generate the synchronization task, and finally the generated synchronization tasks are periodically executed by the TaskExecuteCondtaPb scheduler of the Transport dispatch module.
In the embodiment of the present application, after step 204, whether the source file system and the destination file system are synchronized is determined by way of path comparison.
According to this scheme, to better ensure the accuracy of data synchronization, the synchronization is checked by path comparison after the synchronization tasks are executed, which reduces the risk of errors arising in the data synchronization process.
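The path-comparison check can be sketched as follows (a minimal illustration over path sets; the report fields are hypothetical):

```python
def verify_sync(source_paths, dest_paths):
    """Post-synchronization check sketch: compare the path sets of the
    source and destination file systems and report anything missing on
    either side, so synchronization errors can be detected."""
    source_paths, dest_paths = set(source_paths), set(dest_paths)
    return {"in_sync": source_paths == dest_paths,
            "missing_on_dest": sorted(source_paths - dest_paths),
            "extra_on_dest": sorted(dest_paths - source_paths)}
```

Any path present on only one side indicates that a synchronization task was lost or applied incorrectly.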
Based on the same inventive concept, fig. 5 exemplarily illustrates a data synchronization apparatus provided in an embodiment of the present invention, which may execute the flow of the data synchronization method described above.
The apparatus comprises:
a monitoring module 501, configured to monitor a first operation event of metadata for a source data warehouse;
a processing module 502, configured to determine, according to a preset rule, a second operation event from the first operation event, where the second operation event is an operation event that is associated with a change in source data stored in a source file system;
generating each synchronous task corresponding to the second operation event;
and executing the synchronization tasks so as to realize the synchronization of the source file system and the destination file system.
Optionally, each synchronization task corresponds to an operation event;
the processing module 502 is specifically configured to:
if the synchronization task is directed to the metadata of the source data warehouse, the metadata of the source data warehouse is synchronized to the destination data warehouse by executing the synchronization task;
and if the synchronization task is directed at source data, after the synchronization task is executed, acquiring the source data corresponding to the synchronization task from the source file system and sending the source data to a destination file system.
Optionally, the processing module 502 is specifically configured to:
determining a third operation event aiming at the same operation object from the second operation events;
processing each operation object according to the operation time and the operation type in the third operation event of the operation object to obtain a fourth operation event of the operation object;
and generating each synchronous task corresponding to the fourth operation event of each operation object.
Optionally, the processing module 502 is specifically configured to:
for a deletion-type operation event, deleting the third operation events whose operation time is before the deletion-type operation event; or
for a newly-created or modified type of operation event, combining the third operation events through parameter aggregation; or
for an insertion-type operation event, if the insert attribute is overwrite, deleting the third operation events whose operation time is before the latest insert event; and if the insert attribute is non-overwrite, merging the third operation events.
Optionally, the processing module 502 is specifically configured to:
determining a second operation event containing set metadata operation primitives from the first operation event, wherein the set metadata operation primitives at least comprise one of the following items:
creating a partition, modifying a partition, deleting a partition, creating a new table, modifying a table, deleting a table, and inserting data.
Optionally, the monitoring module 501 is specifically configured to:
batch the monitored operation events for the metadata of the source data warehouse, thereby obtaining the first operation events of each batch.
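The batching performed by the monitoring module can be sketched as follows (a minimal illustration; `batch_size` is an assumed tuning parameter not specified in the application):

```python
def batch_events(events, batch_size):
    """Group monitored metadata operation events into fixed-size batches;
    each batch becomes one set of first operation events handed to the
    downstream preprocessing."""
    return [events[i:i + batch_size]
            for i in range(0, len(events), batch_size)]
```

Batching bounds how many events each preprocessing and playback round has to handle at once.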
Optionally, after the synchronization tasks are executed to synchronize the source file system and the destination file system, the processing module 502 is further configured to:
and determining whether the source file system and the destination file system are synchronized or not in a path comparison mode.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A data synchronization method, characterized in that the method is applicable to a distributed file system provided with a data warehouse; the method comprises the following steps:
monitoring a first operation event aiming at metadata of a source data warehouse;
determining a second operation event from the first operation events according to a preset rule, wherein the second operation event refers to an operation event associated with a change in the source data stored in a source file system;
generating each synchronous task corresponding to the second operation event;
and executing the synchronization tasks so as to realize the synchronization of the source file system and the destination file system.
2. The method of claim 1, wherein each synchronization task corresponds to an operational event;
executing the synchronization tasks to achieve synchronization of the source file system and the destination file system, including:
if the synchronization task is directed to the metadata of the source data warehouse, the metadata of the source data warehouse is synchronized to the destination data warehouse by executing the synchronization task;
and if the synchronization task is directed at source data, after the synchronization task is executed, acquiring the source data corresponding to the synchronization task from the source file system and sending the source data to a destination file system.
3. The method of claim 1, wherein generating each synchronization task corresponding to the second operational event comprises:
determining a third operation event aiming at the same operation object from the second operation events;
processing each operation object according to the operation time and the operation type in the third operation event of the operation object to obtain a fourth operation event of the operation object;
and generating each synchronous task corresponding to the fourth operation event of each operation object.
4. The method of claim 3, wherein performing processing according to the operation time and the operation type in the third operation event of the operation object to obtain a fourth operation event of the operation object comprises:
for a deletion-type operation event, deleting the third operation events whose operation time is before the deletion-type operation event; or
for a newly-created or modified type of operation event, combining the third operation events through parameter aggregation; or
for an insertion-type operation event, if the insert attribute is overwrite, deleting the third operation events whose operation time is before the latest insert event; and if the insert attribute is non-overwrite, merging the third operation events.
5. The method of any of claims 1 to 4, wherein determining a second operational event from the first operational event comprises:
determining a second operation event containing set metadata operation primitives from the first operation event, wherein the set metadata operation primitives at least comprise one of the following items:
creating a partition, modifying a partition, deleting a partition, creating a new table, modifying a table, deleting a table, and inserting data.
6. The method of claim 1, wherein listening for a first operation event of metadata for a source data warehouse comprises:
batching the monitored operation events for the metadata of the source data warehouse, thereby obtaining the first operation events of each batch.
7. The method of claim 1, wherein after executing the synchronization tasks to achieve synchronization of the source file system with the destination file system, the method further comprises:
and determining whether the source file system and the destination file system are synchronized or not in a path comparison mode.
8. An apparatus for data synchronization, comprising:
the monitoring module is used for monitoring a first operation event aiming at the metadata of the source data warehouse;
the processing module is used for determining a second operation event from the first operation event, wherein the second operation event refers to an operation event related to the change of source data stored in a source file system;
generating each synchronous task corresponding to the second operation event;
and executing the synchronization tasks so as to realize the synchronization of the source file system and the destination file system.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to perform the method of any of claims 1 to 7 in accordance with the obtained program.
10. A computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202010615069.0A 2020-06-30 2020-06-30 Data synchronization method and device Pending CN111680017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010615069.0A CN111680017A (en) 2020-06-30 2020-06-30 Data synchronization method and device


Publications (1)

Publication Number Publication Date
CN111680017A true CN111680017A (en) 2020-09-18

Family

ID=72437457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010615069.0A Pending CN111680017A (en) 2020-06-30 2020-06-30 Data synchronization method and device

Country Status (1)

Country Link
CN (1) CN111680017A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506870A (en) * 2020-12-18 2021-03-16 上海哔哩哔哩科技有限公司 Data warehouse increment updating method and device and computer equipment
CN112948486A (en) * 2021-02-04 2021-06-11 北京淇瑀信息科技有限公司 Batch data synchronization method and system and electronic equipment
CN113407634A (en) * 2021-07-05 2021-09-17 挂号网(杭州)科技有限公司 Data synchronization method, device, system, server and storage medium
CN115328997A (en) * 2022-07-15 2022-11-11 深圳市数帝网络科技有限公司 Data synchronization method, system, device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination