CN105069128A

CN105069128A - Data synchronization method and apparatus

Info

Publication number: CN105069128A
Application number: CN201510500106.2A
Authority: CN
Inventors: 杨泽森
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2015-08-14
Filing date: 2015-08-14
Publication date: 2015-11-18
Anticipated expiration: 2035-08-14
Also published as: CN105069128B

Abstract

The present invention provides a data synchronization method and apparatus that can keep data consistent within a Hadoop cluster or between Hadoop clusters and that are simple and easy to implement. The data synchronization method comprises: determining the data synchronization type of a Hadoop cluster, wherein the data synchronization type includes intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; executing, according to the determination result, a preselected data quality check task corresponding to that result; and, when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.

Description

Data synchronization method and apparatus

Technical field

The present invention relates to the field of computer technology, and in particular to a data synchronization method and apparatus.

Background art

At present, many large IT enterprises have launched services such as cloud platforms, big data platforms, cloud computing, cloud storage, and data marts to enable data sharing and data transfer between different enterprises or between the business departments within an enterprise. Fig. 1 shows the process of synchronizing data within a cluster and between clusters in environments such as big data platforms and cloud services. However, because no data quality check is performed after synchronization, data inconsistency may arise. Specifically, when the data on the data sharing side changes, the data already synchronized to the data subscriber becomes inconsistent with the data sharing side. The data subscriber has difficulty noticing that the sharing side's data has changed, and by the time the change is discovered it has often caused serious losses. After an inconsistency occurs, the data subscriber does not obtain the latest data again in time, which leads to cascading data errors on the subscriber side. In summary, once data inconsistency occurs, the prior art lacks a timely notification mechanism and also lacks a timely, automated, and intelligent mechanism for handling data differences, which results in heavy losses.

Summary of the invention

In view of this, the present invention provides a data synchronization method and apparatus that can keep data consistent within a Hadoop cluster or between Hadoop clusters, with advantages such as being simple and easy to implement.

To achieve the above object, according to one aspect of the present invention, a data synchronization method is provided. The data synchronization method of the present invention comprises: determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; executing, according to the determination result, a preselected data quality check task corresponding to that result; and, when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.

Optionally, the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises: when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files at the data source and at the data target of the data copy task between different users in the same cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, checking whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files corresponding to the data source cluster and to the data target cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouses corresponding to the data source cluster and the data target cluster is consistent.

Optionally, after executing, according to the determination result, the preselected data quality check task corresponding to that result, the method further comprises: saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time.

Optionally, the data quality check task is executed at a predetermined period.

To achieve the above object, according to another aspect of the present invention, a data synchronization apparatus is provided. The data synchronization apparatus of the present invention comprises: a determination module, configured to determine the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; a checking module, configured to execute, according to the determination result, a preselected data quality check task corresponding to that result; and a synchronization module, configured to re-execute the most recent data synchronization task when the checking module finds a data inconsistency.

Optionally, the checking module is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the data copy task between different users in the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files corresponding to the data source cluster and to the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses corresponding to the data source cluster and the data target cluster is consistent.

Optionally, the apparatus further comprises: a storage module, configured to save the data quality result obtained by the checking module, the data quality result including one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time.

Optionally, the checking module is further configured to execute the data quality check task at a predetermined period.

According to the technical solution of the present invention, the data synchronization type is determined first, the corresponding data quality check task is then executed, and data synchronization is performed if a data inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, with advantages such as being simple and easy to implement.

Brief description of the drawings

The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the process of data synchronization within a cluster and between clusters;

Fig. 2 is a schematic diagram of the basic steps of the data synchronization method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the main modules of the data synchronization apparatus according to an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

Fig. 2 is a schematic diagram of the basic steps of the data synchronization method according to an embodiment of the present invention. As shown in Fig. 2, the data synchronization method may comprise the following steps S21 to S23.

Step S21: determine the data synchronization type of the Hadoop cluster. The data synchronization type includes intra-cluster data copy (ClusterInternalDataCopy), intra-cluster data address sharing (ClusterInternalDataAddressSharing), and inter-cluster data copy (InterClusterDataCopy).
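Purely as an illustration, step S21 could be sketched in Python as follows. The enumeration values reuse the three identifiers named above; the `SyncTask` record, its fields, and the rule used to tell the types apart (comparing cluster names and HDFS locations) are assumptions made for this sketch and are not specified by the embodiment.

```python
from dataclasses import dataclass
from enum import Enum


class SyncType(Enum):
    # Values reuse the identifiers given in step S21.
    CLUSTER_INTERNAL_DATA_COPY = "ClusterInternalDataCopy"
    CLUSTER_INTERNAL_DATA_ADDRESS_SHARING = "ClusterInternalDataAddressSharing"
    INTER_CLUSTER_DATA_COPY = "InterClusterDataCopy"


@dataclass(frozen=True)
class SyncTask:
    # Hypothetical description of a synchronization task; all fields are assumptions.
    source_cluster: str
    target_cluster: str
    source_location: str  # HDFS location of the source Hive table
    target_location: str  # HDFS location of the target Hive table


def judge_sync_type(task: SyncTask) -> SyncType:
    """Classify a task into one of the three synchronization types."""
    if task.source_cluster != task.target_cluster:
        return SyncType.INTER_CLUSTER_DATA_COPY
    if task.source_location == task.target_location:
        # Both users' tables point at the same HDFS files.
        return SyncType.CLUSTER_INTERNAL_DATA_ADDRESS_SHARING
    return SyncType.CLUSTER_INTERNAL_DATA_COPY
```

In a real deployment the type might equally be recorded when the synchronization task is created rather than inferred afterwards.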

Intra-cluster data copy means that, within the same cluster, data is shared between different users by copying the HDFS (Hadoop Distributed File System) data files of a Hive table.

Intra-cluster data address sharing means that, within the same cluster, the location of the data files of one user's Hive table is shared with another user in the cluster, so that multiple users can use the same copy of the data. This is done by granting another user access to the Location address of one user's Hive table. In this mode, different users share a single set of HDFS data files. For data security, users' operation permissions need to be differentiated: the data publisher has the highest privileges, while subscribers have only read permission.

Inter-cluster data copy means copying data between different clusters, using the distcp (distributed copy) program to copy HDFS data files between Hadoop clusters in parallel.
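As a non-limiting illustration, an inter-cluster copy could be launched by assembling a `hadoop distcp` command line. The helper name `build_distcp_command` and the NameNode URIs in the usage note are hypothetical; `-update` is a standard DistCp option that copies only files that are missing or differ at the target.

```python
from typing import List


def build_distcp_command(source_uri: str, target_uri: str, update: bool = False) -> List[str]:
    """Build the argument list for a DistCp copy from source_uri to target_uri."""
    cmd = ["hadoop", "distcp"]
    if update:
        # Only copy files that are missing or different at the target.
        cmd.append("-update")
    cmd.extend([source_uri, target_uri])
    return cmd
```

For example, `build_distcp_command("hdfs://nn1:8020/warehouse/t", "hdfs://nn2:8020/warehouse/t")` yields an argument list that could be passed to `subprocess.run` on a host where the Hadoop client is installed.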

Step S22: execute, according to the determination result, the preselected data quality check task corresponding to that result. The specific process may be as follows:

When the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the data copy task between different users in the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent.

When the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent.

When the determination result is inter-cluster data copy, check whether the number and size of the HDFS files corresponding to the data source cluster and to the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses corresponding to the data source cluster and the data target cluster is consistent.
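The three cases of step S22 can be sketched as a single dispatch function. This is a minimal illustration under stated assumptions: the `Snapshot` record is hypothetical and stands for file sizes and table metadata that have already been collected from HDFS and the Hive metastore, and "consistent file size" is interpreted here as equal total bytes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class SyncType(Enum):
    CLUSTER_INTERNAL_DATA_COPY = "ClusterInternalDataCopy"
    CLUSTER_INTERNAL_DATA_ADDRESS_SHARING = "ClusterInternalDataAddressSharing"
    INTER_CLUSTER_DATA_COPY = "InterClusterDataCopy"


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, pre-collected view of one side (source or target).
    file_sizes: List[int]           # size in bytes of each HDFS file
    table_metadata: Dict[str, str]  # metadata read from the Hive metastore


def files_consistent(source: Snapshot, target: Snapshot) -> bool:
    """Compare the number of HDFS files and their total size."""
    return (len(source.file_sizes) == len(target.file_sizes)
            and sum(source.file_sizes) == sum(target.file_sizes))


def metadata_consistent(source: Snapshot, target: Snapshot) -> bool:
    """Compare the Hive metadata of the two sides."""
    return source.table_metadata == target.table_metadata


def check_consistency(sync_type: SyncType, source: Snapshot, target: Snapshot) -> bool:
    """Run the check that corresponds to the synchronization type."""
    if sync_type is SyncType.CLUSTER_INTERNAL_DATA_ADDRESS_SHARING:
        # The files are shared, so only the metadata is compared.
        return metadata_consistent(source, target)
    # Both copy types compare the HDFS files and the metadata.
    return files_consistent(source, target) and metadata_consistent(source, target)
```

One possible way to gather the file figures is the output of `hdfs dfs -count`, which reports the directory count, file count, and content size of a path.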

Optionally, after executing, according to the determination result, the preselected data quality check task corresponding to that result, the method further comprises: saving the data quality result obtained by executing the data quality check task. The data quality result includes one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time. Recording the data quality result provides a data basis for analysis and decision-making by the personnel who manage the cluster system.
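A saved data quality result might be represented as a simple record, sketched below. The class and field names are hypothetical, and the two "execution time" items are assumed here to be timestamps; they could equally be durations.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass(frozen=True)
class DataQualityResult:
    # One field per item listed in the paragraph above; names are assumptions.
    sync_task_id: str
    check_task_id: str
    source_file_size: int  # bytes
    target_file_size: int  # bytes
    sync_task_executed_at: datetime
    check_task_executed_at: datetime

    @property
    def sizes_match(self) -> bool:
        """True when source and target report the same size."""
        return self.source_file_size == self.target_file_size


def to_row(result: DataQualityResult) -> Dict[str, Any]:
    """Flatten a result into a dict, for example for insertion into a results table."""
    return asdict(result)
```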

Optionally, the data quality check task is executed at a predetermined period. In this way, data inconsistencies can be eliminated on a regular basis.
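Periodic execution could be sketched as a bounded loop, shown below. The function name, its parameters, and the injectable `sleep` argument are assumptions for illustration; a production system would more likely rely on an existing job scheduler.

```python
import time
from typing import Callable


def run_periodically(check: Callable[[], bool],
                     resync: Callable[[], None],
                     period_seconds: float,
                     max_runs: int,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `check` every `period_seconds`; call `resync` whenever it returns False.

    Returns the number of times a resynchronization was triggered.
    """
    resyncs = 0
    for run in range(max_runs):
        if not check():
            resync()
            resyncs += 1
        if run < max_runs - 1:
            # Wait for the next period, except after the final run.
            sleep(period_seconds)
    return resyncs
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.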

Step S23: when a data inconsistency is found while executing the data quality check task, re-execute the most recent data synchronization task.

It should be noted that the point of "re-executing the most recent data synchronization task" is to perform the data synchronization action once more, using the data source and data target of the most recent synchronization action.
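One way to picture step S23 is a small helper that remembers the source and target of the most recent synchronization and replays it on demand. The class and method names below are hypothetical, and the actual copy is delegated to a caller-supplied function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SyncAction:
    source: str  # data source of the synchronization
    target: str  # data target of the synchronization


class SyncHistory:
    """Remembers the most recent synchronization so that it can be repeated."""

    def __init__(self, run_sync: Callable[[SyncAction], None]) -> None:
        self._run_sync = run_sync
        self._last: Optional[SyncAction] = None

    def synchronize(self, action: SyncAction) -> None:
        """Perform a synchronization and record it as the most recent one."""
        self._run_sync(action)
        self._last = action

    def resync_if_inconsistent(self, consistent: bool) -> bool:
        """Repeat the most recent synchronization when the check failed."""
        if consistent or self._last is None:
            return False
        self._run_sync(self._last)
        return True
```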

As can be seen from the above description, the data synchronization method of the present invention first determines the data synchronization type, then executes the corresponding data quality check task, and performs data synchronization if a data inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, with advantages such as being simple and easy to implement.

Fig. 3 is a schematic diagram of the main modules of the data synchronization apparatus according to an embodiment of the present invention. As shown in Fig. 3, the data synchronization apparatus 30 of the present invention may comprise a determination module 31, a checking module 32, and a synchronization module 33. The determination module 31 is configured to determine the data synchronization type of a Hadoop cluster. The data synchronization type includes intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy. The checking module 32 is configured to execute, according to the determination result, the preselected data quality check task corresponding to that result. The synchronization module 33 is configured to re-execute the most recent data synchronization task when the checking module 32 finds a data inconsistency.

Optionally, the checking module 32 is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the data copy task between different users in the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files corresponding to the data source cluster and to the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses corresponding to the data source cluster and the data target cluster is consistent.

Optionally, the data synchronization apparatus 30 further comprises a storage module (not shown in Fig. 3). The storage module is configured to save the data quality result obtained by the checking module 32. The data quality result may include one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time. Recording the data quality result provides a data basis for analysis and decision-making by the personnel who manage the cluster system.

Optionally, the checking module 32 is further configured to execute the data quality check task at a predetermined period. In this way, data inconsistencies can be eliminated on a regular basis.

As can be seen from the above description, the data synchronization apparatus of the present invention first determines the data synchronization type, then executes the corresponding data quality check task, and performs data synchronization if a data inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, with advantages such as being simple and easy to implement.

The above specific embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations, and substitutions may occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A data synchronization method, characterized by comprising:

determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy;

executing, according to the determination result, a preselected data quality check task corresponding to that result; and

when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.

2. The data synchronization method according to claim 1, characterized in that the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises:

when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files at the data source and at the data target of the data copy task between different users in the same cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouse is consistent;

when the determination result is intra-cluster address sharing, checking whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and

when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files corresponding to the data source cluster and to the data target cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouses corresponding to the data source cluster and the data target cluster is consistent.

3. The data synchronization method according to claim 1, characterized in that, after executing, according to the determination result, the preselected data quality check task corresponding to that result, the method further comprises:

saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time.

4. The data synchronization method according to claim 1, characterized in that the data quality check task is executed at a predetermined period.

5. A data synchronization apparatus, characterized by comprising:

a determination module, configured to determine the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy;

a checking module, configured to execute, according to the determination result, a preselected data quality check task corresponding to that result; and

a synchronization module, configured to re-execute the most recent data synchronization task when the checking module finds a data inconsistency.

6. The data synchronization apparatus according to claim 5, characterized in that the checking module is further configured to:

7. The data synchronization apparatus according to claim 5, characterized by further comprising:

a storage module, configured to save the data quality result obtained by the checking module, the data quality result including one or more of the following: the data synchronization task identifier, the data quality check task identifier, the data source file size, the target data file size, the data synchronization task execution time, and the data check task execution time.

8. The data synchronization apparatus according to claim 5, characterized in that the checking module is further configured to execute the data quality check task at a predetermined period.