CN105069128A - Data synchronization method and apparatus - Google Patents

Info

Publication number
CN105069128A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510500106.2A
Other languages
Chinese (zh)
Other versions
CN105069128B (en)
Inventor
杨泽森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-08-14
Filing date
2015-08-14
Publication date
2015-11-18
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201510500106.2A
Publication of CN105069128A
Application granted
Publication of CN105069128B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183 Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems

Abstract

The present invention provides a data synchronization method and apparatus that can keep data consistent within a Hadoop cluster or between Hadoop clusters, and that are simple to implement. The data synchronization method comprises: determining the data synchronization type of a Hadoop cluster, wherein the data synchronization type includes data copying within a cluster, address sharing within a cluster, and data copying between clusters; executing, according to the determination result, the preselected data quality check task corresponding to that result; and, when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.

Description

Data synchronization method and apparatus
Technical field
The present invention relates to the field of computer technology, and in particular to a data synchronization method and apparatus.
Background
Many large IT enterprises have launched services such as cloud platforms, big data platforms, cloud computing, cloud storage and data marts in order to share and transfer data between enterprises or between the business departments of a single enterprise. Fig. 1 shows how data is synchronized within a cluster and between clusters in environments such as big data platforms and cloud services. However, no data quality check is performed after synchronization, so the data may become inconsistent. Specifically, when the data on the sharing side changes, the data that the subscribing side has synchronized no longer matches it. The subscribing side has difficulty noticing that the sharing side's data has changed, and by the time the problem is discovered it has often caused heavy losses. After an inconsistency arises, the subscribing side does not obtain the latest data in time, which causes a chain of data errors on the subscribing side. In summary, once data becomes inconsistent, the prior art lacks a timely notification mechanism and also lacks a timely, automated and intelligent mechanism for handling data differences, which leads to heavy losses.
Summary of the invention
In view of this, the present invention provides a data synchronization method and apparatus that can keep data consistent within a Hadoop cluster or between Hadoop clusters, and that are simple to implement.
To achieve the above object, according to one aspect of the present invention, a data synchronization method is provided. The data synchronization method of the present invention comprises: determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing and inter-cluster data copy; executing, according to the determination result, the preselected data quality check task corresponding to that result; and, when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.
Optionally, the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises: when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also checking whether the metadata of the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, checking whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent; when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also checking whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises: saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
Optionally, the data quality check task is executed at a predetermined period.
To achieve the above object, according to another aspect of the present invention, a data synchronization apparatus is provided. The data synchronization apparatus of the present invention comprises: a determination module, configured to determine the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing and inter-cluster data copy; a check module, configured to execute, according to the determination result, the preselected data quality check task corresponding to that result; and a synchronization module, configured to re-execute the most recent data synchronization task when the check module finds a data inconsistency.
Optionally, the check module is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also check whether the metadata of the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent; when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, the apparatus further comprises: a saving module, configured to save the data quality result obtained by the check module, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
Optionally, the check module is further configured to execute the data quality check task at a predetermined period.
According to the technical solution of the present invention, the data synchronization type is determined first, the corresponding data quality check task is then executed, and the data is synchronized again if an inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, and is simple to implement.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of data synchronization within a cluster and between clusters;
Fig. 2 is a schematic diagram of the basic steps of the data synchronization method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of the data synchronization apparatus according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described below with reference to the drawings. The description includes various details of the embodiments to aid understanding, and these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.
Fig. 2 is a schematic diagram of the basic steps of the data synchronization method according to an embodiment of the present invention. As shown in Fig. 2, the data synchronization method may comprise the following steps S21 to S23.
Step S21: Determine the data synchronization type of the Hadoop cluster. The data synchronization type includes intra-cluster data copy (ClusterInternalDataCopy), intra-cluster data address sharing (ClusterInternalDataAddressSharing) and inter-cluster data copy (InterClusterDataCopy).
Intra-cluster data copy means that different users within the same cluster share data by copying the HDFS (Hadoop Distributed File System) data files of a Hive table.
Intra-cluster data address sharing means that the location of a data file in Hive under one user is shared with another user of the same cluster, so that multiple users can use the same data. This is done by granting another user access to the Location address of one user's Hive table. In this mode, the different users share a single HDFS data file. For data security, the users' operation permissions must be differentiated: the data publisher has the highest permissions, while a subscriber has read-only permission.
Inter-cluster data copy means copying data between different clusters, using the distcp (distributed copy) program to copy HDFS data files in parallel on Hadoop.
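The three synchronization types of step S21 can be illustrated with a minimal Python sketch. The source does not state how the type is determined, so the rule below (compare cluster names first, then table locations) is only an assumption made for illustration, and `TableRef` and `judge_sync_type` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class SyncType(Enum):
    # Identifiers taken from the description of step S21.
    CLUSTER_INTERNAL_DATA_COPY = "ClusterInternalDataCopy"
    CLUSTER_INTERNAL_ADDRESS_SHARING = "ClusterInternalDataAddressSharing"
    INTER_CLUSTER_DATA_COPY = "InterClusterDataCopy"


@dataclass(frozen=True)
class TableRef:
    """Hypothetical description of one side of a synchronization."""
    cluster: str   # name of the Hadoop cluster
    user: str      # owner of the Hive table
    location: str  # HDFS path stored as the Hive table's Location


def judge_sync_type(source: TableRef, target: TableRef) -> SyncType:
    """Assumed rule: different clusters mean an inter-cluster copy;
    same cluster and same location mean address sharing;
    same cluster and different locations mean an internal copy."""
    if source.cluster != target.cluster:
        return SyncType.INTER_CLUSTER_DATA_COPY
    if source.location == target.location:
        return SyncType.CLUSTER_INTERNAL_ADDRESS_SHARING
    return SyncType.CLUSTER_INTERNAL_DATA_COPY
```

In a real deployment the type would more likely be recorded when the synchronization task is created; inferring it from locations is just the simplest rule that separates the three cases.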
Step S22: Execute, according to the determination result, the preselected data quality check task corresponding to that result. The details may be as follows:
When the determination result is intra-cluster data copy, check whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also check whether the metadata of the metastore tables of the Hive data warehouse is consistent.
When the determination result is intra-cluster address sharing, check whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent.
When the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
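The per-type checks of step S22 might be sketched as follows. This is a simplified illustration, not the patented implementation: the `Snapshot` structure is hypothetical and stands in for data that would really be read from HDFS and from the Hive metastore, and "file size" is interpreted here as the total size in bytes, which is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SyncType(Enum):
    INTERNAL_COPY = 1
    INTERNAL_ADDRESS_SHARING = 2
    INTER_CLUSTER_COPY = 3


@dataclass
class Snapshot:
    """Hypothetical view of one side: HDFS file sizes plus Hive table metadata."""
    file_sizes: List[int] = field(default_factory=list)
    table_metadata: Dict[str, str] = field(default_factory=dict)


def files_consistent(src: Snapshot, dst: Snapshot) -> bool:
    # Same number of files and same total size (assumed interpretation).
    return (len(src.file_sizes) == len(dst.file_sizes)
            and sum(src.file_sizes) == sum(dst.file_sizes))


def metadata_consistent(src: Snapshot, dst: Snapshot) -> bool:
    return src.table_metadata == dst.table_metadata


def run_quality_check(sync_type: SyncType, src: Snapshot, dst: Snapshot) -> bool:
    """Return True when source and target are consistent for this sync type."""
    if sync_type is SyncType.INTERNAL_ADDRESS_SHARING:
        # Both users point at the same HDFS files, so only metadata is compared.
        return metadata_consistent(src, dst)
    # Both copy types compare the HDFS files and the metadata.
    return files_consistent(src, dst) and metadata_consistent(src, dst)
```

Comparing counts and sizes is cheap but cannot detect a change that leaves both unchanged; a checksum comparison would be stricter, at a higher cost.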
Optionally, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises: saving the data quality result obtained by executing the data quality check task. The data quality result includes one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time. Recording the data quality result gives the personnel who manage the cluster system a data basis for analysis and decision-making.
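A data quality result with the fields listed above could be recorded roughly as in the sketch below. The field names, the JSON encoding and the in-memory list used as a store are all assumptions; the source does not say where or in what format the results are saved.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DataQualityResult:
    """One saved result; field names are illustrative, not from the source."""
    sync_task_id: str
    check_task_id: str
    source_file_size: int        # bytes
    target_file_size: int        # bytes
    sync_task_executed_at: str   # e.g. an ISO 8601 timestamp
    check_task_executed_at: str


def save_result(result: DataQualityResult, store: List[str]) -> None:
    """Append the result as one JSON line to a store (a plain list here)."""
    store.append(json.dumps(asdict(result), sort_keys=True))
```

Keeping one self-describing record per check makes it straightforward for an administrator to query, for example, every check in which the source and target sizes differed.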
Optionally, the data quality check task is executed at a predetermined period. In this way, data inconsistencies can be eliminated on a regular basis.
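Periodic execution can be sketched with a simple loop. The function name `run_periodically`, the fixed number of rounds and the injectable `sleep` parameter are assumptions chosen to keep the example small and testable; a production system would more likely rely on a job scheduler.

```python
import time
from typing import Callable, List


def run_periodically(check: Callable[[], bool],
                     period_seconds: float,
                     rounds: int,
                     sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Run `check` once per period for `rounds` rounds and collect the outcomes.

    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    results: List[bool] = []
    for i in range(rounds):
        results.append(check())
        if i < rounds - 1:
            # Wait one period before the next check; no wait after the last one.
            sleep(period_seconds)
    return results
```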
Step S23: When a data inconsistency is found while executing the data quality check task, re-execute the most recent data synchronization task.
Note that "re-executing the most recent data synchronization task" means performing the data synchronization once more with the data source and data target of the most recent synchronization.
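Step S23 amounts to remembering the source and target of the most recent synchronization and replaying it when a check fails. The sketch below assumes a hypothetical `Synchronizer` class and a caller-supplied `copy_fn` that performs the actual transfer (for example by invoking distcp); neither name comes from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SyncTask:
    source: str
    target: str


class Synchronizer:
    """Remembers the most recent sync so it can be re-executed on demand."""

    def __init__(self, copy_fn: Callable[[str, str], None]) -> None:
        self._copy = copy_fn
        self._last: Optional[SyncTask] = None

    def sync(self, source: str, target: str) -> None:
        self._copy(source, target)
        self._last = SyncTask(source, target)

    def resync_if_inconsistent(self, consistent: bool) -> bool:
        """Replay the last sync when the check failed; return True if replayed."""
        if consistent or self._last is None:
            return False
        self._copy(self._last.source, self._last.target)
        return True
```

Only the latest source and target pair is kept, which matches the note above that the re-execution uses the data source and data target of the most recent synchronization.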
As can be seen from the above, the data synchronization method of the present invention first determines the data synchronization type, then executes the corresponding data quality check task, and synchronizes the data again if an inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, and is simple to implement.
Fig. 3 is a schematic diagram of the main modules of the data synchronization apparatus according to an embodiment of the present invention. As shown in Fig. 3, the data synchronization apparatus 30 of the present invention may comprise a determination module 31, a check module 32 and a synchronization module 33. The determination module 31 is configured to determine the data synchronization type of the Hadoop cluster. The data synchronization type includes intra-cluster data copy, intra-cluster address sharing and inter-cluster data copy. The check module 32 is configured to execute, according to the determination result, the preselected data quality check task corresponding to that result. The synchronization module 33 is configured to re-execute the most recent data synchronization task when the check module 32 finds a data inconsistency.
Optionally, the check module 32 is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also check whether the metadata of the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent; when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, the data synchronization apparatus 30 further comprises a saving module (not shown in Fig. 3). The saving module is configured to save the data quality result obtained by the check module 32. The data quality result may include one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time. Recording the data quality result gives the personnel who manage the cluster system a data basis for analysis and decision-making.
Optionally, the check module 32 is further configured to execute the data quality check task at a predetermined period. In this way, data inconsistencies can be eliminated on a regular basis.
As can be seen from the above, the data synchronization apparatus of the present invention first determines the data synchronization type, then executes the corresponding data quality check task, and synchronizes the data again if an inconsistency exists. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters, and is simple to implement.
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions may occur. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A data synchronization method, characterized by comprising:
determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing and inter-cluster data copy;
executing, according to the determination result, the preselected data quality check task corresponding to that result;
when a data inconsistency is found while executing the data quality check task, re-executing the most recent data synchronization task.
2. The data synchronization method according to claim 1, characterized in that the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises:
when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also checking whether the metadata of the metastore tables of the Hive data warehouse is consistent;
when the determination result is intra-cluster address sharing, checking whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent;
when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also checking whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
3. The data synchronization method according to claim 1, characterized in that, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises:
saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
4. The data synchronization method according to claim 1, characterized in that the data quality check task is executed at a predetermined period.
5. A data synchronization apparatus, characterized by comprising:
a determination module, configured to determine the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing and inter-cluster data copy;
a check module, configured to execute, according to the determination result, the preselected data quality check task corresponding to that result;
a synchronization module, configured to re-execute the most recent data synchronization task when the check module finds a data inconsistency.
6. The data synchronization apparatus according to claim 5, characterized in that the check module is further configured to:
when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files of the data source and of the data target are consistent in the data copy task between different users of the same cluster, and also check whether the metadata of the metastore tables of the Hive data warehouse is consistent;
when the determination result is intra-cluster address sharing, check whether the metadata of the different users' metastore tables in the Hive data warehouse is consistent;
when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata of the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
7. The data synchronization apparatus according to claim 5, characterized by further comprising:
a saving module, configured to save the data quality result obtained by the check module, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
8. The data synchronization apparatus according to claim 5, characterized in that the check module is further configured to execute the data quality check task at a predetermined period.
CN201510500106.2A 2015-08-14 2015-08-14 Method of data synchronization and device Active CN105069128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510500106.2A CN105069128B (en) 2015-08-14 2015-08-14 Method of data synchronization and device

Publications (2)

Publication Number Publication Date
CN105069128A true CN105069128A (en) 2015-11-18
CN105069128B CN105069128B (en) 2018-11-09

Family

ID=54498498

Country Status (1)

Country Link
CN (1) CN105069128B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101499023A (en) * 2008-02-02 2009-08-05 英华达(上海)科技有限公司 Data copying and checking method
US20130013558A1 (en) * 2011-07-08 2013-01-10 Belk Andrew T Semantic checks for synchronization: imposing ordinality constraints for relationships via learned ordinality
US20140122429A1 (en) * 2012-10-31 2014-05-01 International Business Machines Corporation Data processing method and apparatus for distributed systems
CN104239493A (en) * 2014-09-09 2014-12-24 北京京东尚科信息技术有限公司 Cross-cluster data migration method and system
CN104699771A (en) * 2015-03-02 2015-06-10 北京京东尚科信息技术有限公司 Data synchronization method and clustering node

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
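Liu Wenjuan (刘文娟): "Design and Implementation of a Hadoop-Based File Synchronization Storage System", China Master's Theses Full-text Database, Information Science and Technology *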

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105847378A (en) * 2016-04-13 2016-08-10 北京思特奇信息技术股份有限公司 Big data synchronizing method and system
CN105847378B (en) * 2016-04-13 2019-06-28 北京思特奇信息技术股份有限公司 A kind of method and system for realizing that big data is synchronous
CN107818106A (en) * 2016-09-13 2018-03-20 腾讯科技(深圳)有限公司 A kind of big data off-line calculation quality of data method of calibration and device
CN107818106B (en) * 2016-09-13 2021-11-16 腾讯科技(深圳)有限公司 Big data offline calculation data quality verification method and device
CN108804206A (en) * 2017-04-26 2018-11-13 武汉斗鱼网络科技有限公司 The processing method and system of synchronous task
CN108804206B (en) * 2017-04-26 2021-04-09 武汉斗鱼网络科技有限公司 Processing method and system for synchronous task
WO2020140645A1 (en) * 2019-01-03 2020-07-09 深圳壹账通智能科技有限公司 Abnormal data provision detection method and apparatus based on data migration, and terminal device
CN110209653A (en) * 2019-06-04 2019-09-06 中国农业银行股份有限公司 HBase data migration method and moving apparatus

Also Published As

Publication number Publication date
CN105069128B (en) 2018-11-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant