CN105069128A - Data synchronization method and apparatus - Google Patents
Data synchronization method and apparatus
- Publication number
- CN105069128A (application CN201510500106.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- cluster
- check
- task
- judged result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data synchronization method and apparatus that keep data consistent within a Hadoop cluster or between Hadoop clusters and are simple to implement. The data synchronization method comprises: determining the data synchronization type of a Hadoop cluster, wherein the data synchronization type includes intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; executing, according to the determination result, a preselected data quality check task corresponding to that result; and, when the data quality check task detects a data inconsistency, re-executing the most recent data synchronization task.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data synchronization method and apparatus.
Background art
Many large IT enterprises have launched services such as cloud platforms, big data platforms, cloud computing, cloud storage, and data marts in order to share and transfer data between different enterprises or between the business departments of a single enterprise. Fig. 1 shows how data is synchronized within a cluster and between clusters in environments such as big data platforms and cloud services. However, no data quality check is performed after synchronization, so data inconsistencies may arise. Specifically, when the data of the data sharing party changes, the data previously synchronized by the data subscriber becomes inconsistent with that of the data sharing party. The data subscriber can hardly notice that the sharing party's data has changed, and the losses are often large by the time the change is discovered. After an inconsistency occurs, the data subscriber does not obtain the latest data in time, which causes cascading data errors on the subscriber side. In summary, once a data inconsistency occurs, the prior art lacks a timely notification mechanism as well as a timely, automated, and intelligent mechanism for handling data differences, and this leads to heavy losses.
Summary of the invention
In view of this, the present invention provides a data synchronization method and apparatus that keep data consistent within a Hadoop cluster or between Hadoop clusters and are simple to implement.
To achieve the above object, according to one aspect of the present invention, a data synchronization method is provided. The data synchronization method comprises: determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; executing, according to the determination result, a preselected data quality check task corresponding to that result; and, when the data quality check task detects a data inconsistency, re-executing the most recent data synchronization task.
Optionally, the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises: when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, checking whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises: saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
Optionally, the data quality check task is executed at a predetermined period.
To achieve the above object, according to another aspect of the present invention, a data synchronization apparatus is provided. The data synchronization apparatus comprises: a determination module for determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy; a check module for executing, according to the determination result, a preselected data quality check task corresponding to that result; and a synchronization module for re-executing the most recent data synchronization task when the check module detects a data inconsistency.
Optionally, the check module is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, the apparatus further comprises a saving module for saving the data quality result obtained by the check module, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
Optionally, the check module is further configured to execute the data quality check task at a predetermined period.
According to the technical solution of the present invention, the data synchronization type is determined first and the corresponding data quality check task is then executed; if a data inconsistency is found, data synchronization is performed again. The technical solution can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters and is simple to implement.
Brief description of the drawings
The accompanying drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of data synchronization within a cluster and between clusters;
Fig. 2 is a schematic diagram of the basic steps of a data synchronization method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the main modules of a data synchronization apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
Fig. 2 is a schematic diagram of the basic steps of a data synchronization method according to an embodiment of the present invention. As shown in Fig. 2, the data synchronization method may comprise the following steps S21 to S23.
Step S21: determine the data synchronization type of the Hadoop cluster. The data synchronization type includes intra-cluster data copy (ClusterInternalDataCopy), intra-cluster data address sharing (ClusterInternalDataAddressSharing), and inter-cluster data copy (InterClusterDataCopy).
Intra-cluster data copy means that data is shared between different users within the same cluster by copying the HDFS (Hadoop Distributed File System) data files of a Hive table.
Intra-cluster data address sharing means that the location of one user's data files in Hive is shared with another user of the same cluster, so that multiple users can use a single copy of the data. This is done by granting the other user access to the Location address of the first user's Hive table. In this mode, the different users share one set of HDFS data files. For data security, the users' operating permissions must be differentiated: the data publisher has the highest privileges, and subscribers have read-only access.
Inter-cluster data copy means that data is copied between different clusters, using the distcp (distributed copy) program to copy HDFS data files from Hadoop in parallel.
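The three synchronization types of step S21 can be sketched as an enumeration plus a classification function. This is a minimal sketch, not the patented implementation: the `SyncTask` fields and the rule used in `determine_sync_type` (compare cluster names first, then table locations) are assumptions made for illustration, since the description does not state how the type is determined.

```python
from dataclasses import dataclass
from enum import Enum


class SyncType(Enum):
    """The three synchronization types named in step S21."""
    CLUSTER_INTERNAL_DATA_COPY = "ClusterInternalDataCopy"
    CLUSTER_INTERNAL_DATA_ADDRESS_SHARING = "ClusterInternalDataAddressSharing"
    INTER_CLUSTER_DATA_COPY = "InterClusterDataCopy"


@dataclass(frozen=True)
class SyncTask:
    """Hypothetical description of one synchronization task."""
    task_id: str
    source_cluster: str
    target_cluster: str
    source_location: str  # HDFS location of the publisher's Hive table
    target_location: str  # HDFS location of the subscriber's Hive table


def determine_sync_type(task: SyncTask) -> SyncType:
    """Classify a task; the decision rule here is an illustrative assumption."""
    if task.source_cluster != task.target_cluster:
        return SyncType.INTER_CLUSTER_DATA_COPY
    if task.source_location == task.target_location:
        # Both users point at the same HDFS files, so nothing was copied.
        return SyncType.CLUSTER_INTERNAL_DATA_ADDRESS_SHARING
    return SyncType.CLUSTER_INTERNAL_DATA_COPY
```

Keeping the type as an enumeration lets later steps select the matching check task without comparing free-form strings.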
Step S22: execute, according to the determination result, the preselected data quality check task corresponding to that result. The detailed procedure may be as follows:
When the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent.
When the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent.
When the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
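The per-type checks of step S22 can be sketched as follows. This is an illustrative sketch under stated assumptions: `Snapshot` is a hypothetical in-memory summary of one side of a synchronization (how it is collected from HDFS and the Hive metastore is left out), file size is compared as a total over all files, and the metadata comparison is a plain dictionary equality.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    """Hypothetical summary of the data source or the data target."""
    file_sizes: Dict[str, int] = field(default_factory=dict)      # file name -> size in bytes
    table_metadata: Dict[str, str] = field(default_factory=dict)  # selected Hive table metadata


def run_check(sync_type: str, src: Snapshot, dst: Snapshot) -> List[str]:
    """Return the inconsistencies found; an empty list means the data is consistent."""
    problems: List[str] = []
    if sync_type in ("ClusterInternalDataCopy", "InterClusterDataCopy"):
        # Copy types: compare the number of HDFS files and their total size.
        if len(src.file_sizes) != len(dst.file_sizes):
            problems.append("file count differs")
        if sum(src.file_sizes.values()) != sum(dst.file_sizes.values()):
            problems.append("file size differs")
    elif sync_type != "ClusterInternalDataAddressSharing":
        raise ValueError(f"unknown synchronization type: {sync_type}")
    # All three types compare the Hive metadata of the two sides.
    if src.table_metadata != dst.table_metadata:
        problems.append("Hive metadata differs")
    return problems
```

For address sharing only the metadata is compared, because both users read the same HDFS files and there is no second copy whose file count or size could drift.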
Optionally, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises: saving the data quality result obtained by executing the data quality check task. The data quality result includes one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time. Recording the data quality result gives the personnel who manage the cluster system a data basis for analysis and decision-making.
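A data quality result with the six fields listed above could be recorded as sketched below. The field names, the use of ISO 8601 strings for the execution times, and the JSON Lines file are all assumptions for illustration; the description does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DataQualityResult:
    """One record per executed check task; field names are illustrative."""
    sync_task_id: str
    check_task_id: str
    source_file_size: int           # bytes
    target_file_size: int           # bytes
    sync_task_execution_time: str   # assumed to be an ISO 8601 timestamp
    check_task_execution_time: str  # assumed to be an ISO 8601 timestamp


def save_result(result: DataQualityResult, path: str) -> None:
    """Append the result as one JSON line so that earlier records are kept."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(result)) + "\n")
```

An append-only log keeps the full history of checks, which suits the stated purpose of later analysis by cluster administrators.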
Optionally, the data quality check task is executed at a predetermined period. In this way, data inconsistencies can be eliminated regularly.
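Periodic execution can be sketched with a simple loop. This is a minimal sketch: the function name, the `rounds` bound, and the injectable `sleep` parameter are assumptions added so the example terminates and can be exercised without waiting; a real deployment would more likely rely on an external scheduler.

```python
import time
from typing import Callable, List


def run_periodically(check: Callable[[], bool],
                     period_seconds: float,
                     rounds: int,
                     sleep: Callable[[float], None] = time.sleep) -> List[bool]:
    """Run `check` `rounds` times, waiting `period_seconds` between runs."""
    results: List[bool] = []
    for i in range(rounds):
        results.append(check())
        if i < rounds - 1:
            # No wait is needed after the final round.
            sleep(period_seconds)
    return results
```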
Step S23: when the data quality check task detects a data inconsistency, re-execute the most recent data synchronization task.
Note that the point of "re-executing the most recent data synchronization task" is to perform the data synchronization once more, using the data source and data target of the most recent synchronization action.
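Step S23 can be sketched as a function that looks up the most recent synchronization record and repeats it when the check fails. The names `SyncRecord`, `check_and_resync`, and `build_distcp_command` are hypothetical. For the inter-cluster case the sketch only builds a `hadoop distcp -update` command line rather than running it; choosing the `-update` option is an assumption, not something the description specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class SyncRecord:
    """Data source and data target of one past synchronization action."""
    source_uri: str
    target_uri: str


def build_distcp_command(record: SyncRecord) -> List[str]:
    """Build (but do not run) a distcp invocation for an inter-cluster copy."""
    return ["hadoop", "distcp", "-update", record.source_uri, record.target_uri]


def check_and_resync(history: List[SyncRecord],
                     is_consistent: Callable[[SyncRecord], bool],
                     run_sync: Callable[[SyncRecord], None]) -> Optional[SyncRecord]:
    """Repeat the most recent synchronization if its data is inconsistent.

    Returns the record that was re-executed, or None if nothing was done.
    """
    if not history:
        return None
    last = history[-1]  # only the most recent action is repeated
    if is_consistent(last):
        return None
    run_sync(last)
    return last
```

Passing `is_consistent` and `run_sync` as callables keeps the control flow independent of how the check and the copy are actually carried out for each synchronization type.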
As can be seen from the above description, the data synchronization method of the present invention first determines the data synchronization type and then executes the corresponding data quality check task; if a data inconsistency is found, data synchronization is performed again. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters and is simple to implement.
Fig. 3 is a schematic diagram of the main modules of a data synchronization apparatus according to an embodiment of the present invention. As shown in Fig. 3, the data synchronization apparatus 30 of the present invention may comprise a determination module 31, a check module 32, and a synchronization module 33. The determination module 31 is configured to determine the data synchronization type of a Hadoop cluster. The data synchronization type includes intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy. The check module 32 is configured to execute, according to the determination result, the preselected data quality check task corresponding to that result. The synchronization module 33 is configured to re-execute the most recent data synchronization task when the check module 32 detects a data inconsistency.
Optionally, the check module 32 is further configured to: when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent; when the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent; and when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
Optionally, the data synchronization apparatus 30 further comprises a saving module (not shown in Fig. 3). The saving module is configured to save the data quality result obtained by the check module 32. The data quality result may include one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time. Recording the data quality result gives the personnel who manage the cluster system a data basis for analysis and decision-making.
Optionally, the check module 32 is further configured to execute the data quality check task at a predetermined period. In this way, data inconsistencies can be eliminated regularly.
As can be seen from the above description, the data synchronization apparatus of the present invention first determines the data synchronization type and then executes the corresponding data quality check task; if a data inconsistency is found, data synchronization is performed again. The technical solution of the present invention can therefore keep data consistent within a Hadoop cluster or between Hadoop clusters and is simple to implement.
The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (8)
1. A data synchronization method, characterized by comprising:
determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy;
executing, according to the determination result, a preselected data quality check task corresponding to that result;
when the data quality check task detects a data inconsistency, re-executing the most recent data synchronization task.
2. The data synchronization method according to claim 1, characterized in that the step of executing, according to the determination result, the preselected data quality check task corresponding to that result comprises:
when the determination result is intra-cluster data copy, checking whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouse is consistent;
when the determination result is intra-cluster address sharing, checking whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent;
when the determination result is inter-cluster data copy, checking whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also checking whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
3. The data synchronization method according to claim 1, characterized in that, after the preselected data quality check task corresponding to the determination result is executed, the method further comprises:
saving the data quality result obtained by executing the data quality check task, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
4. The data synchronization method according to claim 1, characterized in that the data quality check task is executed at a predetermined period.
5. A data synchronization apparatus, characterized by comprising:
a determination module for determining the data synchronization type of a Hadoop cluster, the data synchronization type including intra-cluster data copy, intra-cluster address sharing, and inter-cluster data copy;
a check module for executing, according to the determination result, a preselected data quality check task corresponding to that result;
a synchronization module for re-executing the most recent data synchronization task when the check module detects a data inconsistency.
6. The data synchronization apparatus according to claim 5, characterized in that the check module is further configured to:
when the determination result is intra-cluster data copy, check whether the number and size of the HDFS files at the data source and at the data target of the copy task between different users of the same cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouse is consistent;
when the determination result is intra-cluster address sharing, check whether the metadata information in the metastore tables of the different users in the Hive data warehouse is consistent;
when the determination result is inter-cluster data copy, check whether the number and size of the HDFS files of the data source cluster and of the data target cluster are consistent, and also check whether the metadata information in the metastore tables of the Hive data warehouses of the data source cluster and the data target cluster is consistent.
7. The data synchronization apparatus according to claim 5, characterized by further comprising:
a saving module for saving the data quality result obtained by the check module, the data quality result including one or more of the following: data synchronization task identifier, data quality check task identifier, data source file size, target data file size, data synchronization task execution time, and data check task execution time.
8. The data synchronization apparatus according to claim 5, characterized in that the check module is further configured to execute the data quality check task at a predetermined period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510500106.2A CN105069128B (en) | 2015-08-14 | 2015-08-14 | Method of data synchronization and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510500106.2A CN105069128B (en) | 2015-08-14 | 2015-08-14 | Method of data synchronization and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105069128A (en) | 2015-11-18 |
CN105069128B (en) | 2018-11-09 |
Family
ID=54498498
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510500106.2A (Active; CN105069128B (en)) | 2015-08-14 | 2015-08-14 | Method of data synchronization and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105069128B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101499023A (en) * | 2008-02-02 | 2009-08-05 | 英华达(上海)科技有限公司 | Data copying and checking method |
US20130013558A1 (en) * | 2011-07-08 | 2013-01-10 | Belk Andrew T | Semantic checks for synchronization: imposing ordinality constraints for relationships via learned ordinality |
US20140122429A1 (en) * | 2012-10-31 | 2014-05-01 | International Business Machines Corporation | Data processing method and apparatus for distributed systems |
CN104239493A (en) * | 2014-09-09 | 2014-12-24 | 北京京东尚科信息技术有限公司 | Cross-cluster data migration method and system |
CN104699771A (en) * | 2015-03-02 | 2015-06-10 | 北京京东尚科信息技术有限公司 | Data synchronization method and clustering node |
Non-Patent Citations (1)
Title |
---|
刘文娟: ""基于Hadoop的文件同步存储系统的设计与实现"", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105847378A (en) * | 2016-04-13 | 2016-08-10 | 北京思特奇信息技术股份有限公司 | Big data synchronizing method and system |
CN105847378B (en) * | 2016-04-13 | 2019-06-28 | 北京思特奇信息技术股份有限公司 | A kind of method and system for realizing that big data is synchronous |
CN107818106A (en) * | 2016-09-13 | 2018-03-20 | 腾讯科技(深圳)有限公司 | A kind of big data off-line calculation quality of data method of calibration and device |
CN107818106B (en) * | 2016-09-13 | 2021-11-16 | 腾讯科技(深圳)有限公司 | Big data offline calculation data quality verification method and device |
CN108804206A (en) * | 2017-04-26 | 2018-11-13 | 武汉斗鱼网络科技有限公司 | The processing method and system of synchronous task |
CN108804206B (en) * | 2017-04-26 | 2021-04-09 | 武汉斗鱼网络科技有限公司 | Processing method and system for synchronous task |
WO2020140645A1 (en) * | 2019-01-03 | 2020-07-09 | 深圳壹账通智能科技有限公司 | Abnormal data provision detection method and apparatus based on data migration, and terminal device |
CN110209653A (en) * | 2019-06-04 | 2019-09-06 | 中国农业银行股份有限公司 | HBase data migration method and moving apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN105069128B (en) | 2018-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11238008B2 (en) | Automatic archiving of data store log data | |
CN105069128A (en) | Data synchronization method and apparatus | |
US9229997B1 (en) | Embeddable cloud analytics | |
US9923966B1 (en) | Flexible media storage and organization in automated data storage systems | |
AU2016405587B2 (en) | Splitting and moving ranges in a distributed system | |
DE112012005037B4 (en) | Manage redundant immutable files using deduplications in storage clouds | |
CA2892889C (en) | Scaling computing clusters | |
US7958088B2 (en) | Dynamic data reorganization to accommodate growth across replicated databases | |
US20160156631A1 (en) | Methods and systems for shared file storage | |
CN104580439B (en) | Method for uniformly distributing data in cloud storage system | |
US10698890B2 (en) | Dual overlay query processing | |
CN104978336A (en) | Unstructured data storage system based on Hadoop distributed computing platform | |
US9984139B1 (en) | Publish session framework for datastore operation records | |
US10204021B2 (en) | Recovery of an infected and quarantined file in a primary storage controller from a secondary storage controller | |
CN105095384B (en) | The method and apparatus that data are carried down | |
CN105205154A (en) | Data migration method and device | |
CN109254998B (en) | Data management method, Internet of things equipment, database server and system | |
EP3158478B1 (en) | Embeddable cloud analytics | |
JP6059558B2 (en) | Load balancing judgment system | |
KR102016417B1 (en) | Data server device configured to manage distributed lock of file together with client device in storage system employing distributed file system | |
CN111159140A (en) | Data processing method and device, electronic equipment and storage medium | |
US10379959B1 (en) | Techniques and systems for physical manipulation of data storage devices | |
US11222036B1 (en) | Data warehouse access reporting | |
US10033737B2 (en) | System and method for cross-cloud identity matching | |
US10303553B2 (en) | Providing data backup |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||