CN108540511A

CN108540511A - A data synchronization method for Hadoop clusters

Info

Publication number: CN108540511A
Application number: CN201710122295.3A
Authority: CN
Inventors: 杨佩; 胡宏; 王清; 王一清; 罗慧; 刘梅招; 高海龙; 朱力鹏; 胡斌
Original assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Global Energy Interconnection Research Institute
Current assignee: State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd; Global Energy Interconnection Research Institute
Priority date: 2017-03-03
Filing date: 2017-03-03
Publication date: 2018-09-14

Abstract

The present invention proposes a data synchronization method for Hadoop clusters. A secure link is first established between two Hadoop clusters: cluster A sends a synchronization request together with its own private key to cluster B; cluster B encrypts the private key of cluster A with its own public key and sends the resulting key to cluster A; cluster A receives the key and uses it as its own public key, and the secure connection is then established. The two clusters each compute strong and weak checksums for the file data blocks. Before data synchronization, the checksums of the file blocks are compared, and data blocks whose strong and weak checksums are all equal are regarded as identical data blocks of the file. Only the differing data are transmitted during file synchronization; identical data blocks are not transmitted. While receiving the difference data, the receiving cluster deletes the differing part of its local data and adds the data transmitted from the other cluster to form the new data.

Description

A data synchronization method for Hadoop clusters

Technical field

The present invention relates to data synchronization techniques for storage systems in cloud computing, and in particular to a data synchronization method for Hadoop clusters.

Background art

With the rapid development of the information age, society as a whole is entering a "digital" era. Global data continues to grow at an explosive rate, and this ever-growing mass of data poses new challenges to traditional storage systems. This is an era of information explosion, in which the information on the Internet grows geometrically. Against this background, the bulk of CPU-intensive computation has gradually shifted from improving software performance to information processing itself, so that large enterprises face a major challenge: they need to mine useful information from terabytes or even petabytes of data and process these data quickly and efficiently. Data storage is the cornerstone of data management, so how to store big data, and how to migrate it between different clusters, is a problem well worth studying. Enterprises need to build storage systems with high throughput, high reliability, and scalability. In daily production and operation an enterprise generally runs many business systems, and in that case it must maintain multiple filing systems that are independent of and different from one another, which greatly increases the maintenance and management burden and leads to poor overall scalability. If a unified data storage platform is built as the common filing system of multiple business systems, the filing systems of all business systems can be integrated into one system.

Faced with this severe situation, the question is how to obtain valuable information from massive data and process these data efficiently and accurately. Hadoop is an open-source distributed file system and parallel computing programming model capable of distributed processing of massive data. With Hadoop as the big data development platform, deployed in its pseudo-distributed cluster mode, a distributed file storage system represented by HDFS is used to store and read the files of massive data. As user demands keep changing, big data service facilities need to share data between clusters, which allows service quality to be improved.

When data must be transmitted between two clusters, a request alone, with no prior basis, should not immediately trigger the synchronous transfer of file data; doing so easily leads to unsafe outcomes such as data loss.

Summary of the invention

To remedy the above deficiencies in the prior art, the object of the present invention is to provide a data synchronization method for Hadoop clusters that takes into account the security of data transmission between clusters. When a file needs to be transmitted, the method determines whether the two clusters store different versions of the file, computes checksums to identify the difference data, and then transmits only the difference data, thereby saving network bandwidth, increasing the data transmission rate, and improving service quality.

The object of the present invention is achieved by the following technical solution:

The present invention provides a data synchronization method for Hadoop clusters, the improvement being that the method comprises the following steps:

Step 1: The cluster adds a timestamp when storing file data blocks to implement version control;

Step 2: When cluster A and cluster B need to synchronize data for the first time, a mutual-trust connection must be established between them;

Step 3: Cluster A generates a random key, encrypts it with the public key from cluster B, and sends the encrypted key to cluster B;

Step 4: After receiving the ciphertext, cluster B decrypts it with its own private key to obtain the communication key, and a secure communication connection is established between cluster A and cluster B;

Step 5: Cluster A synchronizes to cluster B the file information of the file and its index information in cluster A;

Step 6: Cluster B receives from cluster A the information about the newly uploaded file and queries whether the local cluster has a storage record of the file;

Step 7: Determine whether cluster B stores a historical version of the file sent by cluster A;

Step 8: If cluster B already stores the file transmitted by cluster A, cluster B compares the timestamp information of the file from cluster A with that of the locally stored file, determines the latest version, and selects the corresponding operation;

Step 9: Compare the timestamp information of cluster A and cluster B, and synchronize the file information and index information of cluster A and cluster B;

Step 10: Determine the Adler-32 checksums of cluster A and cluster B;

Step 11: Based on the Adler-32 checksums, cluster A compares the weak check values of each data block of the two versions of the file to find the data blocks with identical weak check codes, and then compares the strong check values (MD5 values) of the data blocks whose weak check values are identical;

Step 12: Cluster A determines the required difference data information and transmits it to cluster B.

Further, in Step 1, within a fixed time period T set by the administrator, the cluster NameNode synchronizes the data copies of different versions in the cluster according to the timestamp information of each data copy;

Once timestamp information is added when file data blocks are stored, the historical versions and the latest version of a specific file can be distinguished, helping the administrator implement version control of the stored file blocks; during file synchronization, the timestamp information of the data blocks of the file being synchronized determines whether a synchronization operation is needed.

Further, in Step 2, cluster A sends an HTTPS-based communication request to cluster B; after receiving the request, cluster B sends its own trusted certificate, i.e., its public key, to cluster A, so that a secure communication connection is established between cluster A and cluster B; when a new file added to one of the clusters needs to be synchronized to the other cluster, the information of the file can be sent to the other cluster over the secure connection.

Further, in Step 4, during data synchronization between cluster A and cluster B, file transmissions are encrypted with the generated key; at this point the connection between the two clusters is established.

Further, in Step 5, the default number of replicas of a file data block in a Hadoop cluster is 3: in addition to the locally stored file block, a second replica is stored on another storage node within the same rack, and a third replica is stored on a storage node in a different rack. When a new file is stored, the location information of the file in this cluster is first transmitted to the target cluster, and the file itself is transmitted only when cluster B needs the file data;

Step 4 ensures that a secure connection is established between the two clusters. After a user uploads a new file to cluster A, cluster A synchronizes to cluster B the information of the file and its index information in cluster A, rather than transmitting the file itself directly to cluster B.

Further, in Step 6, cluster B may have stored a historical version of the file before; based on the information from cluster A, it queries whether the local cluster has stored the file; if not, the index information of the file in cluster A is stored directly in the NameNode as metadata file information;

When querying whether the local cluster has a storage record of the file: if not, the file information and the index information of the file in cluster A are saved; if so, go to Step 8.

Further, in Step 7, cluster B saves the index information of the file in cluster A; when processing a user request requires the file, cluster B sends a file transmission request, and after receiving the request cluster A transmits the file to cluster B according to the index address; if a historical version of the file is stored in cluster B, go to Step 10;

If cluster B has not stored the file locally, it stores the index information of the file in cluster A, and when a user access requires the file, it sends a file transmission request to cluster A according to the index information.

Further, in Step 8,

Cluster B has stored a historical version of the file. Cluster B compares the timestamp information of the file from cluster A with that of the locally stored file to determine the newer version: if the timestamp of the locally stored file is earlier than that of the file transmitted by cluster A, the local file is deleted and the index information of the file is updated to the index in cluster A; if the timestamp of the locally stored file is newer than that of the file transmitted by cluster A, go to Step 9;

If the timestamp of the file in cluster A is newer, cluster B updates its local index information for the file and does not delete the historical version information of the file previously stored in the local cluster; if the timestamp of the file stored in the local cluster is newer than the timestamp in the file information synchronized from cluster A, cluster B synchronizes to cluster A the storage index information of the file in cluster B.

Further, in Step 9, cluster B receives the information of the file from cluster A; cluster B also stores the file locally, and the timestamps satisfy Tb > Ta, so cluster B actively sends the local file information and index information of the file to cluster A; after receiving the file information, cluster A updates its local file information and index information, which completes the synchronization process.

Further, in Step 10, cluster A transmits the file to cluster B, and cluster B stores a historical version of the file; according to the size of the file, the administrator of cluster B sets an agreed file length D and sends it to cluster A, and cluster A and cluster B each compute the Adler-32 checksum of the file data blocks according to the agreed file length D using formulas (1) to (5); the value of the agreed file length D is itself exchanged as a message between cluster A and cluster B;

Cluster A and cluster B each compute the Adler-32 checksum of the data blocks of the file according to the agreed file length D using the following formulas:

From the checksum of X_1 ... X_n and the byte values X_1 to X_n, the checksum of X_2 ... X_{n+1} can be computed simply and quickly. The Adler-32 checksum algorithm serves as the weak checksum algorithm, with the following formulas:

$$s(k, l) = a(k, l) + 2^{16}\, b(k, l) \tag{3}$$

$$a(k+1, l+1) = \bigl(a(k, l) - X_k + X_{l+1}\bigr) \bmod M \tag{4}$$

$$b(k+1, l+1) = \bigl(b(k, l) - (l - k + 1)\, X_k + a(k+1, l+1)\bigr) \bmod M \tag{5}$$

Formulas (4) and (5) implement, starting from the first position of the file, the rolling checksum of data blocks whose length is the agreed file length D; at the same time, the strong check value of the file data blocks, i.e., the 128-bit MD5 checksum, is computed according to the agreed file length;

Where: a(k, l) is the checksum computed by cluster A for the data blocks of the file according to the agreed file length D; s(k, l) is the rolling checksum of the bytes; b(k, l) is the checksum computed by cluster B for the data blocks of the file according to the agreed file length D; X_i denotes the byte stream of the file block; k and l are the start and end positions of a range whose length is the agreed file length D; M is the modulus used in the calculation, set to M = 2^16; X_k denotes the file data block numbered k, and X_{l+1} denotes the file data block numbered l+1.
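Formulas (1) and (2) are referenced in Step 10 but do not appear in this text. Recurrences (4) and (5) coincide with the rolling checksum used by rsync. Assuming that is the intended formulation, and reading a and b as the two component sums of a single block's checksum (an interpretation that this text does not state), the missing definitions would plausibly be the following, and recurrence (5) then follows directly:

```latex
% Assumed definitions (not present in the source text):
a(k, l) = \Bigl(\sum_{i=k}^{l} X_i\Bigr) \bmod M
\qquad
b(k, l) = \Bigl(\sum_{i=k}^{l} (l - i + 1)\, X_i\Bigr) \bmod M

% Check of recurrence (5) under these definitions:
\sum_{i=k+1}^{l+1} (l - i + 2)\, X_i
  = \underbrace{\sum_{i=k+1}^{l} (l - i + 1)\, X_i}_{b(k,l) - (l-k+1) X_k}
  + \underbrace{\sum_{i=k+1}^{l+1} X_i}_{a(k+1,\,l+1)}
```

Under this reading, sliding the window by one byte costs a constant number of additions regardless of the agreed length D.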

Further, in Step 11, cluster B transmits the strong and weak check values of the data blocks of the file to cluster A, and cluster A compares the two versions of the file. (Cluster A is the initiator of the file synchronization; the cluster B side computes the differences between the two files to be synchronized and, once the computation is finished, feeds the difference data information back to cluster A; cluster A obtains the feedback and compares the two different versions of the file, i.e., the new file in A and the file to be synchronized in B, which are two versions of the same file.) Cluster A looks up the weak check value of each data block to find the data blocks with identical weak check codes, and then compares the strong check values (MD5 values) of those data blocks; if these are also identical, the contents of the two data blocks are considered the same. The lookup is repeated until the end of the file.

Further, in Step 12, cluster A determines the difference data information to be transmitted by comparing the two versions of the file and transmits it to cluster B; the difference data information includes the data blocks whose check codes were found to differ in Step 10 and the index information of the corresponding positions of the difference data in the file; after receiving the information, cluster B combines it with the historical version of the file stored in the local cluster to compose the latest version of the file and begins serving users.

Compared with the closest prior art, the technical solution provided by the present invention has the following beneficial effects:

In the data synchronization method for Hadoop clusters provided by the present invention, when data must be transmitted between clusters, a secure link is first established between the two clusters, and data transmission starts only after the secure link is in place, which addresses the security risk at its source. The amount of data stored in a Hadoop cluster is often very large; if the file to be transmitted is very large, the transfer not only occupies a large amount of network bandwidth but also degrades the user experience and increases the service response time of the system. In the present invention, after the secure link is established between the two clusters, strong and weak checksums are computed for the stored data blocks. Each data block in a cluster has a unique pair of strong and weak checksums. Before file transmission starts, the data block checksums of the different versions of the same file in the two clusters are compared to find the differing data and the identical data, and only the differing data are then transmitted to the other cluster. This reduces the network bandwidth occupied during data transmission, shortens the file synchronization time, allows file requests to be answered faster, and improves the overall service quality of the system.

Description of the drawings

Fig. 1 is the flow chart of the method for data synchronization provided by the invention towards Hadoop clusters.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawing.

The following description and drawing sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The embodiments merely represent possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the invention encompasses the full scope of the claims and all available equivalents of the claims. Herein, these embodiments of the invention may be referred to, individually or collectively, by the term "invention" merely for convenience, and without intending to automatically limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed.

Embodiment 1

This embodiment of the present invention proposes that, when data must be transmitted between clusters, a secure link is first established between the two clusters, and data transmission starts only after the secure link is in place, which addresses the security risk at its source. File transmission begins once the secure link is established.

Fig. 1 shows the flow chart of the data synchronization method for Hadoop clusters provided by the embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step 1: The cluster adds a timestamp when storing file data blocks to implement version control. Within a fixed time period T set by the administrator, the cluster NameNode synchronizes the data copies of different versions in the cluster according to the timestamp information of each data copy.

Once timestamp information is added to the data blocks stored by the file system, the historical versions and the latest version of a specific file can be distinguished, helping the administrator implement version control of the stored file blocks. During file synchronization, the timestamp information of the blocks of the file being synchronized determines whether a synchronization operation is needed.
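The timestamp-based version check of Step 1 can be illustrated with a minimal Python sketch. The `BlockCopy` record, its field names, and the node names are hypothetical and chosen only for illustration; the text does not specify how the NameNode represents copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockCopy:
    block_id: str     # identifier of the data block (hypothetical format)
    node: str         # storage node holding this copy (hypothetical name)
    timestamp: float  # time the copy was written, e.g. seconds since the epoch


def latest_copy(copies):
    """Return the copy with the newest timestamp."""
    return max(copies, key=lambda c: c.timestamp)


def stale_nodes(copies):
    """Return the nodes whose copy is older than the newest one and needs syncing."""
    newest = latest_copy(copies).timestamp
    return [c.node for c in copies if c.timestamp < newest]


copies = [
    BlockCopy("blk_1", "node-1", 100.0),
    BlockCopy("blk_1", "node-2", 250.0),
    BlockCopy("blk_1", "node-3", 100.0),
]
print(latest_copy(copies).node)  # node-2
print(stale_nodes(copies))       # ['node-1', 'node-3']
```

A periodic job running every period T could call `stale_nodes` per block and refresh only the listed copies.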

Step 2: A mutual-trust connection must be established when two clusters need to synchronize data for the first time. Cluster A sends an HTTPS-based communication request to cluster B. After receiving the request, cluster B sends its own trusted certificate (public key) to cluster A.

HTTPS is an HTTP channel designed for security. Its security foundation is SSL, so the encryption details are handled by SSL. SSL is a security protocol that provides security and data integrity for network communication, ensuring secure data transmission over the Internet.
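As a hedged illustration of the HTTPS request in Step 2, the sketch below prepares a certificate-verifying connection object from cluster A to cluster B using the Python standard library. The host name and port are hypothetical placeholders, and current Python versions negotiate TLS, the successor of the SSL protocol named in the text.

```python
import http.client
import ssl

# Hypothetical address of cluster B's synchronization endpoint (not from the source).
CLUSTER_B_HOST = "cluster-b.example.org"
CLUSTER_B_PORT = 8443


def make_tls_context():
    """Build a client context that verifies the peer certificate and host name."""
    return ssl.create_default_context()


def open_sync_channel(host=CLUSTER_B_HOST, port=CLUSTER_B_PORT):
    """Prepare an HTTPS connection object to cluster B.

    Constructing the object does not contact the network; the handshake,
    in which cluster B presents its certificate, happens on the first request.
    """
    return http.client.HTTPSConnection(
        host, port, context=make_tls_context(), timeout=10
    )
```

Verifying cluster B's certificate before any key material is exchanged is what gives the later steps their trust anchor.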

Step 3: Cluster A generates a random key, encrypts it with the public key from cluster B, and sends the encrypted key to cluster B.

Step 4: After receiving the ciphertext, cluster B decrypts it with its own private key to obtain the communication key. During data synchronization between the two clusters, file transmissions are encrypted with this generated key. The connection between the two clusters is now established.

Security is the first factor to consider when synchronizing files between different clusters; once the secure connection is established, the file synchronization transfer is protected.
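The key exchange of Steps 2 to 4 can be sketched as follows. This is a toy illustration only: it uses textbook RSA with tiny fixed primes so that it runs with the standard library alone, which is not secure and is an assumption of this sketch, since the text does not name a public-key algorithm. All function names are hypothetical.

```python
import secrets


def make_toy_keypair():
    """Return (public, private) toy RSA keys. Tiny primes: illustration only."""
    p, q = 61, 53
    n = p * q                # modulus 3233
    phi = (p - 1) * (q - 1)  # 3120
    e = 17                   # public exponent, coprime to phi
    d = pow(e, -1, phi)      # private exponent (modular inverse, Python 3.8+)
    return (n, e), (n, d)


def encrypt(public_key, message):
    n, e = public_key
    return pow(message, e, n)


def decrypt(private_key, ciphertext):
    n, d = private_key
    return pow(ciphertext, d, n)


# Step 2: cluster B hands its public key to cluster A.
b_public, b_private = make_toy_keypair()

# Step 3: cluster A draws a random key and encrypts it with B's public key.
session_key_a = secrets.randbelow(b_public[0] - 2) + 2
ciphertext = encrypt(b_public, session_key_a)

# Step 4: cluster B recovers the communication key with its private key.
session_key_b = decrypt(b_private, ciphertext)
print(session_key_a == session_key_b)  # True
```

In a real deployment the shared value would be a full-length symmetric key and the public-key operation would come from a vetted cryptographic library.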

Step 5: After a user uploads a new file to cluster A, cluster A synchronizes to cluster B the information of the file and its index information in cluster A. The file itself is not transmitted directly to cluster B.

The default number of replicas of a file data block in a Hadoop cluster is 3: in addition to the locally stored file block, a second replica is stored on another storage node within the same rack, and a third replica is stored on a storage node in a different rack. This number of replicas is sufficient to maintain the high availability of the Hadoop cluster. When a new file is stored, its location information in this cluster is first transmitted to the target cluster, and the file itself is transmitted only when cluster B needs the file data.

Step 6: Cluster B receives the file information sent by cluster A and queries whether the local cluster stores the file. If not, it saves the file information and the index information of the file in cluster A. If so, go to Step 8.

Cluster B may have stored a historical version of the file before, so it queries, based on the information from cluster A, whether the local cluster has stored this file. If not, the index information of the file in cluster A is stored directly in the NameNode as metadata file information.

Step 7: Cluster B has saved the index information of the file in cluster A. When processing a user request requires the file, cluster B sends a file transmission request, and after receiving the request cluster A transmits the file to cluster B according to the index address. If cluster B has stored a historical version of the file, go to Step 10;

In the previous step, cluster B had no local copy of the file and stored the index information of the file in cluster A; when a user access requires the file, cluster B sends a file transmission request to cluster A according to the index information.
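Steps 5 to 7 amount to "announce metadata first, fetch the file on demand". A minimal Python sketch of cluster B's side follows; the class, its method names, the returned labels, and the `fetch_from_a` callback are hypothetical stand-ins for whatever transport the two clusters actually use.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str         # file name
    timestamp: float  # version timestamp from Step 1
    index: str        # location of the file inside cluster A


class ClusterB:
    def __init__(self):
        self.local_files = {}   # name -> file content stored locally
        self.remote_index = {}  # name -> FileInfo pointing into cluster A

    def on_file_info(self, info):
        """Step 6: handle the metadata announcement sent by cluster A."""
        if info.name in self.local_files:
            return "go_to_step_8"          # a local version exists: compare timestamps
        self.remote_index[info.name] = info
        return "index_saved"               # only the index is stored, no file data

    def read(self, name, fetch_from_a):
        """Step 7: fetch the file from cluster A the first time a user needs it."""
        if name not in self.local_files:
            info = self.remote_index[name]
            self.local_files[name] = fetch_from_a(info.index)
        return self.local_files[name]
```

Deferring the transfer until `read` is called means files that no user of cluster B ever requests never consume inter-cluster bandwidth.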

Step 8: Cluster B also stores locally the file transmitted by cluster A. Cluster B compares the timestamp information of the two files to determine the newer version: if the timestamp of the locally stored file is earlier than that of the file transmitted by cluster A, the local file is deleted and the index information of the file is updated to the index in cluster A. If the timestamp of the locally stored file is newer than that of the file transmitted by cluster A, go to Step 9.

Cluster B has stored a historical version of the file; after comparing the timestamp information of the file in cluster A with that of the file stored in the local cluster, it determines which version is the latest and selects the corresponding operation. If the timestamp of the file in cluster A is newer, cluster B updates its local index information for the file and does not delete the historical version information of the file previously stored in the local cluster; if the timestamp of the file stored in the local cluster is newer than the timestamp in the file information synchronized from cluster A, cluster B synchronizes to cluster A the storage index information of the file in cluster B.

Step 9: Cluster B receives the information of the file from cluster A; cluster B also stores the file locally, and the timestamps satisfy Tb > Ta, so cluster B actively sends the local file information and index information of the file to cluster A. After receiving the file information, cluster A updates its local file storage and index information, which completes the synchronization process.
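The timestamp comparison in Steps 8 and 9 reduces to a three-way decision, sketched below in Python. The function name and the returned labels are hypothetical, and the equal-timestamp branch is an assumption of this sketch, since the text only covers the two unequal cases.

```python
def choose_action(ta, tb):
    """Decide what cluster B does given timestamp ta (cluster A) and tb (cluster B)."""
    if ta > tb:
        # Cluster A holds the newer version: B updates its index to point at A (Step 8).
        return "B_updates_index_from_A"
    if tb > ta:
        # Cluster B holds the newer version: B sends its file and index info to A (Step 9).
        return "B_sends_info_to_A"
    # Equal timestamps are not discussed in the text; treated here as already in sync.
    return "already_in_sync"


print(choose_action(200.0, 100.0))  # B_updates_index_from_A
print(choose_action(100.0, 200.0))  # B_sends_info_to_A
```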

Step 10: Cluster A transmits the file to cluster B, and cluster B stores a historical version of the file. The two clusters each compute the Adler-32 checksum of the file data blocks according to the agreed file length D using the following formulas.

$$s(k, l) = a(k, l) + 2^{16}\, b(k, l) \tag{3}$$

where s(k, l) is the rolling checksum of the bytes. For fast computation, M = 2^16 is generally used. The advantage of this checksum algorithm is that it exploits a recurrence relation effectively, so successive values can be computed quickly.

$$a(k+1, l+1) = \bigl(a(k, l) - X_k + X_{l+1}\bigr) \bmod M \tag{4}$$

$$b(k+1, l+1) = \bigl(b(k, l) - (l - k + 1)\, X_k + a(k+1, l+1)\bigr) \bmod M \tag{5}$$

Where: X_i denotes the byte stream of the file block; k and l are the start and end positions of a range whose length is the agreed length D; M is the modulus used in the calculation; X_k denotes the file data block numbered k, and X_{l+1} denotes the file data block numbered l+1.

With formulas (4) and (5), the rolling checksum of data blocks of length D can be computed starting from the first position of the file with very little computation. The strong check value of the file data blocks, i.e., the 128-bit MD5 checksum, is then computed according to the agreed file length as well.
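The recurrences (4) and (5) can be checked with a short Python sketch. It assumes the rsync-style definitions a(k, l) = (sum of X_i) mod M and b(k, l) = (sum of (l - i + 1) X_i) mod M, which are not given in this text, treats each X_i as one byte, and uses 0-based positions. Function names are hypothetical.

```python
M = 1 << 16  # modulus M = 2^16, as set in the text


def direct(data, k, l):
    """Compute a(k,l), b(k,l), s(k,l) from scratch over bytes data[k..l] inclusive."""
    a = sum(data[i] for i in range(k, l + 1)) % M
    b = sum((l - i + 1) * data[i] for i in range(k, l + 1)) % M
    return a, b, a + (b << 16)  # formula (3)


def roll(a, b, k, l, data):
    """Slide the window [k, l] one byte to the right using formulas (4) and (5)."""
    a_next = (a - data[k] + data[l + 1]) % M
    b_next = (b - (l - k + 1) * data[k] + a_next) % M
    return a_next, b_next, a_next + (b_next << 16)


def rolling_checksums(data, D):
    """Yield s for every window of length D, computing only the first one directly."""
    if len(data) < D:
        return
    k, l = 0, D - 1
    a, b, s = direct(data, k, l)
    yield s
    while l + 1 < len(data):
        a, b, s = roll(a, b, k, l, data)
        k, l = k + 1, l + 1
        yield s
```

Each call to `roll` does a constant amount of work, whereas `direct` costs time proportional to D, which is the saving the recurrence is meant to provide.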

The MD5 check value is computed as follows:

The MD5 algorithm runs four rounds in total; after the four rounds the message is reduced to a 128-bit hash value, i.e., the message digest. The four rounds use the following four formulas respectively:

Step 11: Cluster B transmits the strong and weak check values of the data blocks of the file to cluster A. Cluster A compares the weak check values of each data block of the two versions of the file to find the data blocks with identical weak check codes, and then compares the strong check values (MD5 values) of those blocks; if these are also identical, the contents of the two data blocks are considered the same. The lookup is repeated until the end of the file.

Having compared the information of the two versions, cluster A can determine the differences between them; when file transmission starts, transmitting only the difference data is enough to guarantee that cluster B holds the complete information of the latest version of the file.
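The two-level comparison of Step 11 can be sketched in Python as follows. This is an illustration under assumptions rather than the patented procedure: the delta format (`("copy", block_index)` and `("data", bytes)` tuples) and all function names are hypothetical, the weak checksum uses the assumed rsync-style definition with M = 2^16, and for brevity the weak checksum is recomputed at each offset instead of being rolled.

```python
import hashlib

M = 1 << 16


def weak(block):
    """Weak checksum s = a + 2^16 * b of one block (assumed rsync-style definition)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)


def signatures(old, D):
    """Cluster B side: map weak checksum -> {MD5 digest: block index} for the old version."""
    table = {}
    for idx in range(len(old) // D):  # full blocks only
        block = old[idx * D:(idx + 1) * D]
        table.setdefault(weak(block), {})[hashlib.md5(block).digest()] = idx
    return table


def compute_delta(new, table, D):
    """Cluster A side: describe the new version as block references plus literal data."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + D]
        match = None
        if len(block) == D:
            candidates = table.get(weak(block))  # cheap weak check first
            if candidates is not None:
                match = candidates.get(hashlib.md5(block).digest())  # then MD5
        if match is None:
            literal.append(new[pos])  # no identical block here: keep the byte as difference data
            pos += 1
            continue
        if literal:
            delta.append(("data", bytes(literal)))
            literal = bytearray()
        delta.append(("copy", match))
        pos += D
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


old = b"abcdefghijklmnop"
new = b"abcdXXefghmnop"
print(compute_delta(new, signatures(old, 4), 4))
# [('copy', 0), ('data', b'XX'), ('copy', 1), ('copy', 3)]
```

The MD5 digest is computed only for windows whose weak checksum already matches, so the expensive check runs rarely.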

Step 12: By comparing the two versions of the file, cluster A determines the difference data that needs to be transmitted and transmits this information to cluster B; the information includes the data blocks whose check codes were found to differ in Step 10 and the index information of the corresponding positions of the difference data in the file. After receiving the information, cluster B combines it with the historical version of the file stored in the local cluster to compose the latest version and begins serving users.

Different versions of a file necessarily share identical data. By comparing check values, the difference data between the two versions of the file are identified and only those are synchronized, which saves network bandwidth, shortens transmission time, and speeds up the response of cluster B to user access. Because cluster B stores a historical version of the file, it only needs to perform a recombination-like operation on the file using the information and difference data sent by cluster A to form the latest version and begin providing file service to users.
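The recombination on cluster B in Step 12 can be sketched as follows. The delta format, a list of `("copy", block_index)` and `("data", bytes)` tuples, is a hypothetical encoding of the "difference data plus position index" described in the text, and `apply_delta` is an illustrative name.

```python
def apply_delta(old, delta, D):
    """Rebuild the latest version from the local old version and the received delta.

    ("copy", i) reuses block i (length D) of the old version already stored on
    cluster B; ("data", b) inserts difference bytes received from cluster A.
    """
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value * D:(value + 1) * D]
        else:
            out += value
    return bytes(out)


old = b"abcdefghijklmnop"
delta = [("copy", 0), ("data", b"XX"), ("copy", 1), ("copy", 3)]
print(apply_delta(old, delta, 4))  # b'abcdXXefghmnop'
```

In this example only the two bytes `XX` plus three small block references cross the network, while the remaining twelve bytes come from the copy cluster B already holds.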

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art may still modify or equivalently replace specific embodiments of the invention; any such modification or equivalent replacement that does not depart from the spirit and scope of the present invention falls within the scope of the pending claims of the present invention.

Claims

1. A data synchronization method for Hadoop clusters, characterized in that the method comprises the following steps:

Step 1: The cluster adds a timestamp when storing file data blocks to implement version control;

Step 2: When cluster A and cluster B need to synchronize data for the first time, a mutual-trust connection must be established between them;

Step 3: Cluster A generates a random key, encrypts it with the public key from cluster B, and sends the encrypted key to cluster B;

Step 4: After receiving the ciphertext, cluster B decrypts it with its own private key to obtain the communication key, and a secure communication connection is established between cluster A and cluster B;

Step 6: Cluster B receives from cluster A the information about the newly uploaded file and queries whether the local cluster has a storage record of the file;

Step 7: Determine whether cluster B stores a historical version of the file sent by cluster A;

Step 8: If cluster B already stores the file transmitted by cluster A, cluster B compares the timestamp information of the file from cluster A with that of the locally stored file, determines the latest version, and selects the corresponding operation;

Step 9: Compare the timestamp information of cluster A and cluster B, and synchronize the file information and index information of cluster A and cluster B;

Step 10: Determine the Adler-32 checksums of cluster A and cluster B;

Step 11: Based on the Adler-32 checksums, cluster A compares the weak check values of each data block of the two versions of the file to find the data blocks with identical weak check codes, and then compares the strong check values (MD5 values) of the data blocks whose weak check values are identical;

Step 12: Cluster A determines the required difference data information and transmits it to cluster B.

2. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 1, within a fixed time period T set by the administrator, the cluster NameNode synchronizes the data copies of different versions in the cluster according to the timestamp information of each data copy;

Once timestamp information is added when file data blocks are stored, the historical versions and the latest version of a specific file can be distinguished, helping the administrator implement version control of the stored file blocks; during file synchronization, the timestamp information of the data blocks of the file being synchronized determines whether a synchronization operation is needed.

3. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 2, cluster A sends an HTTPS-based communication request to cluster B; after receiving the request, cluster B sends its own trusted certificate, i.e., its public key, to cluster A, so that a secure communication connection is established between cluster A and cluster B; when a new file added to one of the clusters needs to be synchronized to the other cluster, the information of the file can be sent to the other cluster over the secure connection.

4. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 4, during data synchronization between cluster A and cluster B, file transmissions are encrypted with the generated key; at this point the connection between the two clusters is established.

5. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 5, the default number of copies of a file data block in a Hadoop cluster is 3: in addition to the locally stored file block, the second copy is stored on another storage node in the same rack, and the third copy is stored on a storage node in a different rack; when a new file is stored, the location information of the file file in this cluster is first sent to the target cluster, and the file is transferred only when cluster B needs the data of the file file;

Step 4 ensures that a secure connection is established between the two clusters; after a user uploads a new file file in cluster A, cluster A synchronizes to cluster B the information of the file file and its index information in cluster A, rather than directly transferring the file itself to cluster B.
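To illustrate the idea of synchronizing only file information and index information, and not the file content, a minimal sketch is given below. The field names, the JSON encoding and the index address format are all assumptions made for illustration; the text does not define a message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileAnnouncement:
    """Metadata that cluster A sends instead of the file itself (hypothetical)."""
    name: str            # file name
    size: int            # file size in bytes
    timestamp: float     # timestamp information of the file
    source_cluster: str  # cluster in which the file is stored
    index_address: str   # index information of the file in the source cluster


def encode(announcement: FileAnnouncement) -> bytes:
    """Serialize the announcement for transmission over the secure connection."""
    return json.dumps(asdict(announcement), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> FileAnnouncement:
    """Rebuild the announcement on the receiving cluster."""
    return FileAnnouncement(**json.loads(payload.decode("utf-8")))
```

The payload stays small regardless of the file size, which is the point of deferring the actual transfer until cluster B needs the data.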

6. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 6, cluster B uses the information from cluster A to query whether the local cluster has previously stored the file file, i.e. an old version of it; if not, the index information of the file file in cluster A is stored directly in the NameNode as metafile information;

When querying whether the local cluster has a storage record of the file file: if there is none, the file information and the index information of the file in cluster A are saved; if there is one, go to Step 8.

7. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 7, cluster B saves the index information of the file file in cluster A; when processing a user request task requires the file file, cluster B sends a file transfer request, and after receiving the request cluster A transfers the file to cluster B according to the index address; if an old version of the file file is stored in cluster B, go to Step 10;

If cluster B has not stored the file file, it stores locally the index information of the file file in cluster A, and when a user access requires the file file, it sends a file transfer request to cluster A according to the index information.
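The behavior of cluster B in Steps 6 and 7 (record the index first, fetch the file only on demand) can be sketched as follows. This is an assumption-laden illustration: `RemoteIndex` is a hypothetical in-memory stand-in for the index information kept in the NameNode, and `fetch` stands for whatever transfer request cluster B sends to cluster A.

```python
from typing import Callable, Dict


class RemoteIndex:
    """Hypothetical stand-in for the index information cluster B keeps."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # asks cluster A for the data at an index address
        self._remote: Dict[str, str] = {}    # file name -> index address in cluster A
        self._local: Dict[str, bytes] = {}   # file name -> content stored in cluster B

    def has_local(self, name: str) -> bool:
        """Step 6: does the local cluster have a storage record of the file?"""
        return name in self._local

    def record(self, name: str, index_address: str) -> None:
        """Step 7: save only the index information; no file data is transferred."""
        self._remote[name] = index_address

    def read(self, name: str) -> bytes:
        """Serve a user request, fetching from cluster A only on first access."""
        if name not in self._local:
            self._local[name] = self._fetch(self._remote[name])
        return self._local[name]
```

With this arrangement a file that no user of cluster B ever requests is never transferred between the clusters.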

8. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 8,

Cluster B has stored an old version of the file file; cluster B compares the timestamp information of the file file from cluster A with that of the locally stored file to determine which version is newer; if the timestamp of the locally stored file is earlier than that of the file transmitted by cluster A, the local file is deleted and the index information of the file file is updated to the index in cluster A; if the timestamp of the locally stored file is newer than that of the file transmitted by cluster A, go to Step 9;

If the timestamp of the file file in cluster A is newer, cluster B updates its local index information for the file and does not delete the old version information of the file file previously stored in the local cluster; if the timestamp of the file stored in the local cluster is newer than the timestamp in the file information synchronized from cluster A, cluster B synchronizes to cluster A the index information of the file file as stored in cluster B.
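The branching in Step 8 reduces to a comparison of the two timestamps, here written Tb for cluster B's copy and Ta for cluster A's copy. The sketch below is illustrative only; the enum and function names are hypothetical, and the equal-timestamp case, which the text does not address, is assumed to require no action.

```python
from enum import Enum


class Step8Action(Enum):
    UPDATE_INDEX_TO_CLUSTER_A = "point cluster B's index at cluster A's copy"
    GO_TO_STEP_9 = "send cluster B's file and index information to cluster A"
    NOTHING_TO_DO = "timestamps are equal (assumption: already in sync)"


def step8_action(tb: float, ta: float) -> Step8Action:
    """Decide what cluster B does, given its timestamp tb and cluster A's ta."""
    if tb < ta:
        # Cluster A holds the newer version.
        return Step8Action.UPDATE_INDEX_TO_CLUSTER_A
    if tb > ta:
        # Cluster B holds the newer version.
        return Step8Action.GO_TO_STEP_9
    # Not covered by the text; treated here as "no synchronization needed".
    return Step8Action.NOTHING_TO_DO
```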

9. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 9, cluster B receives the information of the file file from cluster A; cluster B also stores the file file locally, with timestamp information Tb > Ta; cluster B actively sends its local file information and index information of the file file to cluster A; after receiving the file information, cluster A updates its local file information and index information, which completes the synchronization process.

10. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 10, cluster A transfers the file file to cluster B, in which an old version of the file file is stored; according to the size of the file, the administrator of cluster B sets an agreed file length D and sends it to cluster A; cluster A and cluster B each calculate the Adler-32 checksums of the file data blocks of the agreed file length D using formulas (1)-(5); the process of agreeing on the value of the file length D is an exchange of information between cluster A and cluster B;

Given the checksum of X1 ... Xn and the byte values X1 and Xn+1, the checksum of X2 ... Xn+1 can be calculated simply and quickly; the Adler-32 checksum algorithm serves as the weak check algorithm, with the following formulas:

s(k, l) = a(k, l) + 2^{16} b(k, l)    (3)

a(k+1, l+1) = (a(k, l) - X_k + X_{l+1}) mod M    (4)

b(k+1, l+1) = (b(k, l) - (l - k + 1) X_k + a(k+1, l+1)) mod M    (5)

Starting from the first position of the file, formulas (4) and (5) give the rolling checksum of each data block of the agreed file length D; at the same time, the strong check value of each file data block, i.e. the 128-bit MD5 checksum, is calculated according to the agreed file length;

where: a(k, l) is the checksum calculated by cluster A over the data blocks of the file file according to the agreed file length D; s(k, l) is the byte rolling checksum; b(k, l) is the checksum calculated by cluster B over the data blocks of the file file according to the agreed file length D; X_i denotes the byte stream of a file block; k and l are the start and end positions of a range, namely the range of length covered by the agreed file length D; M is the modulus used in the calculation, set to M = 2^16; X_k denotes the data block of the file file numbered k, and X_{l+1} denotes the data block of the file file numbered l+1.
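The rolling update of formulas (4) and (5) can be checked with a short sketch. Formulas (1) and (2) are not reproduced in this text, so the direct definitions used below (a as the sum of the bytes in the window, b as the position-weighted sum, both mod M) are an assumption taken from the usual rsync-style rolling checksum with which (4) and (5) are consistent. Here a and b are treated as the two 16-bit halves combined by formula (3). Because M = 2^16 is used as stated in the text, the values differ from those of the standard Adler-32 (which uses modulus 65521). All function names are hypothetical.

```python
import hashlib
from typing import Tuple

M = 1 << 16  # modulus M = 2^16, as set in the text


def weak_checksum(block: bytes) -> Tuple[int, int]:
    """Compute a(k, l) and b(k, l) directly (assumed form of formulas (1), (2))."""
    n = len(block)
    a = sum(block) % M
    # The byte at offset i inside the window has weight (l - i + 1) = n - i.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def combine(a: int, b: int) -> int:
    """Formula (3): s(k, l) = a(k, l) + 2^16 * b(k, l)."""
    return a + (b << 16)


def roll(a: int, b: int, x_out: int, x_in: int, d: int) -> Tuple[int, int]:
    """Slide a window of length d by one byte using formulas (4) and (5)."""
    a_new = (a - x_out + x_in) % M        # formula (4)
    b_new = (b - d * x_out + a_new) % M   # formula (5), with l - k + 1 = d
    return a_new, b_new


def strong_checksum(block: bytes) -> str:
    """128-bit MD5 strong check value, as a hexadecimal string."""
    return hashlib.md5(block).hexdigest()
```

Each call to `roll` costs a constant number of operations regardless of D, which is what makes checking every byte offset of the file affordable.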

11. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 11, cluster B transfers the strong and weak check values of the data blocks of the file file to cluster A; cluster A compares the weak check values of the data blocks of the two versions of the file file to find the data blocks whose weak check values are identical, and then compares the strong check values (MD5 values) of those data blocks; if these are also identical, the contents of the two data blocks are considered consistent; the search is repeated until the end of the file.
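The two-level comparison in Step 11 can be sketched as follows. This is a minimal illustration under several assumptions: the old version is split into non-overlapping blocks of length D on cluster B, cluster A slides a window over the new version one byte at a time, the weak value uses M = 2^16 with the rsync-style definitions of a and b, and all function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

M = 1 << 16  # modulus M = 2^16, as set in the text


def weak(block: bytes) -> Tuple[int, int]:
    """Weak check components a and b of one block (assumed definitions)."""
    n = len(block)
    return sum(block) % M, sum((n - i) * x for i, x in enumerate(block)) % M


def signatures(old: bytes, d: int) -> Dict[int, List[Tuple[str, int]]]:
    """Cluster B side: weak value -> [(MD5 hex digest, block number)]."""
    table: Dict[int, List[Tuple[str, int]]] = {}
    for n in range(len(old) // d):
        block = old[n * d:(n + 1) * d]
        a, b = weak(block)
        entry = (hashlib.md5(block).hexdigest(), n)
        table.setdefault(a + (b << 16), []).append(entry)
    return table


def find_matches(new: bytes, d: int,
                 table: Dict[int, List[Tuple[str, int]]]) -> Dict[int, int]:
    """Cluster A side: offset in the new version -> matching old block number."""
    matches: Dict[int, int] = {}
    if len(new) < d:
        return matches
    k = 0
    a, b = weak(new[:d])
    while True:
        hit = None
        candidates = table.get(a + (b << 16))
        if candidates:  # weak values agree, so compare the strong values
            digest = hashlib.md5(new[k:k + d]).hexdigest()
            hit = next((n for md5, n in candidates if md5 == digest), None)
        if hit is not None:
            matches[k] = hit
            k += d  # skip past the matched block
            if k + d > len(new):
                break
            a, b = weak(new[k:k + d])
        else:
            if k + d >= len(new):
                break
            a = (a - new[k] + new[k + d]) % M  # rolling update, formula (4)
            b = (b - d * new[k] + a) % M       # rolling update, formula (5)
            k += 1
    return matches
```

The MD5 digest is computed only when the weak values already agree, so the expensive strong check runs on a small fraction of the offsets.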

12. The data synchronization method for Hadoop clusters according to claim 1, characterized in that, in Step 12, cluster A determines the difference data information that needs to be transmitted by comparing the two different versions of the file file, and transfers the difference data information to cluster B; the difference data information comprises the data blocks whose check codes were found to differ in Step 10 and the index information of the positions of the difference data in the file; after receiving the information, cluster B combines it with the old version of the file file stored in the local cluster to form the latest version of the file, and starts to provide service to users.
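How the difference data information might be built on cluster A and applied on cluster B is sketched below. The instruction format (`"copy"` with an old block number, `"data"` with literal bytes) is a hypothetical encoding chosen for illustration, and `matches` is assumed to map offsets in the new version to block numbers of the old version, as produced by the comparison in Step 11.

```python
from typing import Dict, List, Tuple, Union

Instruction = Tuple[str, Union[int, bytes]]


def build_delta(new: bytes, d: int, matches: Dict[int, int]) -> List[Instruction]:
    """Cluster A side: describe the new version as block references plus literals."""
    delta: List[Instruction] = []
    literal = bytearray()
    k = 0
    while k < len(new):
        if k in matches:
            if literal:  # flush pending difference data first
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", matches[k]))  # block already held by cluster B
            k += d
        else:
            literal.append(new[k])  # byte that must actually be transmitted
            k += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, d: int, delta: List[Instruction]) -> bytes:
    """Cluster B side: combine the old version with the received difference data."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value * d:(value + 1) * d]
        else:
            out += value
    return bytes(out)
```

In the example used in the test, only the two inserted bytes travel between the clusters; the other sixteen bytes are rebuilt from the old version already stored in cluster B.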