CN112069256A

CN112069256A - Data synchronization device on server cluster and synchronization method thereof

Info

Publication number: CN112069256A
Application number: CN202010877410.XA
Authority: CN
Inventors: 刘慧兴
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2020-12-11

Abstract

The invention discloses a data synchronization device on a server cluster and a synchronization method thereof. Difference calculation descriptors and the corresponding execution synchronization types are defined, and the descriptors are acquired and the synchronization is executed incrementally and in parallel. When an update of the data set is triggered, it is determined whether the corresponding cache data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, difference calculation descriptors are acquired and synchronization operations are executed incrementally and in parallel until the whole data set is synchronized. In this way, the method and device can incrementally obtain the difference calculation descriptors between the mounted source data set and the local node cache data set, and perform the corresponding synchronization operation according to the type of each descriptor.

Description

Data synchronization device on server cluster and synchronization method thereof

Technical Field

The present invention relates to the field of data set synchronization technologies, and in particular, to a synchronization method and apparatus for data sets applied to a server cluster.

Background

In the era of artificial intelligence and big data, increasing importance is attached to protecting and using data. To improve working efficiency and simplify the workflow of technical personnel, an organization generally manages its data sets and the associated permissions centrally, and builds its own computing cluster so that computing and storage resources can be used efficiently and conveniently. To analyze and process data, a technician first applies for several computing nodes from the server cluster, then obtains permission for the corresponding data set and synchronizes the data set to each node. This ensures the accuracy of data processing and avoids the network, IO and other pressure caused by directly accessing the mounted source data set. Usually the mounted source data set is pulled in full to the local node cache directory. However, when the source data set has changed little or not at all between two synchronizations, full synchronization wastes both resources (CPU, network, IO, etc.) and time, and in severe cases causes a 'false death' (an apparent hang).

Disclosure of Invention

The technical problem mainly solved by the invention is to provide a data synchronization device on a server cluster and a synchronization method thereof. By further refining the differences between data sets, defining different difference calculation descriptors, and defining different synchronization operations according to the descriptor type, the invention reduces resource occupation and shortens synchronization time while ensuring consistency between the mounted data set and the node cache data set.

In order to solve the technical problems, the invention adopts a technical scheme that: an apparatus for data synchronization on a server cluster is provided, comprising:

a definition module for defining difference calculation descriptors and execution synchronization types;

the directory comparison module is used for comparing the difference between the mounted data set and the local node cache data set;

the synchronous execution module executes corresponding operation according to the difference calculation descriptor to synchronize data from the source end to the destination end;

and the resource allocation and control module is used for respectively setting the occupation of each resource of the directory comparison module and the synchronous execution module according to the available resources of the user task and the size of the data set, and controlling the recovery of the resources.
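As a rough illustration of how the resource allocation and control module might divide threads between the directory comparison module and the synchronous execution module, the sketch below uses a made-up heuristic. The function name `allocate_threads`, the 10 GiB threshold and the split ratios are assumptions chosen for illustration; the description only states that the split depends on the available resources and the size of the data set.

```python
from typing import Tuple

# Assumed threshold above which a data set is treated as "large".
LARGE_DATASET_BYTES = 10 * 1024**3


def allocate_threads(available_cpus: int, dataset_bytes: int) -> Tuple[int, int]:
    """Return (compare_threads, sync_threads) for one synchronization task.

    Hypothetical heuristic: large data sets are assumed to be dominated by
    copying, so most threads go to synchronous execution; small data sets
    split the threads evenly.
    """
    total = max(2, available_cpus)  # always keep one thread per module
    if dataset_bytes >= LARGE_DATASET_BYTES:
        compare_threads = max(1, total // 4)
    else:
        compare_threads = max(1, total // 2)
    return compare_threads, total - compare_threads
```

For example, under these assumptions a task with 8 CPUs and a 20 GiB data set would get 2 comparison threads and 6 synchronization threads.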

Further, the directory comparison module comprises a directory traversal unit, a file or directory comparison unit and a task management and progress storage unit; the directory traversal unit is used for acquiring the attribute information of all files in the directory; the file or directory comparison unit is used for comparing a directory structure, the modify time and the size to obtain a difference calculation descriptor; the task management and progress storage unit is used for recording the difference calculation descriptor currently acquired by each comparison task, the task state and the index read by the synchronous execution module.
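The comparison of modify time and size performed by the file or directory comparison unit could look roughly like the following sketch for a single directory level. It uses `os.scandir` from the Python standard library as the traversal primitive; the function name `compare_directory_level` and the tuple layout of the results are assumptions, and the sketch presumes both directories contain the same entry names (which is what equal directory modify times suggest).

```python
import os


def compare_directory_level(src_dir, dst_dir):
    """Compare one directory level by file modify time and size.

    Returns (changed, subdirs): `changed` lists (src_path, descriptor) pairs
    for files whose modify time or size differs, and `subdirs` lists
    (src_subdir, dst_subdir) pairs still to be compared.
    """
    changed, subdirs = [], []
    with os.scandir(src_dir) as entries:
        for entry in entries:
            dst_path = os.path.join(dst_dir, entry.name)
            if entry.is_dir(follow_symlinks=False):
                subdirs.append((entry.path, dst_path))
                continue
            src_stat = entry.stat(follow_symlinks=False)
            # Assumption: the same name exists on the destination side.
            dst_stat = os.stat(dst_path, follow_symlinks=False)
            if (src_stat.st_mtime_ns != dst_stat.st_mtime_ns
                    or src_stat.st_size != dst_stat.st_size):
                changed.append((entry.path, "FILE_CHANGED_TYPE"))
    return changed, subdirs
```

Only file metadata is read here, never file contents, which is what keeps the comparison cheap in terms of network and IO.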

Further, when synchronizing data, the synchronous execution module updates the modify time of the corresponding local node cache data set to match the modify time of the mounted source data set.

A synchronization method of the device for data synchronization on a server cluster defines difference calculation descriptors and execution synchronization types, and acquires the descriptors and executes the synchronization incrementally and in parallel. When an update of the data set is triggered, it is determined whether the corresponding cache data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, difference calculation descriptors are acquired and synchronization operations are executed incrementally and in parallel until the whole data set is synchronized. During this incremental parallel acquisition and execution, the mounted source data set serves as the reference and the local node cache serves as the destination path.

The method specifically comprises the following steps:

step 1, defining the directory comparison difference calculation descriptors and the corresponding synchronization operation types;

step 2, setting the numbers of directory comparison threads and synchronous execution threads according to the resources allocated to the task and the size of the data set;

step 3, handling the special cases of the input srcUrl and dstUrl: if dstUrl does not exist, performing copy synchronization directly using multiple threads; if either srcUrl or dstUrl is not a directory, copying srcUrl directly using multiple threads and deleting dstUrl;

step 4, when both srcUrl and dstUrl are directories, acquiring the directory comparison difference calculation descriptors incrementally and in parallel and executing the corresponding synchronization operations; the directory comparison module updates the task schedule layer by layer and directory by directory, the synchronous execution module, whenever it is idle, requests the task's current difference calculation descriptors and executes them, and the two modules run in parallel.
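The hand-off in step 4, where the comparison side keeps appending descriptors while the synchronization side consumes whatever is already available, can be sketched as a small shared structure. The class name `TaskSchedule`, its methods and the demo data are hypothetical; a condition variable is used here instead of repeated polling, which is an implementation choice of this sketch and not something the description specifies.

```python
import threading


class TaskSchedule:
    """Shared record of difference descriptors plus the consumer's read index."""

    def __init__(self):
        self._cond = threading.Condition()
        self._records = []       # descriptors published so far
        self._read_index = 0     # next record the sync side will read
        self._finished = False   # set once comparison is complete

    def publish(self, records):
        with self._cond:
            self._records.extend(records)
            self._cond.notify_all()

    def finish(self):
        with self._cond:
            self._finished = True
            self._cond.notify_all()

    def next_record(self):
        """Block until a record is available; return None once drained and finished."""
        with self._cond:
            while self._read_index >= len(self._records) and not self._finished:
                self._cond.wait()
            if self._read_index < len(self._records):
                record = self._records[self._read_index]
                self._read_index += 1
                return record
            return None


def run_demo():
    schedule, executed = TaskSchedule(), []

    def compare():
        # One publish per directory layer, as results become available.
        for layer in (["a"], ["b", "c"], ["d"]):
            schedule.publish([("ADD_TYPE", url) for url in layer])
        schedule.finish()

    def sync():
        while True:
            record = schedule.next_record()
            if record is None:
                break
            executed.append(record)  # a real worker would copy or delete here

    threads = [threading.Thread(target=compare), threading.Thread(target=sync)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

With a single consumer, records are executed in the order they were published, and synchronization can start before the comparison of deeper layers has finished.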

The invention has the following beneficial effects: by using the inherent attribute information of files and directories in the operating system, the invention defines differentiated handling of directory comparison results, which both ensures that the node cache data set accurately matches the mounted data set and reduces the occupation of resources such as network and IO; in addition, the invention allows CPU resources and the corresponding operations to be configured more flexibly when data sets are synchronized across a server cluster, accelerating data synchronization in a controllable way.

Drawings

FIG. 1 is an architecture diagram of a preferred embodiment of a data synchronization apparatus on a server cluster according to the present invention;

FIG. 2 is a table showing the correspondence between difference calculation descriptors and execution synchronization types in a synchronization method of a data synchronization apparatus on a server cluster;

FIG. 3 is a flowchart illustrating a directory comparison process in a synchronization method of a data synchronization apparatus on a server cluster;

FIG. 4 is a flowchart illustrating incremental parallel acquisition of difference calculation descriptors and synchronous execution in a synchronization method of a data synchronization apparatus on a server cluster.

Detailed Description

The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and will thus define the scope of the invention more clearly.

The embodiment of the invention comprises the following steps:

A data synchronization apparatus on a server cluster, as shown in FIG. 1, avoids a full copy when the server cluster synchronizes a data set by defining difference calculation descriptors and execution synchronization types, combined with acquiring the descriptors and executing the synchronization incrementally and in parallel. The apparatus comprises: a definition module for defining the difference calculation descriptors and the execution synchronization types; a directory comparison module for comparing the difference between the mounted data set and the local node cache data set; a synchronous execution module that executes the corresponding operation according to the difference calculation descriptor to synchronize data from the source end to the destination end; and a resource allocation and control module for setting the resource occupation of the directory comparison module and the synchronous execution module according to the resources available to the user task and the size of the data set, and for controlling the recovery of those resources.

A synchronization method of a data synchronization device on a server cluster comprises the following: defining difference calculation descriptors and execution synchronization types, a directory comparison process, and incremental parallel acquisition of the difference calculation descriptors together with synchronous execution. When an update of the data set is triggered, it is determined whether the corresponding cache data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, the difference calculation descriptors are acquired and the synchronization operations executed incrementally and in parallel until the whole data set is synchronized. During synchronization, the mounted source data set serves as the reference and the local node cache serves as the destination path.
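For the case where the cache data set does not yet exist, a multithreaded full copy that also carries the source timestamps over to the cache might be sketched as follows. The function name `parallel_copy_tree` and the default worker count are assumptions; `shutil.copy2` and `shutil.copystat` are standard-library calls that copy file metadata (including modify time) along with or after the data.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def parallel_copy_tree(src_root, dst_root, workers=4):
    """Copy src_root to dst_root with a thread pool, preserving modify times."""
    files, dirs = [], []
    for current, _subdirs, names in os.walk(src_root):
        rel = os.path.relpath(current, src_root)
        target = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target, exist_ok=True)
        dirs.append((current, target))
        files.extend((os.path.join(current, n), os.path.join(target, n))
                     for n in names)

    # copy2 copies file contents and then the file's metadata.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda pair: shutil.copy2(*pair), files))

    # Directory times are set last, because writing files into a directory
    # changes that directory's modify time.
    for src_dir, dst_dir in reversed(dirs):
        shutil.copystat(src_dir, dst_dir)
```

Keeping the cache timestamps equal to the source timestamps is what allows a later comparison to rely on modify time and size alone.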

Referring to FIG. 2, the defined difference calculation descriptors and the corresponding execution synchronization operation types are as follows:

(1) When the local node cache data set does not exist, it is copied directly using multiple threads, the specific number of threads being set dynamically by the user according to the resources and the size of the data set; when the mounted source data set and the local node cache data set are files, the mounted source data set is copied directly and the local node cache data set is deleted.

(2) When the mounted source data set and the local node cache data set are directories, the data sets at the two ends are compared, and each differing url is output together with its calculation descriptor. If the calculation descriptor of the url is ADD_TYPE, the url does not exist in the local node cache path, and it is copied directly using multiple threads; if the calculation descriptor of the url is DELETE_TYPE, the url has been deleted from the mounted source data, and the corresponding url in the local node cache is deleted directly; if the calculation descriptor of the url is FILE_CHANGED_TYPE, the file is inconsistent between the source end and the destination end, and the corresponding file in the local node cache is overwritten by a direct copy; if the calculation descriptor of the url is DIR_CHANGED_TYPE, the directory is a leaf directory or a mixed directory, and it is synchronized incrementally using rsync directly.

(3) The whole directory comparison and synchronous execution process runs in an incremental parallel mode. The comparison progress is updated layer by layer and directory by directory; as soon as synchronous execution is idle, it obtains the calculation descriptors available so far and executes the synchronization operation immediately, rather than waiting until all difference calculation descriptors have been generated.
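The mapping from descriptor type to synchronization operation listed above can be sketched as an enumeration with a dispatch function. The names `DiffType` and `apply_descriptor` are hypothetical, and the rsync flags `-a --delete` are one plausible choice for the incremental directory case; the description names rsync but does not say which options are used.

```python
import enum
import os
import shutil
import subprocess


class DiffType(enum.Enum):
    ADD_TYPE = "add"
    DELETE_TYPE = "delete"
    FILE_CHANGED_TYPE = "file_changed"
    DIR_CHANGED_TYPE = "dir_changed"


def apply_descriptor(kind, src, dst):
    """Execute the synchronization operation for one difference descriptor."""
    if kind is DiffType.ADD_TYPE:
        # Missing in the cache: copy the file or the whole directory.
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
    elif kind is DiffType.DELETE_TYPE:
        # Removed from the source: remove the cached copy.
        if os.path.isdir(dst) and not os.path.islink(dst):
            shutil.rmtree(dst)
        else:
            os.remove(dst)
    elif kind is DiffType.FILE_CHANGED_TYPE:
        # Overwrite the cached file; copy2 also copies the modify time.
        shutil.copy2(src, dst)
    elif kind is DiffType.DIR_CHANGED_TYPE:
        # Leaf or mixed directory: delegate to rsync (assumed flags).
        subprocess.run(
            ["rsync", "-a", "--delete",
             src.rstrip("/") + "/", dst.rstrip("/") + "/"],
            check=True,
        )
    else:
        raise ValueError(f"unknown descriptor: {kind!r}")
```

A synchronization worker would call `apply_descriptor` once per record it reads, so each record can be executed independently of the others.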

The specific implementation process is as follows:

S1, when the data set update operation is triggered, srcUrl, dstUrl and the available CPU resources are input; when dstUrl does not exist, the data set is copied directly using multiple threads and the operation exits. All copy operations in this document receive special handling: after the contents are synchronized, the attributes of the node cache data set are set according to the attributes of the corresponding mounted source data url.

S2, when dstUrl exists, srcUrl and dstUrl are compared for consistency; the directory comparison flow is shown in FIG. 3. If the two are consistent, no update is needed and the operation exits directly; if not, the add-comparison-task interface of the directory comparison module is called and the returned taskId is stored;

S3, when the attributes of srcUrl and dstUrl differ, for example srcUrl is a directory and dstUrl is a file, srcUrl and dstUrl are added directly to the difference record table with the difference calculation descriptors ADD_TYPE and DELETE_TYPE respectively, and the synchronous execution module is called to perform the synchronization.

S4, when srcUrl and dstUrl have the same attributes and both are directories, and their modify times are the same, srcUrl is traversed. In this scheme the directory is traversed by invoking the system call directly, so the buffer size can be set according to the actual directory, which in some cases performs better than the traversal functions provided by the system. Subdirectories are added directly to the task queue. For the files in the directory, the modify time and file size are compared; if they are not equal, the url of the file is added to the difference record table with the difference calculation descriptor FILE_CHANGED_TYPE. After srcUrl has been traversed, the elements of the difference record table are synchronized into the task schedule;

S5, if the modify times of srcUrl and dstUrl are different, dstUrl is traversed first:

S51, if dstUrl is a leaf directory or a mixed directory, srcUrl is added directly to the difference record table with the difference calculation descriptor DIR_CHANGED_TYPE; the comparison of this directory then ends and the task schedule is updated;

S52, if the dstUrl directory is empty, srcUrl is added directly to the difference record table with the difference calculation descriptor ADD_TYPE; the comparison of this directory then ends and the task schedule is updated;

S53, if the dstUrl directory is a pure directory, i.e., all of its children are directories, the names of all its subdirectories are recorded in a hash table, and the srcUrl directory is then traversed:

S531, if srcUrl contains non-directory entries, srcUrl is added directly to the difference record table with the difference calculation descriptor DIR_CHANGED_TYPE; the comparison of this directory then ends and the task schedule is updated;

S532, if srcUrl is a pure directory, each of its subdirectories is checked against the hash table one by one; if a subdirectory is present, its url is added to the task queue and the flag bit of the corresponding hash table element is set to 1; if it is not present, its url is added to the difference record table with the difference calculation descriptor ADD_TYPE.

S533, after the traversal of this directory layer is complete, every url whose hash table element has a flag bit of 0 is added to the difference record table with the difference calculation descriptor DELETE_TYPE.

S534, the elements of the difference record table are synchronized into the task schedule, and the comparison of this directory ends.

S6, the calculation descriptors are acquired and the synchronization operations executed in an incremental parallel manner. When a thread performing synchronization is idle, it uses the taskId from step S2 to query the directory comparison task schedule, which stores the progress of every comparison task, i.e., the difference calculation descriptors obtained so far. Meanwhile, the directory comparison threads acquire the difference calculation descriptors between the mounted data set and the node cache data set layer by layer and directory by directory, and update the schedule incrementally. The flow of incremental parallel difference acquisition and execution is shown in FIG. 4.

S7, the task queue is traversed and compared recursively by jumping back to step S4, until the task queue is empty and all difference calculation descriptors have been executed, at which point the update of the whole data set is complete.
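The directory-level comparison of step S5, including the hash table with flag bits used in S53, can be sketched as below. The function name `compare_changed_dir` and the return layout are hypothetical. Two points are assumptions of this sketch: the empty-directory check of S52 is done before the leaf or mixed check of S51 so the two cases cannot overlap, and a DELETE_TYPE record stores the cache-side path, since the entry no longer exists on the source side.

```python
import os


def compare_changed_dir(src_dir, dst_dir):
    """Compare one directory pair whose modify times differ (step S5).

    Returns (diffs, queue): `diffs` holds (url, descriptor) records and
    `queue` holds (src_subdir, dst_subdir) pairs for further comparison.
    """
    with os.scandir(dst_dir) as it:
        dst_entries = list(it)

    if not dst_entries:                                  # S52: empty cache dir
        return [(src_dir, "ADD_TYPE")], []
    if any(not e.is_dir() for e in dst_entries):         # S51: leaf or mixed
        return [(src_dir, "DIR_CHANGED_TYPE")], []

    # S53: pure directory; name -> flag bit (0 = not seen on the source side)
    matched = {e.name: 0 for e in dst_entries}

    with os.scandir(src_dir) as it:
        src_entries = sorted(it, key=lambda e: e.name)
    if any(not e.is_dir() for e in src_entries):         # S531
        return [(src_dir, "DIR_CHANGED_TYPE")], []

    diffs, queue = [], []
    for entry in src_entries:                            # S532
        if entry.name in matched:
            matched[entry.name] = 1
            queue.append((entry.path, os.path.join(dst_dir, entry.name)))
        else:
            diffs.append((entry.path, "ADD_TYPE"))

    for name in sorted(matched):                         # S533
        if matched[name] == 0:
            diffs.append((os.path.join(dst_dir, name), "DELETE_TYPE"))
    return diffs, queue
```

Each call looks at a single directory layer only, which is what lets the results be published to the task schedule layer by layer while deeper directories are still waiting in the queue.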

Here, url (Uniform Resource Locator) refers to the path of a file or directory; srcUrl (source url) refers to the path of the mounted data set; dstUrl (destination url) refers to the path of the local node cache data set.
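With srcUrl and dstUrl understood as plain filesystem paths, the entry checks described in steps S1 to S3 reduce to a small decision function. The name `plan_sync` and the returned labels are hypothetical and only serve to show the branching; the actual copy, delete and comparison work is left out.

```python
import os


def plan_sync(src_url, dst_url):
    """Decide how to start synchronizing src_url into dst_url.

    Returns one of three illustrative labels:
      "COPY"    - the cache does not exist, do a multithreaded full copy
      "REPLACE" - at least one side is not a directory, copy the source
                  and delete the old cache entry
      "COMPARE" - both sides are directories, run the directory comparison
    """
    if not os.path.exists(dst_url):
        return "COPY"
    if not (os.path.isdir(src_url) and os.path.isdir(dst_url)):
        return "REPLACE"
    return "COMPARE"
```

Only the "COMPARE" branch needs the incremental parallel machinery; the other two branches finish with a direct copy.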

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. An apparatus for data synchronization on a cluster of servers, comprising:

and the synchronous execution module executes corresponding operation according to the difference calculation descriptor so as to synchronize data from the source end to the destination end.

2. The device for data synchronization on a server cluster according to claim 1, further comprising a resource allocation and control module, respectively setting the occupation of each resource of the directory comparison module and the synchronous execution module according to the available resource of the user task and the size of the data set, and controlling the recovery of the resource.

3. The apparatus for data synchronization on a server cluster according to claim 2, wherein the directory comparison module comprises a directory traversal unit, a file or directory comparison unit, and a task management and progress storage unit.

4. The apparatus for data synchronization on a server cluster according to claim 3, wherein the directory traversal unit is used for acquiring the attribute information of all files in the directory.

5. The apparatus for data synchronization on a server cluster according to claim 3, wherein the file or directory comparison unit is used for comparing the directory structure, the modify time and the size to obtain the difference calculation descriptor.

6. The apparatus for data synchronization on a server cluster according to claim 3, wherein the task management and progress storage unit is used for recording the difference calculation descriptor currently acquired by each comparison task, the task state, and the index read by the synchronous execution module.

7. The apparatus for data synchronization on a server cluster according to claim 5, wherein, when synchronizing data, the synchronous execution module updates the modify time of the corresponding local node cache data set according to the modify time of the mounted source data set.

8. A synchronization method of the apparatus for data synchronization on a server cluster according to any one of the preceding claims, comprising: defining difference calculation descriptors and execution synchronization types, and acquiring the difference calculation descriptors and executing the synchronization incrementally and in parallel; when an update of the data set is triggered, determining whether the corresponding cache data set exists on the local server node, and copying directly using multiple threads if it does not exist; if the cache path exists, acquiring the difference calculation descriptors and executing the synchronization operations incrementally and in parallel until the whole data set is synchronized.

9. The synchronization method according to claim 8, wherein, during the incremental parallel acquisition of the difference calculation descriptors and the execution of the synchronization operations, the mounted source data set serves as the reference and the local node cache serves as the destination path.

10. The method according to claim 9, comprising the following steps: