CN112069256A - Data synchronization device on server cluster and synchronization method thereof - Google Patents

Data synchronization device on server cluster and synchronization method thereof

Info

Publication number
CN112069256A
Authority
CN
China
Prior art keywords
data set
directory
difference calculation
descriptor
synchronization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010877410.XA
Other languages
Chinese (zh)
Inventor
刘慧兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010877410.XA
Publication of CN112069256A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data synchronization device on a server cluster and a synchronization method thereof. Difference calculation descriptors and the corresponding synchronization operation types are defined, and the descriptors are acquired and the synchronization operations executed in an incremental, parallel manner. When an update of the data set is triggered, it is determined whether the corresponding cached data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, difference calculation descriptors are acquired incrementally, in parallel with the execution of the synchronization operations, until the whole data set is synchronized. In this way, the difference calculation descriptors between the mounted source data set and the local node's cached data set can be obtained incrementally, and the corresponding synchronization operation is carried out according to the type of each descriptor.

Description

Data synchronization device on server cluster and synchronization method thereof
Technical Field
The present invention relates to the field of data set synchronization technologies, and in particular, to a synchronization method and apparatus for data sets applied to a server cluster.
Background
In the era of artificial intelligence and big data, increasing importance is attached to protecting and using data. To improve working efficiency and simplify the workflow of technical personnel, an organization generally manages its data sets and their access permissions centrally, and may also build its own computing cluster so that computing and storage resources can be used efficiently and conveniently. When analyzing and processing data, a technician first applies for several computing nodes from the server cluster, then obtains permission for the corresponding data set and synchronizes the data set to each node. Synchronizing to each node ensures the accuracy of data processing and avoids the network and I/O pressure caused by accessing the mounted source data set directly. Usually the mounted source data set is pulled in full to the cache directory of the local node. However, when the source data set has changed little or not at all between two synchronizations, full synchronization wastes both resources (CPU, network, I/O, etc.) and time, and in severe cases causes the node to appear hung (the "false death" phenomenon).
Disclosure of Invention
The technical problem mainly solved by the invention is to provide a data synchronization device on a server cluster and a synchronization method thereof, which can reduce resource occupation and shorten synchronization time under the condition of ensuring consistency of a mounted data set and a node cache data set by further refining differences among data sets, defining different difference calculation descriptors and defining different synchronization operations according to the types of the difference calculation descriptors.
In order to solve the technical problems, the invention adopts a technical scheme that: an apparatus for data synchronization on a server cluster is provided, comprising:
a definition module, used for defining the difference calculation descriptors and the corresponding synchronization operation types;
a directory comparison module, used for comparing the mounted data set with the local node's cached data set to find their differences;
a synchronization execution module, used for executing the operation corresponding to each difference calculation descriptor so as to synchronize data from the source end to the destination end;
and a resource allocation and control module, used for setting the resource usage of the directory comparison module and of the synchronization execution module according to the resources available to the user task and the size of the data set, and for controlling the reclamation of those resources.
Further, the directory comparison module comprises a directory traversal unit, a file or directory comparison unit, and a task management and progress storage unit. The directory traversal unit is used for acquiring the attribute information of all files in a directory. The file or directory comparison unit is used for comparing directory structure, modification time, and size to obtain the difference calculation descriptors. The task management and progress storage unit is used for recording, for each comparison task, the difference calculation descriptors acquired so far, the task state, and the index up to which the synchronization execution module has read.
Further, when synchronizing data, the synchronization execution module sets the modification time of the corresponding entry in the local node's cached data set to the modification time of that entry in the mounted source data set.
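The reason for propagating the modification time can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names `copy_with_source_mtime` and `looks_unchanged` are hypothetical, and the sketch assumes that "unchanged" means equal modification time and equal size, which is the comparison the file or directory comparison unit is described as making.

```python
import os
import shutil


def copy_with_source_mtime(src_path: str, dst_path: str) -> None:
    """Copy file contents, then stamp the copy with the source's modify time."""
    shutil.copyfile(src_path, dst_path)
    st = os.stat(src_path)
    # Nanosecond timestamps avoid the rounding that float seconds can introduce.
    os.utime(dst_path, ns=(st.st_atime_ns, st.st_mtime_ns))


def looks_unchanged(src_path: str, dst_path: str) -> bool:
    """Cheap equality test on modify time and size (assumed comparison rule)."""
    s, d = os.stat(src_path), os.stat(dst_path)
    return (s.st_mtime_ns, s.st_size) == (d.st_mtime_ns, d.st_size)
```

Without the `os.utime` call the copy would carry the time of the copy itself, so the next comparison would report every file as changed even though the contents are identical.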
A synchronization method for the above device defines difference calculation descriptors and the corresponding synchronization operation types, and acquires the descriptors and executes the synchronization operations in an incremental, parallel manner. When an update of the data set is triggered, it is determined whether the corresponding cached data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, difference calculation descriptors are acquired incrementally, in parallel with the execution of the synchronization operations, until the whole data set is synchronized. Throughout this process the mounted source data set is the reference and the local node cache is the destination path.
The method specifically comprises the following steps:
step 1, defining the difference calculation descriptors for directory comparison and the corresponding synchronization operation types;
step 2, setting the number of directory comparison threads and the number of synchronization execution threads according to the resources allocated to the task and the size of the data set;
step 3, handling the special cases of the input srcUrl and dstUrl: if dstUrl does not exist, copying directly using multiple threads; if either srcUrl or dstUrl is not a directory, copying srcUrl using multiple threads and deleting the existing dstUrl;
step 4, when both srcUrl and dstUrl are directories, incrementally and in parallel acquiring the difference calculation descriptors from directory comparison and executing the corresponding synchronization operations: the directory comparison module updates the task schedule layer by layer and directory by directory, the synchronization execution module requests the task's current difference calculation descriptors whenever it is idle and executes them, and the two modules run in parallel.
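The special cases of step 3 can be sketched in Python as follows. This is an illustrative sketch only: the helper names (`remove_path`, `copy_tree_parallel`, `handle_special_cases`) and the default of four worker threads are assumptions, and the sketch deletes the stale destination before copying, an ordering the text does not specify.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def remove_path(path):
    """Delete a file, a symlink, or a whole directory tree."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
    else:
        os.remove(path)


def copy_tree_parallel(src, dst, workers=4):
    """Copy a file or a directory tree; file copies run on a thread pool."""
    if not os.path.isdir(src):
        shutil.copy2(src, dst)  # copy2 also copies the modify time
        return
    jobs = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = dst if rel == "." else os.path.join(dst, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            jobs.append((os.path.join(root, name), os.path.join(target_dir, name)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises the first error, if any.
        list(pool.map(lambda job: shutil.copy2(*job), jobs))


def handle_special_cases(src_url, dst_url, workers=4):
    """Return True if step 3 handled the pair, False if step 4 must run."""
    if not os.path.lexists(dst_url):
        copy_tree_parallel(src_url, dst_url, workers)
        return True
    if not (os.path.isdir(src_url) and os.path.isdir(dst_url)):
        remove_path(dst_url)  # assumed ordering: drop the stale copy first
        copy_tree_parallel(src_url, dst_url, workers)
        return True
    return False  # both are directories: compare them incrementally
```

Note that this sketch preserves the modification time of files through `shutil.copy2` but does not restore directory modification times, which a full implementation of the scheme would also need to set.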
The invention has the following beneficial effects: by using the attribute information that the operating system already keeps for each file and directory, different kinds of differences found during directory comparison can be handled differently, which keeps the node's cached data set consistent with the mounted data set while reducing the use of resources such as network and I/O; and CPU resources and the corresponding operations can be configured more flexibly when data sets are synchronized across the server cluster, so that synchronization is accelerated in a controllable way.
Drawings
FIG. 1 is an architecture diagram of a preferred embodiment of a data synchronization apparatus on a server cluster according to the present invention;
FIG. 2 is a table showing a comparison between a difference calculation descriptor and an execution synchronization type in a synchronization method of a data synchronization apparatus on a server cluster;
FIG. 3 is a flowchart illustrating a directory comparison process in a synchronization method of a data synchronization apparatus on a server cluster;
FIG. 4 is a flowchart illustrating incremental parallel acquisition of the difference calculation descriptors and synchronous execution in a synchronization method of a data synchronization apparatus on a server cluster.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention can be defined more clearly.
An embodiment of the invention comprises the following:
A data synchronization apparatus on a server cluster, as shown in FIG. 1, avoids full copying when the server cluster synchronizes a data set by defining difference calculation descriptors and the corresponding synchronization operation types and by acquiring the descriptors and executing the synchronization in an incremental, parallel manner. The apparatus comprises: a definition module, used for defining the difference calculation descriptors and the corresponding synchronization operation types; a directory comparison module, used for comparing the mounted data set with the local node's cached data set to find their differences; a synchronization execution module, used for executing the operation corresponding to each difference calculation descriptor so as to synchronize data from the source end to the destination end; and a resource allocation and control module, used for setting the resource usage of the directory comparison module and of the synchronization execution module according to the resources available to the user task and the size of the data set, and for controlling the reclamation of those resources.
A synchronization method of the data synchronization device on a server cluster comprises: defining difference calculation descriptors and the corresponding synchronization operation types; a directory comparison process; and incremental, parallel acquisition of the descriptors and execution of the synchronization. When an update of the data set is triggered, it is determined whether the corresponding cached data set exists on the local server node; if it does not exist, the data set is copied directly using multiple threads; if the cache path exists, difference calculation descriptors are acquired incrementally, in parallel with the execution of the synchronization operations, until the whole data set is synchronized. During synchronization, the mounted source data set is the reference and the local node cache is the destination path.
Referring to FIG. 2, the defined difference calculation descriptors and the corresponding synchronization operation types are as follows:
(1) When the local node's cached data set does not exist, the data set is copied directly using multiple threads; the number of threads is set dynamically by the user according to the available resources and the size of the data set. When the mounted source data set and the local node's cached data set are files, the mounted source data set is copied directly and the existing local cached copy is deleted.
(2) When the mounted source data set and the local node's cached data set are directories, the data sets at the two ends are compared, and each differing url is output together with its calculation descriptor. If the descriptor of a url is ADD_TYPE, the url does not exist under the local node's cache path, and it is copied directly using multiple threads; if the descriptor is DELETE_TYPE, the url has been deleted from the mounted source data set, and the corresponding url in the local node cache is deleted directly; if the descriptor is FILE_CHANGED_TYPE, the file differs between the source end and the destination end, and the source file is copied over the corresponding file in the local node cache; if the descriptor is DIR_CHANGED_TYPE, the directory is a leaf directory or a mixed directory, and it is synchronized incrementally using rsync.
(3) The whole directory comparison and synchronization execution process runs in an incremental, parallel manner. The comparison progress is updated layer by layer and directory by directory; as soon as a synchronization thread is idle, it fetches the calculation descriptors available so far and executes the synchronization operations immediately, rather than waiting until all difference calculation descriptors have been generated.
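The mapping from descriptor to synchronization operation in item (2) can be written as a small dispatch function. The Python sketch below only plans the operation and does not touch the file system. The four descriptor names come from the text; the `DiffType` enum, the `plan_sync` function, the operation labels, and the choice of `rsync -a --delete` are assumptions made for illustration, since the text names rsync but not its options.

```python
from enum import Enum


class DiffType(Enum):
    ADD_TYPE = "add"                    # url missing from the local cache
    DELETE_TYPE = "delete"              # url removed from the mounted source
    FILE_CHANGED_TYPE = "file_changed"  # file differs between the two ends
    DIR_CHANGED_TYPE = "dir_changed"    # leaf or mixed directory differs


def plan_sync(diff_type, src_url, dst_url):
    """Return (operation, arguments) for one difference record."""
    if diff_type is DiffType.ADD_TYPE:
        return ("copy", [src_url, dst_url])
    if diff_type is DiffType.DELETE_TYPE:
        return ("delete", [dst_url])
    if diff_type is DiffType.FILE_CHANGED_TYPE:
        return ("overwrite", [src_url, dst_url])
    if diff_type is DiffType.DIR_CHANGED_TYPE:
        # Trailing slashes make rsync sync the directory contents; the
        # -a and --delete options are this sketch's assumption.
        command = ["rsync", "-a", "--delete",
                   src_url.rstrip("/") + "/", dst_url.rstrip("/") + "/"]
        return ("run", command)
    raise ValueError(f"unknown descriptor: {diff_type!r}")
```

Keeping the planning step separate from execution makes it easy for a synchronization thread to pick up a record, look up its operation, and run it without knowing how the record was produced.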
The specific implementation process is as follows:
S1, when a data set update operation is triggered, srcUrl, dstUrl, and the available CPU resources are given as input. When dstUrl does not exist, srcUrl is copied directly using multiple threads and the process exits. All copy operations in this document receive special handling: after the contents are synchronized, the attributes of the url in the node's cached data set are set according to the attributes of the corresponding url in the mounted source data set.
S2, when dstUrl exists, srcUrl and dstUrl are compared for consistency; the directory comparison flow is shown in FIG. 3. If the two are consistent, no update is needed and the process exits directly; if not, the add-comparison-task interface of the directory comparison module is called and the returned taskId is stored.
S3, when srcUrl and dstUrl are of different types, for example srcUrl is a directory and dstUrl is a file, srcUrl and dstUrl are added directly to the difference record table with the difference calculation descriptors ADD_TYPE and DELETE_TYPE respectively, and the synchronization execution module is called to perform the synchronization.
S4, when srcUrl and dstUrl are of the same type and are both directories: if the modification times of srcUrl and dstUrl are the same, srcUrl is traversed. In this scheme the directory is traversed by invoking the system call directly, so the buffer size can be set according to the actual directory; in some cases this performs better than the traversal functions provided by the system. Subdirectories are added directly to the task queue. For the files in the directory, the modification time and file size are compared with those of the corresponding destination files; if they are not equal, the url of the file is added to the difference record table with the difference calculation descriptor FILE_CHANGED_TYPE. After srcUrl has been traversed, the elements of the difference record table are synchronized into the task schedule.
S5, if the modification times of srcUrl and dstUrl are different, dstUrl is traversed first:
S51, if dstUrl is a leaf directory or a mixed directory, srcUrl is added directly to the difference record table with the difference calculation descriptor DIR_CHANGED_TYPE; the comparison of this directory then ends and the task schedule is updated;
S52, if the dstUrl directory is empty, srcUrl is added directly to the difference record table with the difference calculation descriptor ADD_TYPE; the comparison of this directory then ends and the task schedule is updated;
S53, if dstUrl is a pure directory, that is, all of its children are directories, the names of all its subdirectories are recorded in a hash table, and then the srcUrl directory is traversed:
S531, if srcUrl contains any non-directory entry, srcUrl is added directly to the difference record table with the difference calculation descriptor DIR_CHANGED_TYPE; the comparison of this directory then ends and the task schedule is updated;
S532, if srcUrl is a pure directory, each subdirectory of srcUrl is looked up in the hash table in turn; if it is present, the subdirectory url is added to the task queue and the flag bit of the corresponding hash table element is set to 1; if it is absent, the subdirectory url is added to the difference record table with the difference calculation descriptor ADD_TYPE;
S533, after traversal of this directory layer is complete, every url whose element flag bit in the hash table is still 0 is added to the difference record table with the difference calculation descriptor DELETE_TYPE;
S534, the elements of the difference record table are synchronized into the task schedule, and the comparison of this directory ends.
S6, the calculation descriptors are acquired and the synchronization operations executed in an incremental, parallel manner. When a thread performing synchronization is idle, it uses the taskId from step S2 to query the directory comparison task schedule, which stores the progress of every comparison task, that is, the difference calculation descriptors obtained so far. Meanwhile, the directory comparison threads acquire the difference calculation descriptors between the mounted data set and the node's cached data set layer by layer and directory by directory, and update the schedule incrementally. The detailed flow of incremental parallel difference acquisition and execution is shown in FIG. 4.
S7, the task queue is traversed and compared recursively, jumping back to step S4, until the task queue is empty and all difference calculation descriptors have been executed, at which point the update of the whole data set is complete.
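The incremental, parallel flow of steps S6 and S7 resembles a producer and consumer arrangement around a shared progress record. The Python sketch below illustrates that idea only. `TaskProgress` and `run_pipeline` are hypothetical names, the comparison side is simulated by a list of per-layer record batches, and a condition variable stands in for the idle-time polling described in the text.

```python
import threading


class TaskProgress:
    """Difference records found so far, the read index, and the task state."""

    def __init__(self):
        self._records = []
        self._read_index = 0
        self._finished = False
        self._cond = threading.Condition()

    def publish(self, records):
        """Called by the comparison side after each directory layer."""
        with self._cond:
            self._records.extend(records)
            self._cond.notify_all()

    def finish(self):
        with self._cond:
            self._finished = True
            self._cond.notify_all()

    def next_record(self):
        """Block until a record is ready; None once finished and drained."""
        with self._cond:
            while self._read_index >= len(self._records) and not self._finished:
                self._cond.wait()
            if self._read_index < len(self._records):
                record = self._records[self._read_index]
                self._read_index += 1
                return record
            return None


def run_pipeline(layers, apply, sync_threads=2):
    """Publish each layer's records while sync threads consume them."""
    progress = TaskProgress()

    def compare():
        try:
            for layer_records in layers:
                progress.publish(layer_records)
        finally:
            progress.finish()  # always release waiting sync threads

    def sync():
        while True:
            record = progress.next_record()
            if record is None:
                return
            apply(record)

    threads = [threading.Thread(target=sync) for _ in range(sync_threads)]
    threads.append(threading.Thread(target=compare))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because records are consumed as soon as they are published, synchronization of the first layers overlaps with comparison of deeper layers instead of waiting for the full difference list.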
Here, url (Uniform Resource Locator) refers to the path of a file or a directory; srcUrl (source url) refers to the path at which the source data set is mounted; dstUrl (destination url) refers to the path of the local node's cached data set.
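The per-directory comparison rules of steps S4 and S5 can be sketched in Python for a single directory level. This is an illustration under stated assumptions rather than the patented implementation: `kind`, `same_file_attrs`, and `compare_level` are hypothetical names, entries are listed with `os.listdir` rather than a raw system call, and a file missing from the destination is treated as changed, a case the text does not cover.

```python
import os

ADD, DELETE = "ADD_TYPE", "DELETE_TYPE"
FILE_CHANGED, DIR_CHANGED = "FILE_CHANGED_TYPE", "DIR_CHANGED_TYPE"


def kind(path):
    """Classify a directory: 'empty', 'pure' (only subdirectories), 'has_files'."""
    names = os.listdir(path)
    if not names:
        return "empty"
    if all(os.path.isdir(os.path.join(path, n)) for n in names):
        return "pure"
    return "has_files"  # leaf or mixed directory


def same_file_attrs(a, b):
    """True when both files exist with equal modify time and size."""
    try:
        sa, sb = os.stat(a), os.stat(b)
    except FileNotFoundError:
        return False  # sketch's own choice: a missing peer counts as changed
    return (sa.st_mtime_ns, sa.st_size) == (sb.st_mtime_ns, sb.st_size)


def compare_level(src, dst):
    """Return (difference records, subdirectory pairs to queue) for one level."""
    diffs, queue = [], []
    if os.stat(src).st_mtime_ns == os.stat(dst).st_mtime_ns:
        # S4: same directory modify time, so only file attributes are checked.
        for name in sorted(os.listdir(src)):
            s, d = os.path.join(src, name), os.path.join(dst, name)
            if os.path.isdir(s):
                queue.append((s, d))
            elif not same_file_attrs(s, d):
                diffs.append((FILE_CHANGED, s))
        return diffs, queue
    dst_kind = kind(dst)
    if dst_kind == "has_files":                      # S51
        return [(DIR_CHANGED, src)], []
    if dst_kind == "empty":                          # S52
        return [(ADD, src)], []
    seen = {name: 0 for name in sorted(os.listdir(dst))}  # S53: flag bits
    if kind(src) == "has_files":                     # S531
        return [(DIR_CHANGED, src)], []
    for name in sorted(os.listdir(src)):             # S532
        if name in seen:
            seen[name] = 1
            queue.append((os.path.join(src, name), os.path.join(dst, name)))
        else:
            diffs.append((ADD, os.path.join(src, name)))
    for name, flag in seen.items():                  # S533
        if flag == 0:
            diffs.append((DELETE, os.path.join(dst, name)))
    return diffs, queue
```

A caller would append the returned records to the task schedule and push the returned directory pairs onto the task queue, then repeat for each queued pair until the queue is empty.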
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An apparatus for data synchronization on a cluster of servers, comprising:
a definition module, used for defining the difference calculation descriptors and the corresponding synchronization operation types;
a directory comparison module, used for comparing the mounted data set with the local node's cached data set to find their differences;
and a synchronization execution module, used for executing the operation corresponding to each difference calculation descriptor so as to synchronize data from the source end to the destination end.
2. The device for data synchronization on a server cluster according to claim 1, further comprising a resource allocation and control module, respectively setting the occupation of each resource of the directory comparison module and the synchronous execution module according to the available resource of the user task and the size of the data set, and controlling the recovery of the resource.
3. The apparatus of claim 2, wherein the apparatus for synchronizing data on a server cluster comprises: the directory comparison module comprises a directory traversal unit, a file or directory comparison unit and a task management and progress storage unit.
4. The apparatus of claim 3, wherein the apparatus for data synchronization on the server cluster comprises: the directory traversal unit is used for acquiring the attribute information of all files in the directory.
5. The apparatus of claim 3, wherein the apparatus for data synchronization on the server cluster comprises: and the file or directory comparison unit is used for comparing the directory structure, the modify time and the size to obtain the difference calculation descriptor.
6. The apparatus of claim 3, wherein the apparatus for data synchronization on the server cluster comprises: the task management and progress storage unit is used for recording the difference calculation descriptor currently acquired by each comparison task, the task state and the index read by the synchronous execution module.
7. The apparatus of claim 5, wherein the apparatus for synchronizing data on a server cluster comprises: and when the synchronous execution module synchronizes data, updating the modify time of the local node corresponding to the cache data set according to the mount source data set modify time.
8. The method for synchronizing data on a server cluster according to any one of the preceding claims, comprising: defining a difference calculation descriptor and executing a synchronous type, and incrementally and parallelly acquiring the difference calculation descriptor and synchronously executing; when the data set is triggered to be updated, judging whether a corresponding cache data set on a local server node exists or not, and directly copying by utilizing multithreading if the corresponding cache data set does not exist; if the cache path exists, incremental parallel obtaining of the difference computation descriptor and synchronous execution operation are carried out until the whole data set is synchronized.
9. The method of claim 8, wherein the data synchronization method comprises: and when the increment obtains the difference calculation descriptor in parallel and synchronously executes the operation, the local node cache is used as a target path on the basis of mounting the source data set.
10. The method according to claim 9, comprising the following steps:
step 1, defining the difference calculation descriptors for directory comparison and the corresponding synchronization operation types;
step 2, setting the number of directory comparison threads and the number of synchronization execution threads according to the resources allocated to the task and the size of the data set;
step 3, handling the special cases of the input srcUrl and dstUrl: if dstUrl does not exist, copying directly using multiple threads; if either srcUrl or dstUrl is not a directory, copying srcUrl using multiple threads and deleting the existing dstUrl;
step 4, when both srcUrl and dstUrl are directories, incrementally and in parallel acquiring the difference calculation descriptors from directory comparison and executing the corresponding synchronization operations: the directory comparison module updates the task schedule layer by layer and directory by directory, the synchronization execution module requests the task's current difference calculation descriptors whenever it is idle and executes them, and the two modules run in parallel.
CN202010877410.XA 2020-08-27 2020-08-27 Data synchronization device on server cluster and synchronization method thereof Pending CN112069256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010877410.XA CN112069256A (en) 2020-08-27 2020-08-27 Data synchronization device on server cluster and synchronization method thereof

Publications (1)

Publication Number Publication Date
CN112069256A true CN112069256A (en) 2020-12-11

Family

ID=73659472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010877410.XA Pending CN112069256A (en) 2020-08-27 2020-08-27 Data synchronization device on server cluster and synchronization method thereof

Country Status (1)

Country Link
CN (1) CN112069256A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641756A (en) * 2021-07-26 2021-11-12 浪潮卓数大数据产业发展有限公司 Distributed high-concurrency data storage method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740418A (en) * 2016-01-29 2016-07-06 杭州亿方云网络科技有限公司 File monitoring and message pushing based real-time synchronization system
CN105975502A (en) * 2016-04-25 2016-09-28 南京优测信息科技有限公司 Method for realizing incremental data extract based on CDC (Change Data Capture) mode
CN106980625A (en) * 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 A kind of method of data synchronization, apparatus and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201211