CN118210770A - Incremental data synchronization method, device, computer equipment and storage medium

Info

Publication number: CN118210770A
Application number: CN202410265314.8A
Authority: CN
Inventors: 余剑; 杨维敏; 杜鹏; 马立珂; 王子骏
Original assignee: Anhui Dingjia Computer Technology Co ltd
Current assignee: Anhui Dingjia Computer Technology Co ltd
Priority date: 2024-03-08
Filing date: 2024-03-08
Publication date: 2024-06-18

Abstract

The application relates to an incremental data synchronization method and device, computer equipment, and a storage medium, in the technical field of data backup. The method comprises: partitioning an incremental file based on a variable window and tangent points to obtain a plurality of pieces of block data, and determining fingerprint data corresponding to each piece of block data; generating first fingerprint metadata of the incremental file based on the file data and the fingerprint data of the incremental file; and, when second fingerprint metadata of the incremental file is the same as third fingerprint metadata of the incremental file, determining target incremental data according to the respective fingerprint data groups of the first fingerprint metadata and the second fingerprint metadata, and synchronizing the target incremental data to a server. The second fingerprint metadata is the fingerprint metadata of the incremental file as last synchronized in a local cache; the third fingerprint metadata is the fingerprint metadata of the incremental file as last synchronized in the server. The method can improve the efficiency of file synchronization between the client and the server.

Description

Incremental data synchronization method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of data backup technology, and in particular, to an incremental data synchronization method, an incremental data synchronization device, a computer device, a storage medium, and a computer program product.

Background

As the volume of personal data has grown explosively, more and more users choose to store or back up data in the cloud. Generally, a user stores data on a local device such as a personal computer or a mobile phone, and then backs up the data to a cloud storage center through synchronization. The related art adopts incremental synchronization, which detects the changed data of a file and uploads only the changed part of the file in each synchronous transmission. However, when existing incremental synchronization techniques synchronize data from a client to a server, the changed file data must first be determined, which requires multiple rounds of communication between the client and the server. This increases the computing resource consumption and IO load of the server and results in insufficient efficiency of file synchronization between the client and the server.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, apparatus, computer device, computer readable storage medium, and computer program product for incremental data synchronization.

In a first aspect, the present application provides a method for incremental data synchronization. The method comprises the following steps:

Partitioning the incremental file based on the variable window and the tangent point to obtain a plurality of partitioned data, and determining fingerprint data corresponding to each partitioned data; the variable window and the tangent point are determined based on a block strategy corresponding to the incremental file;

Generating first fingerprint metadata of the delta file based on file data of the delta file and the fingerprint data; the first fingerprint metadata comprises summary information of the delta file and a fingerprint data group formed by the fingerprint data;

Under the condition that the second fingerprint metadata of the incremental file are the same as the third fingerprint metadata of the incremental file, determining target incremental data according to respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata, and synchronizing the target incremental data to a server; the second fingerprint metadata is the fingerprint metadata of the delta file last synchronized in a local cache; the third fingerprint metadata is the fingerprint metadata of the delta file last synchronized in the server.

In one embodiment, the partitioning processing is performed on the delta file based on the variable window and the tangent point to obtain a plurality of partitioned data, and determining fingerprint data corresponding to each partitioned data includes:

Sequentially traversing byte sequences corresponding to the increment files based on the window size of the variable window to obtain current byte data contained in the variable window; the window size is dynamically changed based on the maximum value and entropy corresponding to the current byte data;

determining a tangent point corresponding to the current byte data under the condition that the current byte data meets a preset partitioning strategy, so as to obtain a plurality of tangent points corresponding to the byte sequence;

partitioning the incremental file based on a plurality of the tangent points to obtain a plurality of partitioned data corresponding to the incremental file;

and generating fingerprints of the plurality of block data based on a hash algorithm to obtain fingerprint data corresponding to the block data.

In one embodiment, the summary information of the incremental file includes directory summary information and data summary information of the incremental file; the generating the first fingerprint metadata of the incremental file based on the file data of the incremental file and the fingerprint data includes:

obtaining directory summary information corresponding to the incremental file; the directory summary information is obtained by calculating a root node path of the incremental file based on an MD5 algorithm;

acquiring data summary information corresponding to the incremental file; the data summary information is obtained by calculating file data of the incremental file based on an MD5 algorithm;

determining a fingerprint data group corresponding to the fingerprint data based on a segmentation window, a tangent point and a segmentation order corresponding to each fingerprint data;

and combining the directory summary information, the data summary information and the fingerprint data to obtain the first fingerprint metadata of the incremental file.

In one embodiment, before determining the target delta data according to the fingerprint data set of each of the first fingerprint metadata and the second fingerprint metadata, the method further includes:

acquiring directory summary information and data summary information of the second fingerprint metadata, and acquiring directory summary information and data summary information of the third fingerprint metadata, under the condition that the version number in the second fingerprint metadata is the same as the version number in the third fingerprint metadata;

and if the two pieces of directory summary information are the same and the two pieces of data summary information are the same, determining that the second fingerprint metadata is the same as the third fingerprint metadata.

In one embodiment, the determining the target delta data according to the fingerprint data set of each of the first fingerprint metadata and the second fingerprint metadata includes:

Determining updated fingerprint data corresponding to the first fingerprint metadata based on the fingerprint data set in the first fingerprint metadata and the fingerprint data set in the second fingerprint metadata; the updated fingerprint data is used to represent fingerprint data of a change in the fingerprint data set of the first fingerprint metadata relative to the fingerprint data set of the second fingerprint metadata;

And determining file data corresponding to the updated fingerprint data in the increment file, and taking the file data as target increment data.

In one embodiment, the synchronizing the target incremental data to the server includes:

And sending the target incremental data to the server to trigger the server to receive the target incremental data under the condition that the fingerprint data of the target incremental data are not stored in a fingerprint database of the server.

In one embodiment, after the synchronizing the target incremental data to the server, the method further includes:

and synchronizing the first fingerprint metadata to the local cache, and updating the second fingerprint metadata to obtain updated second fingerprint metadata.

In a second aspect, the application further provides an incremental data synchronization device. The device comprises:

The file blocking module is used for blocking the increment file based on the variable window and the tangent point to obtain a plurality of blocking data and determining fingerprint data corresponding to each blocking data; the variable window and the tangent point are determined based on a block strategy corresponding to the incremental file;

The metadata generation module is used for generating first fingerprint metadata of the incremental file based on file data of the incremental file and the fingerprint data; the first fingerprint metadata comprises summary information of the delta file and a fingerprint data group formed by the fingerprint data;

The data synchronization module is used for determining target incremental data according to respective fingerprint data groups of the first fingerprint metadata and the second fingerprint metadata under the condition that the second fingerprint metadata of the incremental file are identical to the third fingerprint metadata of the incremental file, and synchronizing the target incremental data to a server; the second fingerprint metadata is the fingerprint metadata of the delta file last synchronized in a local cache; the third fingerprint metadata is the fingerprint metadata of the delta file last synchronized in the server.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the method according to the first aspect when the processor executes the computer program.

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.

In a fifth aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.

In the incremental data synchronization method, device, computer equipment, storage medium and computer program product described above, the incremental file is partitioned by means of the variable window and the tangent points to obtain a plurality of pieces of block data, and the fingerprint data corresponding to each piece of block data is determined. The fingerprint data is summary information of the block data and changes when the block data changes. The client generates first fingerprint metadata corresponding to the incremental file according to the file data and the fingerprint data of the incremental file; the first fingerprint metadata comprises the summary information of the incremental file and the fingerprint data of each piece of block data in the incremental file. Because the summary information and the fingerprint data change whenever the incremental file changes, the client can determine whether the incremental file has changed by comparing fingerprint metadata. In addition, the client has a local cache storing second fingerprint metadata, and the server stores third fingerprint metadata. When the second fingerprint metadata and the third fingerprint metadata are the same, the target incremental data can be determined from the differences between the fingerprint data groups in the first fingerprint metadata and the second fingerprint metadata, and the target incremental data is sent to the server. Because the client and the server only need to compare whether the second fingerprint metadata and the third fingerprint metadata are the same, the number of communications between the client and the server is reduced, and the computing resource consumption and IO load of the server are reduced, so that the efficiency of file synchronization between the client and the server is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is an application environment diagram of an incremental data synchronization method in one embodiment;

FIG. 2 is a flow chart of a method of incremental data synchronization in one embodiment;

FIG. 3 is a flow chart illustrating the steps of determining fingerprint data in one embodiment;

FIG. 4 is a flowchart illustrating the steps of generating first fingerprint metadata in one embodiment;

FIG. 5 is a flow chart of a method of incremental data synchronization according to another embodiment;

FIG. 6 is a flowchart illustrating a step of determining fingerprint data according to another embodiment;

FIG. 7 is a flowchart illustrating a step of determining fingerprint data in yet another embodiment;

FIG. 8 is a schematic diagram of incremental data synchronization in one embodiment;

FIG. 9 is a block diagram of an incremental data synchronizer in one embodiment;

FIG. 10 is an internal structural view of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The incremental data synchronization method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. The client communicates with the server through a network. The client can upload the target incremental data and the fingerprint data corresponding to the incremental file to the server, and the server determines, by checking its fingerprint database, whether the incremental data corresponding to the fingerprint data is already stored in the database of the server, and thereby determines whether to receive the target incremental data. The client can partition the incremental file into blocks, process the block data into fingerprint data, and obtain the first fingerprint metadata corresponding to the incremental file. The local cache stores the second fingerprint metadata from the last update. The client may receive the third fingerprint metadata stored in the server and determine whether the fingerprint metadata of the client is valid by comparing the second fingerprint metadata in the client with the third fingerprint metadata in the server. The client can determine the target incremental data by comparing the first fingerprint metadata and the second fingerprint metadata corresponding to the incremental file. The client may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server may be implemented as a stand-alone server or as a server cluster formed by a plurality of servers.

In an exemplary embodiment, as shown in FIG. 2, an incremental data synchronization method is provided. The method is described here, by way of illustration, as applied to the client in FIG. 1, and includes the following steps S202 to S206. Wherein:

Step S202, partitioning the incremental file based on the variable window and the tangent points to obtain a plurality of pieces of block data, and determining fingerprint data corresponding to each piece of block data.

The variable window is a segmentation window used to segment the file data of the incremental file, and its size can change dynamically during segmentation. The tangent points (that is, the cut points) mark the starting point and the ending point of the variable window; segmenting at the tangent points yields block data whose size corresponds to the variable window. The variable window and the tangent points are determined based on a partitioning strategy corresponding to the incremental file, and the partitioning strategy determines the window size of the variable window and the specific positions of the tangent points. The fingerprint data uniquely represents the corresponding block data and may be a signature value of the block data.

Specifically, the client obtains the file data of the incremental file, which may include a plurality of bytes. Following the order of the file data and the partitioning strategy, the client traverses the file data from beginning to end, sequentially determines the tangent points corresponding to each segment of bytes based on the variable window to obtain a plurality of tangent points, and partitions the file data at those tangent points to obtain a plurality of pieces of block data. The client then calculates the signature value of each piece of block data to obtain the fingerprint data corresponding to each piece of block data.

Step S204, generating first fingerprint metadata of the incremental file based on the file data and the fingerprint data of the incremental file.

Wherein the first fingerprint metadata includes summary information of the delta file and a fingerprint data set composed of fingerprint data. Summary information may be used to uniquely identify the delta file. The summary information may include a version number of the delta file, an update timestamp, directory summary information, and data summary information.

Specifically, the client determines the summary information of the incremental file according to the file data of the incremental file, and merges the fingerprint data to obtain the fingerprint data set corresponding to the incremental file. The client then packages the summary information of the incremental file and the fingerprint data set to obtain the first fingerprint metadata.

Step S206, when the second fingerprint metadata of the incremental file is the same as the third fingerprint metadata of the incremental file, determining the target incremental data according to the respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata, and synchronizing the target incremental data to the server.

Wherein the second fingerprint metadata is the fingerprint metadata of the delta file last synchronized in the local cache; the third fingerprint metadata is the fingerprint metadata of the delta file last synchronized in the server. The first fingerprint metadata, the second fingerprint metadata and the third fingerprint metadata have the same data structure and are all fingerprint metadata determined based on the incremental file.

Specifically, the client obtains the second fingerprint metadata from the local cache and the third fingerprint metadata stored in the server, and compares the two to determine whether they are identical. If the second fingerprint metadata and the third fingerprint metadata are identical, the differences between the first fingerprint metadata and the third fingerprint metadata can be determined by determining the differences between the first fingerprint metadata and the second fingerprint metadata. On this basis, the client acquires the fingerprint data set in the first fingerprint metadata and the fingerprint data set in the second fingerprint metadata, determines which fingerprint data differ by comparing the fingerprint data in the two sets, and determines the corresponding target incremental data based on the differing fingerprint data. The client sends the target incremental data and the fingerprint data corresponding to the target incremental data to the server, and the server receives the target incremental data and stores it in the database.
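By way of non-limiting illustration, the comparison of the two fingerprint data sets can be sketched in Python as follows. The packet layout `(start, length, fingerprint)` and the function name `target_delta` are assumptions made for this example and do not come from the application.

```python
from typing import List, Tuple

# Hypothetical packet layout: (start offset, byte length, fingerprint).
Packet = Tuple[int, int, str]


def target_delta(data: bytes,
                 first_set: List[Packet],
                 second_set: List[Packet]) -> List[Tuple[str, bytes]]:
    """Return (fingerprint, block bytes) for blocks that are new in first_set.

    A block counts as changed when its fingerprint does not appear anywhere
    in the previously synchronized (second) fingerprint data set.
    """
    known = {fp for _, _, fp in second_set}
    seen = set()
    delta = []
    for start, length, fp in first_set:
        if fp not in known and fp not in seen:
            seen.add(fp)  # send each distinct changed block only once
            delta.append((fp, data[start:start + length]))
    return delta
```

In this sketch only the bytes of the changed blocks, together with their fingerprints, would be handed to the upload step.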

Optionally, the server receives fingerprint data corresponding to the target incremental data and stores the fingerprint data in the fingerprint database.

Optionally, the server receives the first fingerprint metadata, and updates the third fingerprint metadata based on the first fingerprint metadata, so as to obtain updated third fingerprint metadata.
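By way of non-limiting illustration, the server-side handling described above (checking the fingerprint database before storing a block, then updating the third fingerprint metadata) might look like the following Python sketch. The class `ServerStore`, its method `receive`, and the in-memory dictionaries are hypothetical stand-ins for whatever storage the server actually uses.

```python
from typing import Any, Dict, Iterable, List, Tuple


class ServerStore:
    """In-memory stand-in for the server's fingerprint database (assumption)."""

    def __init__(self) -> None:
        self.fingerprint_db: Dict[str, bytes] = {}           # fingerprint -> block bytes
        self.third_metadata: Dict[str, Dict[str, Any]] = {}  # file id -> metadata

    def receive(self,
                file_id: str,
                blocks: Iterable[Tuple[str, bytes]],
                first_metadata: Dict[str, Any]) -> List[str]:
        """Store only blocks whose fingerprint is not yet known.

        Returns the fingerprints that were actually stored.
        """
        stored = []
        for fp, block in blocks:
            if fp not in self.fingerprint_db:
                self.fingerprint_db[fp] = block
                stored.append(fp)
        # The first fingerprint metadata becomes the new third fingerprint metadata.
        self.third_metadata[file_id] = first_metadata
        return stored
```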

Optionally, when the second fingerprint metadata of the incremental file is different from the third fingerprint metadata of the incremental file, the client receives the third fingerprint metadata sent by the server, and updates the second fingerprint metadata in the local cache based on the third fingerprint metadata, so as to obtain updated second fingerprint metadata.

In the incremental data synchronization method described above, the incremental file is partitioned by means of the variable window and the tangent points to obtain a plurality of pieces of block data, and the fingerprint data corresponding to each piece of block data is determined. The fingerprint data is summary information of the block data and changes when the block data changes. The client generates first fingerprint metadata corresponding to the incremental file according to the file data and the fingerprint data of the incremental file; the first fingerprint metadata comprises the summary information of the incremental file and the fingerprint data of each piece of block data in the incremental file. Because the summary information and the fingerprint data change whenever the incremental file changes, the client can determine whether the incremental file has changed by comparing fingerprint metadata. In addition, the client has a local cache storing second fingerprint metadata, and the server stores third fingerprint metadata. When the second fingerprint metadata and the third fingerprint metadata are the same, the target incremental data can be determined from the differences between the fingerprint data sets in the first fingerprint metadata and the second fingerprint metadata, and the target incremental data is sent to the server. Because the client and the server only need to compare whether the second fingerprint metadata and the third fingerprint metadata are the same, the number of communications between the client and the server is reduced, and the computing resource consumption and IO load of the server are reduced, so that the efficiency of file synchronization between the client and the server is improved.

In an exemplary embodiment, as shown in FIG. 3, the step of "partitioning the incremental file based on the variable window and the tangent points to obtain a plurality of pieces of block data, and determining fingerprint data corresponding to each piece of block data" includes steps S302 to S308. Wherein:

Step S302, sequentially traversing the byte sequence corresponding to the incremental file based on the window size of the variable window to obtain the current byte data contained in the variable window.

Wherein the window size is dynamically changed based on the maximum value and entropy corresponding to the current byte data.

Specifically, the client acquires the byte sequence of the incremental file and traverses it starting from the first byte. The window size of the variable window determines the current byte data that is selected; the number of bytes in the current byte data equals the window size.

Step S304, determining the tangent point corresponding to the current byte data under the condition that the current byte data meets a preset partitioning strategy, so as to obtain a plurality of tangent points corresponding to the byte sequence.

Specifically, the client calculates the maximum value in the current byte data and the entropy value corresponding to the current byte data, and determines whether the maximum value and the entropy value satisfy the condition for setting a tangent point. If the condition is satisfied, tangent points are set at the starting point and the ending point of the variable window. For example, if the maximum value is less than a first threshold and/or the entropy value is less than a second threshold, it is determined that the maximum value and the entropy value satisfy the condition for setting a tangent point. In one example, the tangent point corresponding to the starting point of the variable window need not satisfy the condition for setting a tangent point.
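By way of non-limiting illustration, the condition for setting a tangent point can be sketched in Python as follows. The application does not give values for the first and second thresholds, nor does it say which entropy measure is used; the threshold constants, the use of Shannon entropy in bits per byte, and the choice of the "or" variant of the condition are all assumptions of this example.

```python
import math
from collections import Counter

# Hypothetical thresholds; the application names them but gives no values.
MAX_THRESHOLD = 200        # "first threshold" on the largest byte value
ENTROPY_THRESHOLD = 3.0    # "second threshold" in bits per byte


def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of the window in bits per byte (0.0 for empty input)."""
    if not window:
        return 0.0
    total = len(window)
    counts = Counter(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def meets_cut_condition(window: bytes,
                        max_threshold: int = MAX_THRESHOLD,
                        entropy_threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Return True if the window qualifies for a tangent point ("or" variant)."""
    if not window:
        return False
    return (max(window) < max_threshold
            or shannon_entropy(window) < entropy_threshold)
```

Replacing `or` with `and` gives the stricter variant in which both the maximum value and the entropy value must fall below their thresholds.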

After determining the pair of tangent points corresponding to the variable window, the client can adjust the window size of the variable window according to the partitioning strategy and the maximum value and entropy value corresponding to the previous variable window. The client then moves the variable window and reads the current byte data starting from the next byte according to the new window size. This tangent point determination process is repeated until the entire byte sequence has been traversed.

In one example, for smaller byte sequences, the window size of the variable window may be incremented or decremented one byte at a time. For larger byte sequences, the window size may be increased or decreased in powers of 2 (for example, by doubling or halving).
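By way of non-limiting illustration, the two adjustment modes can be sketched in Python as follows. The application does not state when the window should grow or shrink, what counts as a "smaller" byte sequence, or what bounds apply, so the `grow` flag, `small_limit`, `min_size` and `max_size` are hypothetical parameters, and doubling or halving is one assumed reading of the power-of-2 rule.

```python
def next_window_size(size: int,
                     grow: bool,
                     sequence_length: int,
                     small_limit: int = 4096,
                     min_size: int = 16,
                     max_size: int = 65536) -> int:
    """Return the adjusted window size, clamped to [min_size, max_size].

    Small sequences change the window one byte at a time; larger sequences
    double or halve it. The caller decides the direction via `grow`.
    """
    if sequence_length <= small_limit:
        candidate = size + 1 if grow else size - 1
    else:
        candidate = size * 2 if grow else size // 2
    return max(min_size, min(max_size, candidate))
```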

Step S306, partitioning the incremental file based on the plurality of tangent points to obtain a plurality of pieces of block data corresponding to the incremental file.

Specifically, the client traverses each pair of tangent points and splits the byte sequence at the two tangent points of the current pair to obtain the current block data. After the client has traversed every pair of tangent points, a plurality of pieces of block data are obtained. In one example, because the client moves the variable window by one byte when determining the tangent points, the resulting pieces of block data may overlap.

Step S308, fingerprint generation is carried out on the plurality of block data based on a hash algorithm, and fingerprint data corresponding to the block data are obtained.

Specifically, the client applies a hash algorithm to each piece of block data and determines a signature value, that is, an identifier, corresponding to each piece of block data. The fingerprint data corresponding to each piece of block data is thereby obtained.
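By way of non-limiting illustration, slicing the byte sequence at pairs of tangent points and fingerprinting each block can be sketched in Python as follows. The application refers only to "a hash algorithm"; SHA-256 from the standard `hashlib` module is an assumption of this example, as are the function names.

```python
import hashlib
from typing import List, Tuple


def blocks_from_pairs(data: bytes, pairs: List[Tuple[int, int]]) -> List[bytes]:
    """Slice data[start:end] for each (start, end) pair of tangent points.

    Pairs may overlap, in which case the resulting blocks overlap too.
    """
    return [data[start:end] for start, end in pairs]


def fingerprint(block: bytes) -> str:
    """Hex signature value of one block (SHA-256 is an assumed choice)."""
    return hashlib.sha256(block).hexdigest()


def fingerprints(blocks: List[bytes]) -> List[str]:
    """Fingerprint data for every block, in segmentation order."""
    return [fingerprint(block) for block in blocks]
```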

In this embodiment, the byte sequence of the incremental file is traversed sequentially, and the entropy and the maximum value of the current byte data are determined for each position of the variable window, so that the tangent points of the variable window at that position are determined and, finally, a plurality of tangent points for the whole byte sequence are obtained. The byte sequence is then partitioned at the plurality of tangent points to obtain a plurality of pieces of block data, and the fingerprint data of each piece of block data is determined through a hash algorithm. Because the window size of the variable window can change in real time, the resistance of the byte sequence to byte drift is improved, and the accuracy of determining the incremental data is improved.

In an exemplary embodiment, the summary information of the incremental file includes directory summary information and data summary information of the incremental file. As shown in FIG. 4, the step of "generating first fingerprint metadata of the incremental file based on the file data and the fingerprint data of the incremental file" includes steps S402 to S408. Wherein:

Step S402, obtaining the directory summary information corresponding to the incremental file.

The directory summary information is obtained by calculating the root node path of the incremental file based on the MD5 (Message-Digest Algorithm 5) algorithm. The root node path refers to the absolute path of the incremental file in the client.

Specifically, the client acquires the root node path of the incremental file, uses the MD5 algorithm to determine the signature value corresponding to the root node path, and takes the signature value as the directory summary information corresponding to the incremental file.

Step S404, obtaining the data summary information corresponding to the incremental file.

The data summary information is obtained by calculating the file data of the incremental file based on the MD5 algorithm.

Specifically, the client acquires the file data of the incremental file, uses the MD5 algorithm to determine the signature value corresponding to the file data, and takes the signature value as the data summary information corresponding to the incremental file.
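By way of non-limiting illustration, the two MD5 digests can be computed in Python with the standard `hashlib` module as follows. The function names and the UTF-8 encoding of the path are assumptions of this example; the application specifies only that MD5 is applied to the root node path and to the file data.

```python
import hashlib


def directory_digest(root_node_path: str) -> str:
    """MD5 hex digest of the file's absolute path (UTF-8 encoding assumed)."""
    return hashlib.md5(root_node_path.encode("utf-8")).hexdigest()


def data_digest(file_data: bytes) -> str:
    """MD5 hex digest of the file's contents."""
    return hashlib.md5(file_data).hexdigest()
```

Both digests are 32-character hexadecimal strings, so they can be compared cheaply when two sets of fingerprint metadata are checked for equality.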

Step S406, determining a fingerprint data set corresponding to the fingerprint data based on the segmentation window, the tangent point and the segmentation order corresponding to each fingerprint data.

The segmentation window is the window used by the client when it segments the file data at the tangent points to obtain the fingerprint data.

Specifically, the client determines the byte length of the block data underlying each piece of fingerprint data based on the window size of the segmentation window, and determines the starting position of the segmentation window based on the tangent point. The client obtains the segmentation order corresponding to each piece of fingerprint data and determines the index number of each piece of fingerprint data. The index number, the starting position, the byte length and the fingerprint data are combined into a complete fingerprint data packet. Finally, a fingerprint data set formed by a plurality of fingerprint data packets is obtained.
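By way of non-limiting illustration, a fingerprint data packet and the resulting fingerprint data set can be modelled in Python as follows. The class name `FingerprintPacket`, its field names, and the use of SHA-256 are assumptions of this example; the application states only that each packet combines an index number, a starting position, a byte length and the fingerprint data.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FingerprintPacket:
    index: int        # position of the block in segmentation order
    start: int        # starting offset given by the tangent point
    length: int       # byte length of the block
    fingerprint: str  # signature value of the block


def build_fingerprint_set(data: bytes,
                          pairs: List[Tuple[int, int]]) -> List[FingerprintPacket]:
    """Build one packet per (start, end) pair of tangent points."""
    packets = []
    for index, (start, end) in enumerate(pairs):
        block = data[start:end]
        packets.append(FingerprintPacket(
            index=index,
            start=start,
            length=len(block),
            fingerprint=hashlib.sha256(block).hexdigest(),  # assumed hash choice
        ))
    return packets
```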

Step S408, combining the directory summary information, the data summary information and the fingerprint data to obtain the first fingerprint metadata of the incremental file.

Specifically, the directory summary information and the data summary information of the incremental file are combined to serve as the summary information of the incremental file, and the summary information is then combined with the fingerprint data to obtain the first fingerprint metadata of the incremental file.
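By way of non-limiting illustration, the combination step can be sketched in Python as follows. The dictionary layout, the key names, and the JSON serialization used for caching are all assumptions of this example; the application does not specify how the first fingerprint metadata is encoded.

```python
import json
from typing import Any, Dict, List


def build_metadata(version: int,
                   updated_at: int,
                   dir_digest: str,
                   data_digest: str,
                   fingerprint_set: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine the summary information with the fingerprint data set."""
    summary = {
        "version": version,
        "updated_at": updated_at,    # update timestamp
        "dir_digest": dir_digest,    # directory summary information
        "data_digest": data_digest,  # data summary information
    }
    return {"summary": summary, "fingerprints": list(fingerprint_set)}


def serialize(metadata: Dict[str, Any]) -> str:
    """Stable text form, for example for writing to a local cache (assumed)."""
    return json.dumps(metadata, sort_keys=True)
```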

In this embodiment, the MD5 algorithm is used to determine the directory summary information and the data summary information of the incremental file, each of which can uniquely represent the incremental file. On this basis, the fingerprint data set, which includes the index number, the starting position, the byte length and the fingerprint data, is combined with the directory summary information and the data summary information to obtain the first fingerprint metadata. The information of the incremental file and the information of each piece of fingerprint data can therefore be determined quickly from the first fingerprint metadata, which improves the efficiency of subsequent comparisons with other fingerprint metadata.

In an exemplary embodiment, before the step of determining the target delta data from the respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata, the method further comprises:

under the condition that the version number in the second fingerprint metadata is the same as that in the third fingerprint metadata, acquiring the directory summary information and data summary information in the second fingerprint metadata, and acquiring the directory summary information and data summary information in the third fingerprint metadata;

and if the two pieces of directory summary information are the same and the two pieces of data summary information are the same, determining that the second fingerprint metadata is the same as the third fingerprint metadata.

The version number is an identification number generated when the fingerprint metadata of the incremental file is determined again after the file data of the incremental file has been updated. In general, the larger the version number, the closer the incremental file is to the latest version.

Specifically, the client acquires the third fingerprint metadata corresponding to the incremental file from the server, and acquires the second fingerprint metadata corresponding to the incremental file from the local cache. The client compares the version number in the second fingerprint metadata with the version number in the third fingerprint metadata. If the two version numbers are the same, the client obtains the directory summary information of the second fingerprint metadata and of the third fingerprint metadata and confirms whether the two are identical. Likewise, it obtains the data summary information of the second fingerprint metadata and of the third fingerprint metadata and confirms whether the two are identical. When the two pieces of directory summary information are identical and the two pieces of data summary information are identical, it is determined that the second fingerprint metadata is the same as the third fingerprint metadata.
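The three-stage comparison can be sketched as below. The function name `metadata_match` and the dictionary layout are assumptions for illustration; the keys reuse the `Ver`, `PFP` and `DFP` abbreviations from the metadata model in this description.

```python
def metadata_match(second: dict, third: dict) -> bool:
    """Return True when the cached (second) and server (third) metadata agree."""
    # The version number is checked first, so a mismatch is detected cheaply.
    if second["Ver"] != third["Ver"]:
        return False
    # Both the directory summary and the data summary must be identical.
    return second["PFP"] == third["PFP"] and second["DFP"] == third["DFP"]


cached = {"Ver": 2, "PFP": "p1", "DFP": "d1"}
remote_same = {"Ver": 2, "PFP": "p1", "DFP": "d1"}
remote_newer = {"Ver": 3, "PFP": "p1", "DFP": "d2"}
```

Only these three small fields need to be exchanged to decide whether the local cache can be trusted, which is what keeps the communication volume low.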

In this embodiment, by determining whether the second fingerprint metadata in the local cache is the same as the third fingerprint metadata of the server, the fingerprint metadata of the current version of the incremental file can be determined in advance at the client, and by comparing the version number, the directory summary information and the data summary information, the communication data volume and the communication times between the client and the server can be reduced, and the comparison result can be obtained quickly.

In an exemplary embodiment, the specific implementation of the step of determining the target delta data according to the respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata includes:

and determining updated fingerprint data corresponding to the first fingerprint metadata based on the fingerprint data set in the first fingerprint metadata and the fingerprint data set in the second fingerprint metadata.

Wherein the updated fingerprint data is used to represent fingerprint data of a change in the fingerprint data set of the first fingerprint metadata relative to the fingerprint data set of the second fingerprint metadata.

Specifically, the client acquires the fingerprint data set in the first fingerprint metadata and the fingerprint data set in the second fingerprint metadata corresponding to the incremental file. The client traverses the fingerprint data set of the first fingerprint metadata in order of index number and determines whether the current fingerprint data is identical to the fingerprint data contained in the second fingerprint metadata. If it is identical, the current fingerprint data is not processed. If the second fingerprint metadata does not contain the current fingerprint data, or the current fingerprint data differs from the fingerprint data contained in the second fingerprint metadata, the current fingerprint data is determined to be updated fingerprint data. After the first fingerprint metadata has been traversed, the updated fingerprint data, formed by a plurality of pieces of fingerprint data, is obtained. According to the updated fingerprint data, the client determines the corresponding file data among the block data of the incremental file, so as to obtain the target incremental data.
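A minimal sketch of this traversal follows. It assumes, as one possible reading of the text, that fingerprints are compared position by position using the index number; the function names and the short placeholder fingerprint strings are hypothetical.

```python
def find_updated(first_group: list[dict], second_group: list[dict]) -> list[dict]:
    """Return packets of the first group that are new or changed."""
    previous = {p["Index"]: p["CFP"] for p in second_group}
    updated = []
    for packet in sorted(first_group, key=lambda p: p["Index"]):
        old = previous.get(packet["Index"])
        # Missing in the old metadata, or present but different: updated.
        if old is None or old != packet["CFP"]:
            updated.append(packet)
    return updated


def extract_target_data(data: bytes, updated: list[dict]) -> list[bytes]:
    """Slice the file data that corresponds to each updated fingerprint."""
    return [data[p["Offset"]:p["Offset"] + p["Length"]] for p in updated]


data = b"abcDEFgh"
first = [
    {"Index": 1, "Offset": 0, "Length": 3, "CFP": "aa"},
    {"Index": 2, "Offset": 3, "Length": 3, "CFP": "XX"},  # modified block
    {"Index": 3, "Offset": 6, "Length": 2, "CFP": "cc"},  # newly added block
]
second = [
    {"Index": 1, "Offset": 0, "Length": 3, "CFP": "aa"},
    {"Index": 2, "Offset": 3, "Length": 3, "CFP": "bb"},
]
updated = find_updated(first, second)
target = extract_target_data(data, updated)
```

Because both fingerprint data sets are available on the client, this step needs no round trip to the server.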

Optionally, relative to the fingerprint data set in the second fingerprint metadata, the fingerprint data set in the first fingerprint metadata may contain newly added fingerprint data, modified fingerprint data, deleted fingerprint data, and so on. If newly added fingerprint data exists, the target incremental data corresponding to it is determined and sent to the server. If modified fingerprint data exists, the target modification data corresponding to it is determined and sent to the server, and the server overwrites the original data with the target modification data. If deleted fingerprint data exists, the deleted fingerprint data is sent to the server, the server determines the target deletion data corresponding to the deleted fingerprint data, and the target deletion data stored in the database is then deleted.

In this embodiment, the fingerprint data set in the first fingerprint metadata is compared with the fingerprint data set in the second fingerprint metadata to determine the updated fingerprint data that has changed, and the target incremental data corresponding to the updated fingerprint data is then determined. Because the target incremental data can be determined by comparing fingerprint data alone, its determination is completed without interaction with the server, which improves the efficiency of determining the target incremental data.

In an exemplary embodiment, the specific implementation process of step "synchronize target incremental data to a server" includes:

And sending the target incremental data to the server to trigger the server to receive the target incremental data under the condition that the fingerprint data of the target incremental data are not stored in the fingerprint database of the server.

The fingerprint database is a database of the server side and is used for storing fingerprint metadata of each incremental file.

Specifically, the client sends the target incremental data and the fingerprint data corresponding to the target incremental data to the server, and the server searches the fingerprint database to determine whether the fingerprint data corresponding to the target incremental data is stored there. If the server determines that the fingerprint data of the target incremental data is already stored in its fingerprint database, the target incremental data is already stored in the database of the server, and the server does not need to receive the target incremental data sent by the client. If the server determines that the fingerprint data of the target incremental data is not stored in its fingerprint database, it receives the target incremental data and the corresponding fingerprint data, stores the target incremental data in the database, and stores the corresponding fingerprint data in the fingerprint database.
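The server-side check can be illustrated with the in-memory sketch below. The class `FingerprintServer`, its attributes, and the use of a Python set and dictionary in place of real databases are all assumptions made for the example.

```python
class FingerprintServer:
    """Toy stand-in for the server: a fingerprint database plus a data store."""

    def __init__(self) -> None:
        self.fingerprint_db: set[str] = set()   # known block fingerprints
        self.data_db: dict[str, bytes] = {}     # block data keyed by fingerprint

    def receive(self, cfp: str, chunk: bytes) -> bool:
        """Store the block only if its fingerprint is not known yet.

        Returns True when the block was accepted, False when it was skipped.
        """
        if cfp in self.fingerprint_db:
            return False                        # already stored, nothing to do
        self.data_db[cfp] = chunk
        self.fingerprint_db.add(cfp)
        return True


server = FingerprintServer()
first_attempt = server.receive("fp-1", b"block data")
second_attempt = server.receive("fp-1", b"block data")
```

The second call is rejected, so the same block is never stored twice even if a client retries an upload.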

In this embodiment, the server searches the fingerprint database to determine whether the fingerprint data of the target incremental data is already stored, and thereby decides whether to receive the target incremental data. This prevents the server from storing the same target incremental data repeatedly and improves the accuracy and stability of file synchronization.

In an exemplary embodiment, after the step of synchronizing the target delta data to the server, the method further comprises:

and synchronizing the first fingerprint metadata to a local cache, and updating the second fingerprint metadata to obtain updated second fingerprint metadata.

Specifically, the client stores the first fingerprint metadata of the incremental file in the local cache, and overwrites the second fingerprint metadata corresponding to the incremental file with the first fingerprint metadata to obtain the updated second fingerprint metadata.

Optionally, the first fingerprint metadata is synchronized to the server, and the server updates the third fingerprint metadata based on the first fingerprint metadata to obtain updated third fingerprint metadata.

In this embodiment, by updating the second fingerprint metadata, it can be ensured that the locally cached fingerprint metadata is the latest version of fingerprint metadata, so that the comparison between the second fingerprint metadata and the third fingerprint metadata of the server is facilitated, and therefore, the stability of data synchronization between the client and the server is improved.

As shown in fig. 5, the following describes in detail a specific implementation procedure of the incremental data synchronization method according to one embodiment, including the following steps:

Step 1, partitioning the incremental file of the client into blocks. In one embodiment, a CDC (Content Defined Chunking) algorithm may be used to select suitable tangent points and window sizes.

Step 2, generating a hash fingerprint value for each piece of block data in a loop using the MD5 algorithm, and generating the fingerprint metadata of the current version.

Step 3, reading the fingerprint metadata of the previous version from the local cache.

Step 4, verifying whether the local cache information is outdated: determining whether the version number (Version), data summary information (DFP) and directory summary information (PFP) of the fingerprint metadata in the local cache are the same as those in the cloud; if they are not, first pulling the fingerprint metadata of the server back to the local cache so as to handle the conflict.

Step 5, comparing the fingerprint data (Chunk FP) of the newly generated file blocks (incremental data) with the fingerprint data of the previous version.

Step 6, uploading the fingerprint data corresponding to the data to be updated to the server and comparing it with the shared fingerprint database of the server. The shared fingerprint database contains the fingerprint data of all data contained in the synchronized incremental files, so the amount of incremental data to upload is further reduced. After the comparison, the incremental data that the server actually requires, and the fingerprint data corresponding to it, are determined.

Step 7, the client receives the server's indication of the incremental data and fingerprint data it requires, and sends the relevant fingerprint data and incremental data to the server.

Step 8, after receiving the data, the server generates an MD5 verification fingerprint, namely the data summary information DFP, for the new incremental file, and transmits it back to the client.

Step 9, the client receives the data summary information of the new incremental file and compares it with that of the local incremental file; after successful verification, it generates the latest fingerprint metadata and updates it to the local cache and the fingerprint database.

Some of the terms used are explained as follows:

Tangent point: the ultimate purpose of a data chunking algorithm is to determine where the data is to be cut. Executing the algorithm is a process of searching for cutting positions, and such a position is called a tangent point; that is, a tangent point is position information.

Block (Chunk): after the tangent points are determined, the byte stream between two adjacent tangent points in the data is called a block. A tangent point belongs to the block formed between it and the previous tangent point.

Byte drift: assume there is data A, and partitioning it with a data chunking algorithm generates a tangent point list LA. A byte is then inserted or deleted at some position in data A to form new data B, and the same data chunking algorithm is used to partition data B, giving a tangent point list LB. If all the tangent points in list LB that follow the modified byte have changed, byte drift is said to have occurred, and the data chunking algorithm is considered to have no resistance to byte drift.
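Byte drift is easy to reproduce with fixed-length chunking. The sketch below is an illustration only: the helper `fixed_chunks`, the 16-byte block size and the sample data are arbitrary choices. It inserts a single byte into a buffer and counts how many block fingerprints no longer match.

```python
import hashlib


def fixed_chunks(data: bytes, size: int) -> list[str]:
    """Fingerprint every fixed-length block of the data with MD5."""
    return [hashlib.md5(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


a = bytes(range(64))                 # data A: four 16-byte blocks
b = a[:10] + b"\xff" + a[10:]        # data B: one byte inserted at position 10

fa = fixed_chunks(a, 16)
fb = fixed_chunks(b, 16)

# Compare blocks pairwise; every block at or after the insertion has shifted.
changed = sum(1 for x, y in zip(fa, fb) if x != y)
```

Here all four original blocks differ after a one-byte insertion, so a fixed-length scheme would re-upload the whole file. This is the behavior content-defined chunking is meant to avoid.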

Incremental data: in incremental data collection, when two pieces of data are compared, the set of bytes that differ between them is the most precise increment, denoted set N. If the data is partitioned with a data chunking algorithm, the set of bytes contained in the differing blocks found by comparison is denoted set M, and N is a subset of M. The incremental data mentioned from this point on refers to set M. That is, the data increment collected by the data chunking algorithm is a superset of all the difference bytes.

Source data and incremental data (source and cloud): in the data synchronization process, the node whose data has changed initiates a synchronization request, and the node whose data has not changed receives the request and synchronizes its own data to match the data of the changed node. Within one synchronization, the changed data in the node that initiates the synchronization request is referred to as incremental data, and the unchanged data in the node that receives the synchronization request is referred to as source data.

In an exemplary embodiment, the step "chunks delta file for client" is implemented as shown in fig. 6.

Before incremental data collection, the file needs to be partitioned by means of a data chunking algorithm. When partitioning data, the appropriate positions of the tangent points in the file must first be determined. The set of tangent points forms $\{\mathrm{offset}_i\}$, and the data between every two adjacent tangent points $\mathrm{offset}_{i-1}$ and $\mathrm{offset}_i$ is a block, so the size of each window is $\mathrm{offset}_i - \mathrm{offset}_{i-1}$. Data chunking algorithms can be divided into fixed-length chunking and content-based chunking; because of the byte drift problem, fixed-length chunking is generally not adopted when collecting incremental data. The specific implementation process is as follows:

Step 1.1, starting from the first byte of the data, selecting a run of adjacent data of the window size; if this data satisfies a specific condition, setting a tangent point at the last byte of the data window;

Step 1.2, after a tangent point is set, selecting adjacent data of the window size starting from the byte following the window, and continuing to search for the next tangent point;

Step 1.3, if the condition is not satisfied, moving the window backward by one byte;

Step 1.4, continuing the judgment until a window satisfying the specific condition is found or the data ends;

Step 1.5, generating a fingerprint $\mathrm{FP}_i$ from the data $\{\mathrm{data}\}_i$ within each window using a hash algorithm.

Wherein fingerprint information of each block data (Chunk) is represented by an array mode, index represents a sequence number, CFP represents summary information of a single block, offset represents a tangential point position of the block in the data, namely an Offset position, and Length represents the Length of each Chunk.
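Steps 1.1 to 1.5 can be sketched as a sliding-window loop with a pluggable cut condition. The text does not state what the "specific condition" is, so the predicate used here (window byte sum divisible by 7) is purely a placeholder, and the function name `chunk_by_condition` is hypothetical.

```python
import hashlib
from typing import Callable


def chunk_by_condition(data: bytes, window: int,
                       is_cut: Callable[[bytes], bool]) -> list[tuple[int, int]]:
    """Return (offset, length) pairs for the blocks found in the data."""
    chunks = []
    chunk_start = 0      # first byte after the previous tangent point
    pos = 0              # start of the current window
    while pos + window <= len(data):
        if is_cut(data[pos:pos + window]):
            end = pos + window                     # step 1.1: cut at window end
            chunks.append((chunk_start, end - chunk_start))
            chunk_start = end
            pos = end                              # step 1.2: restart after window
        else:
            pos += 1                               # step 1.3: slide by one byte
    if chunk_start < len(data):                    # step 1.4: data ended
        chunks.append((chunk_start, len(data) - chunk_start))
    return chunks


data = bytes(range(100))
chunks = chunk_by_condition(data, 4, lambda w: sum(w) % 7 == 0)  # placeholder rule

# Step 1.5: one fingerprint per block.
fingerprints = [hashlib.md5(data[o:o + n]).hexdigest() for o, n in chunks]
```

Because the cut decision depends only on the bytes inside the window, an insertion early in the file shifts later tangent points along with the content instead of invalidating them.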

In incremental data collection, the blocks produced by the data chunking algorithm are used only to find the increment and are not stored on a physical disk, so the requirement on block-length stability is lower than in data deduplication technology. To improve resistance to byte drift and increase the accuracy of incremental data discovery, a CDC algorithm may be employed in one specific implementation to determine the tangent points and the window size.

In one example, the implementation procedure of the step of "blocking the incremental file of the client" includes, as shown in fig. 7:

Step 1.1.1, start: the first byte of the byte stream is read and an initial window size 1024 is set.

Step 1.1.2, maximum value detection: the maximum value is found within the current window. The calculation formula is as follows:

Max_Value = max(byte_array) or max(int_array)

Step 1.1.3, entropy calculation: the entropy of the data in the current window is calculated, reflecting how intensely the data changes. The calculation formula is as follows:

$$H(x) = -\sum_{i} p(x_i) \log p(x_i)$$

where $p(x_i)$ is the probability of the i-th element in the data set.

Step 1.1.4, condition check: whether the condition for setting a tangent point is satisfied is judged from the calculated maximum value and entropy.

The initial position is not tested, since it is necessarily a tangent point. At any other position, tangent-point detection is performed with the following rule: if the maximum value is less than its threshold and/or the entropy is less than its threshold, the current position is set as a tangent point.

Step 1.1.5, window adjustment: the window size is dynamically adjusted according to the maximum value and the entropy. For small files, the window size may be increased or reduced byte by byte; for large files such as virtual machine files, it may be increased or reduced by powers of 2, according to the characteristics of the file.

Combining the two features during adjustment captures both abrupt changes in the data and its complexity. The tangent point and window size are set from the maximum value and entropy of the data within the current window and from how these features vary.

Step 1.1.6, moving the window: the next byte is read, and steps 1.1.2 to 1.1.6 are repeated.

Step 1.1.7, data flow ends: when the complete byte stream is read, the algorithm ends.
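The procedure in steps 1.1.1 to 1.1.7 leaves the thresholds and the exact adjustment rule open, so the sketch below fills them with hypothetical choices: a maximum-value threshold of 250, an entropy threshold of 7.0 bits, "or" as the combining rule, and a window that doubles when entropy is low and halves when it is high. It is one possible reading, not the patented algorithm.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of the byte values in the window, in bits."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def find_cut_points(data: bytes, window: int = 1024,
                    max_threshold: int = 250, entropy_threshold: float = 7.0,
                    min_window: int = 64, max_window: int = 4096) -> list[int]:
    """Return cut points as exclusive end offsets (offset 0 is implicit)."""
    cuts: list[int] = []
    pos = 0
    while pos < len(data):
        win = data[pos:pos + window]
        peak = max(win)                       # step 1.1.2: maximum value
        entropy = shannon_entropy(win)        # step 1.1.3: entropy
        if peak < max_threshold or entropy < entropy_threshold:
            pos += len(win)                   # step 1.1.4: cut at window end
            cuts.append(pos)
        else:
            pos += 1                          # step 1.1.6: move one byte
        # Step 1.1.5 (hypothetical rule): grow on calm data, shrink on busy data.
        if entropy < entropy_threshold:
            window = min(max_window, window * 2)
        else:
            window = max(min_window, window // 2)
    if data and (not cuts or cuts[-1] != len(data)):
        cuts.append(len(data))                # step 1.1.7: close the last block
    return cuts


sample = bytes((i * 37 + i // 300) % 256 for i in range(5000))
cuts = find_cut_points(sample)
```

Recomputing the entropy for every window position is costly; a practical implementation would likely update the byte counts incrementally as the window moves.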

In one example, a fingerprint metadata model using JSON format is presented as follows.

{
    Ver: version number
    TS (Time Stamp): timestamp
    DFP (File Data Fingerprint): file data fingerprint data
    PFP (File Path Fingerprint): file path fingerprint data
    TC (Total Count) = N: total number of blocks
},
[
    {
        Index: 1
        CFP (Chunk Fingerprint): fingerprint data of block data 1
        Offset (File Offset): 0, tangent point position of block data 1
        Length (Chunk Length): length of block data 1
    }
    …
    {
        Index: N
        CFP (Chunk Fingerprint): fingerprint data of block data N
        Offset (File Offset): tangent point position of block data N
        Length (Chunk Length): length of block data N
    }
]

The root node of the structure uses the MD5 of the path (PFP) and the MD5 of the file content (DFP) as the directory summary information and data summary information of the file. In scenarios where file changes cannot be monitored, the client can first compare the PFP and DFP values with the local cache to find out which files have changed, and then perform incremental synchronization on each of those files.

Meanwhile, information such as a data version number and change time can be added to the data, so that when incremental data is synchronized, whether the data has changed can be judged from the fingerprint metadata, reducing the number of items to be synchronized in the data list. Here Ver represents the version number of the file, TS represents the update timestamp of the file, and TC represents the total number of Chunks in the data.

The fingerprint information of each Chunk is represented by an array, index represents an Index number, CFP represents summary information of a single Chunk, offset represents a tangential point position of the Chunk in data, namely an Offset position, and Length represents the Length of each Chunk.

The fingerprint metadata may also be represented in a binary format, where the size of the fingerprint data FP is 128 bits and the sequence number, cut point and length information is 32 bits, so that the size of the transferred and stored data can be further reduced in binary mode.
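One possible binary layout for a single block record is sketched below with Python's `struct` module. The field widths and byte order are assumptions of this sketch rather than values confirmed by the text: a raw MD5 digest (16 bytes) and three little-endian 32-bit unsigned integers for the sequence number, cut point and length.

```python
import hashlib
import struct

# Hypothetical record layout: Index, Offset, Length (uint32 each) + MD5 digest.
RECORD = struct.Struct("<III16s")


def pack_record(index: int, offset: int, length: int, chunk: bytes) -> bytes:
    """Serialize one block record into a fixed-size binary form."""
    return RECORD.pack(index, offset, length, hashlib.md5(chunk).digest())


def unpack_record(blob: bytes) -> tuple[int, int, int, bytes]:
    """Parse a record produced by pack_record."""
    return RECORD.unpack(blob)


blob = pack_record(1, 0, 5, b"hello")
index, offset, length, digest = unpack_record(blob)
```

Under these assumptions each record takes 28 bytes, whereas the hexadecimal text form of the MD5 digest alone takes 32 characters in JSON.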

In one example, as shown in FIG. 8, an example of the operation of file delta synchronization is illustrated.

At an initial time t1, the version of the source file (incremental file) is 1. At this time, the file is divided into blocks chunk1 to chunkN, and both the local cache and the server lack the original fingerprint metadata. Typically, in this case, all block data of chunk1 to chunkN and the corresponding fingerprint metadata are generated. At time t2, the version of the source file is upgraded to 2. During this period, chunk1 of the file is modified to chunk1'. To ensure that the backup data on the server is consistent with the source data, the client initiates a synchronization request. By comparison with the fingerprint metadata of version=1 on the local cache and the server, only the data of chunk1' is uploaded when incremental synchronization is employed. After the upload succeeds, the server and the client update the fingerprint metadata.

At time t3, the version of the source file is 3. At this time, chunk1' in the file has been deleted. The client initiates a synchronization request, but with incremental synchronization no data is uploaded, because the existing data has not changed. After the file information is synchronized, the fingerprint metadata of the server and the client are updated.

At time t4, the version of the source file is 4. At this time, a data block chunkN+1 has been appended to the file. The client initiates a synchronization request, and with incremental synchronization only the appended block chunkN+1 is uploaded. After the file is synchronized, the fingerprint metadata of the server and the client are updated again.

It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially either, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides an incremental data synchronization device for realizing the above-mentioned incremental data synchronization method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiment of one or more incremental data synchronization devices provided below may be referred to the limitation of the incremental data synchronization method hereinabove, and will not be repeated herein.

In an exemplary embodiment, as shown in fig. 9, there is provided an incremental data synchronization apparatus 900 comprising: a file blocking module 901, a metadata generation module 902, and a data synchronization module 903, wherein:

the file blocking module 901 is configured to block the incremental file based on the variable window and the tangent point to obtain a plurality of block data, and determine fingerprint data corresponding to each block data; the variable window and the tangent point are determined based on a block strategy corresponding to the incremental file;

a metadata generation module 902, configured to generate first fingerprint metadata of the delta file based on file data of the delta file and the fingerprint data; the first fingerprint metadata comprises summary information of the delta file and a fingerprint data group formed by the fingerprint data;

The data synchronization module 903 is configured to determine, when the second fingerprint metadata of the incremental file is the same as the third fingerprint metadata of the incremental file, target incremental data according to respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata, and synchronize the target incremental data to a server; the second fingerprint metadata is the fingerprint metadata of the delta file last synchronized in a local cache; the third fingerprint metadata is the fingerprint metadata of the delta file last synchronized in the server.

Further, the file blocking module 901 is specifically configured to:

Further, the summary information of the incremental file includes directory summary information and data summary information of the incremental file, and the metadata generation module 902 is specifically configured to:

Further, the device further comprises a judging module, specifically configured to:

Further, the data synchronization module 903 is specifically configured to:

Further, the data synchronization module 903 is specifically configured to: and sending the target incremental data to the server to trigger the server to receive the target incremental data under the condition that the fingerprint data of the target incremental data are not stored in a fingerprint database of the server.

Further, the device further comprises a cache synchronization module, specifically configured to: and synchronizing the first fingerprint metadata to the local cache, and updating the second fingerprint metadata to obtain updated second fingerprint metadata.

The modules in the incremental data synchronization apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.

In an exemplary embodiment, a computer device, which may be a terminal, is provided, and an internal structure thereof may be as shown in fig. 10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of incremental data synchronization. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in FIG. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In an exemplary embodiment, a computer device is also provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to meet the related regulations.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, resistive random access memory (ReRAM), magneto-resistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of the technical features involves no contradiction, it should be considered to fall within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application. They are described in detail, but this should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application shall be determined by the appended claims.

Claims

1. A method for incremental data synchronization, applied to a client, the method comprising:

2. The method of claim 1, wherein performing block processing on the delta file based on the variable window and the tangent point to obtain a plurality of block data, and determining fingerprint data corresponding to each block data, comprises:

3. The method of claim 1, wherein the summary information of the delta file comprises catalog summary information and data summary information of the delta file; the generating the first fingerprint metadata of the delta file based on the file data of the delta file and the fingerprint data includes:

4. The method of claim 3, wherein prior to determining target delta data from the respective fingerprint data sets of the first fingerprint metadata and the second fingerprint metadata, the method further comprises:

5. The method of claim 1, wherein said determining target delta data from respective fingerprint data sets of said first fingerprint metadata and said second fingerprint metadata comprises:

6. The method of claim 1, wherein synchronizing the target delta data to a server comprises:

7. The method of claim 1, wherein after synchronizing the target delta data to a server, the method further comprises:

8. An incremental data synchronization device, the device comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.