CN103902735A

CN103902735A - Application-aware data routing method and system for large-scale cluster deduplication

Info

Publication number: CN103902735A
Application number: CN201410158590.0A
Authority: CN
Inventors: 付印金; 胡谷雨; 倪桂强; 谢钧
Original assignee: PLA University of Science and Technology
Current assignee: PLA University of Science and Technology
Priority date: 2014-04-18
Filing date: 2014-04-18
Publication date: 2014-07-02
Anticipated expiration: 2034-04-18
Also published as: CN103902735B

Abstract

The invention discloses an application-aware data routing method for large-scale cluster deduplication and a large-scale backup storage cluster system. The method comprises the steps of (S10) obtaining backup file metadata, (S20) sensing the file application type, (S30) calculating deduplication storage node loads, (S40) selecting a file routing node, (S50) sending files to the target node, and (S60) deduplicating files within the node. The system comprises multiple backup clients, a backup server, and multiple deduplication storage servers. The method and system offer a high deduplication ratio, high node throughput, low system communication overhead, and balanced system load.

Description

Application-aware data routing method and system for large-scale cluster deduplication

Technical field

The invention belongs to the field of information storage and cluster computing, and in particular relates to an application-aware data routing method for large-scale cluster deduplication and to a large-scale backup storage cluster system.

Background technology

Data in the many backup storage systems that manage massive data sets is highly redundant. Cluster deduplication performs distributed, parallel data deduplication across a cluster of backup storage servers, and can meet the capacity and performance scalability demands of managing massive backup data. For building energy-saving, environmentally friendly, and efficient green data centers, cluster deduplication has become a core technology of current data center storage management.

To limit system overhead, cluster deduplication usually adopts a loosely coupled design and does not deduplicate data across nodes. The data sent by a backup client is first assigned to a deduplication storage server node by data routing, and each deduplication storage server then removes duplicate data content within its own node, independently and in parallel. Data routing directly affects the storage space utilization of backup data, the throughput of the deduplication storage server nodes, and the load balance and communication overhead of the deduplication storage server cluster. The data routing method is therefore critical to improving the efficiency of cluster deduplication.

At present there are three main data routing methods for cluster deduplication: block-level routing based on a distributed hash table, superblock-level routing based on state information, and file-level routing based on similarity. Block-level routing based on a distributed hash table, as in the USENIX FAST '09 conference paper "HYDRAstor: a Scalable Secondary Storage" (published 2009-02-23) and the Chinese invention patent application "Distributed data deduplication system and method thereof" (application number 201110461322.2, published 2011-12-28), assigns data block feature values to different deduplication nodes through a distributed hash table. Although this method effectively improves space utilization and reduces communication overhead, it cannot preserve data locality within a node, which degrades system throughput. Superblock-level routing based on state information, as in the USENIX FAST '11 conference paper "Tradeoffs in Scalable Data Routing for Deduplication Clusters" (published 2011-02-14), merges many consecutive data blocks produced by chunking into superblocks of uniform granularity. Before each superblock is routed, the method must query every node for how many of the superblock's data blocks are already stored there, and then, subject to load balance, route the superblock to the node holding the most duplicate blocks. This strategy achieves a high data reduction ratio while maintaining load balance, but its broadcast-style communication overhead and the frequent block fingerprint lookups within nodes seriously degrade system performance. File-level routing based on similarity, as in the IEEE/ACM MASCOTS '09 conference paper "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup" (published 2009-09-21), relies on Broder's min-wise independent permutation theorem and selects the minimum data block fingerprint in a file as the file's similarity feature, then routes similar files to the same deduplication storage server node through a distributed hash mechanism. When similarity in the data stream is low, however, file similarity cannot be detected and the cluster deduplication of backup data is poor.
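To make the similarity-based file-level routing described above concrete, here is a minimal sketch: it takes the minimum chunk fingerprint of a file as the file's representative feature and maps that feature to a node by modulo hashing. The fixed-size chunking, the use of SHA-1, and the names `chunk_fingerprints` and `route_by_min_fingerprint` are assumptions made for illustration only; the cited paper's actual design is not reproduced here.

```python
import hashlib


def chunk_fingerprints(data: bytes, chunk_size: int = 4096) -> list[str]:
    """Fingerprint fixed-size chunks with SHA-1 (chunking scheme is an assumption)."""
    prints = [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
              for i in range(0, len(data), chunk_size)]
    # An empty file still needs one feature, so fingerprint the empty string.
    return prints or [hashlib.sha1(b"").hexdigest()]


def route_by_min_fingerprint(data: bytes, num_nodes: int, chunk_size: int = 4096) -> int:
    """Route a file by its minimum chunk fingerprint, taken modulo the node count."""
    representative = min(chunk_fingerprints(data, chunk_size))
    return int(representative, 16) % num_nodes
```

Two files that share the chunk with the smallest fingerprint land on the same node without any query to the cluster, which is why this style of routing is cheap in messages but depends on files actually being similar.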

In summary, the problem with the prior art is that, for cluster deduplication at the scale of hundreds to thousands of data center nodes, it suffers from a low deduplication ratio, low node throughput, high system communication overhead, and unbalanced system load.

Summary of the invention

The object of the present invention is to provide an application-aware data routing method and system for large-scale cluster deduplication that offer a high deduplication ratio, high node throughput, low system communication overhead, and balanced system load.

The technical solution that achieves this object is an application-aware data routing method for large-scale cluster deduplication, implemented in a large-scale backup storage cluster system that comprises multiple backup clients (100), one backup server (200), and multiple deduplication storage servers (300), characterized in that it comprises the following steps:

S10) Obtain backup file metadata: the backup client (100) sends to the backup server (200) a file backup request message containing file metadata such as file name, user, and size;

S20) Sense the file application type: the backup server (200) classifies the application type of the backup file according to the file metadata, queries the application index structure, and obtains a list of candidate deduplication storage server (300) nodes that can store application files of that type;

S30) Calculate deduplication storage node loads: the backup server (200) obtains the real-time dynamic load information of each deduplication storage server (300) node by querying the application-aware index structure, and from this node load information and the backup file metadata computes a list of low-load deduplication storage server (300) nodes that can preserve load balance;

S40) Select the file routing node: the backup server (200) analyzes the candidate deduplication storage server node list and the low-load deduplication storage server node list, chooses a low-load candidate server node that stores application data of the same type as the file routing target node, and returns the result to the backup client (100);

S50) Send files to the target node: according to the file routing decision returned by the backup server (200), the backup client (100) sends each file in the backup session to the corresponding target deduplication storage server (300) node;

S60) Deduplicate files within the node: the deduplication storage server (300) node deduplicates application files of different types independently, according to the differences in their data formats and contents.

A large-scale backup storage cluster system for implementing the application-aware data routing method for large-scale cluster deduplication comprises multiple backup clients (100), one backup server (200), and multiple deduplication storage servers (300), characterized in that:

The backup client (100) is configured to send to the backup server (200) a file backup request message containing file metadata such as file name, user, and size;

The backup server (200) is configured to sense the application type of the backup file according to the file metadata, query the application index structure, and obtain a list of candidate deduplication storage server (300) node numbers that can store application files of that type;

The backup server (200) is configured to obtain the real-time dynamic load information of each deduplication storage server (300) node by querying the application-aware index structure, and to compute, from this node load information and the backup file metadata, a list of low-load deduplication storage server (300) nodes that can preserve load balance;

The backup server (200) is configured to analyze the candidate deduplication storage server node list and the low-load deduplication storage server node list, choose a low-load candidate node that stores application data of the same type as the file routing target node, and return the result to the backup client (100);

The backup client (100), according to the file routing decision returned by the backup server (200), sends each file in the backup session to the corresponding target deduplication storage server (300) node;

The deduplication storage server (300) node is configured to deduplicate application files of different types independently, according to the differences in their data formats and contents.

Compared with the prior art, the present invention has the following notable advantages:

1. High deduplication ratio: the application-aware data routing policy assigns similar data to the same deduplication storage server node, which reduces data overlap between nodes, and the files within each deduplication storage server node are deduplicated independently per application;

2. High node throughput: data is distributed at file granularity, which preserves good data access locality;

3. Balanced system load: storage resources are allocated dynamically according to the actual physical storage capacity of each deduplication storage server node, which keeps the whole backup storage cluster system in load balance;

4. Low communication overhead: data routing decisions are made at application granularity, which greatly reduces the message communication overhead of the system.

In summary, the invention provides an application-aware data routing method for cluster deduplication that supports backup storage cluster systems of hundreds to thousands of nodes. It not only greatly reduces the storage space used by backup data, but also improves the deduplication throughput of the deduplication storage server nodes, reduces communication overhead inside the cluster system, and keeps the load of the deduplication storage server nodes balanced.

The invention is described in further detail below with reference to the drawings and specific embodiments.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the large-scale backup storage cluster system of the present invention.

Fig. 2 is the main flow chart of the application-aware data routing method for large-scale cluster deduplication of the present invention.

Fig. 3 is a schematic diagram of sensing the file application type.

Fig. 4 is a flow chart of the file routing node selection step in Fig. 2.

Embodiment

As shown in Fig. 1, the large-scale backup storage cluster system of the present invention comprises multiple backup clients 100, a backup server 200, and multiple deduplication storage servers 300.

The backup client 100 sends to the backup server 200 a file backup request message containing file metadata such as file name, user, and size. According to the file routing decision returned by the backup server 200, the backup client 100 sends each file in the backup session to the corresponding target deduplication storage server 300 node.

Each backup client 100 comprises a file I/O module 101 and a backup request module 102. The backup request module 102 conducts the file backup session with the backup server 200, and the file I/O module 101 backs up each file to the corresponding deduplication storage server 300 according to the file routing decision returned by the backup server 200.

The backup server 200 senses the application type of the backup file according to the file metadata, queries the application index structure, and obtains a list of candidate deduplication storage server 300 node numbers that can store application files of that type. The backup server 200 is the core of the method of the invention.

The backup server 200 obtains the real-time dynamic load information of each deduplication storage server 300 node by querying the application-aware index structure, and from this node load information and the backup file metadata computes a list of low-load deduplication storage server 300 nodes that can preserve load balance.

The backup server 200 analyzes the candidate deduplication storage server node list and the low-load deduplication storage server node list, chooses a low-load candidate node that stores application data of the same type as the file routing target node, and returns the result to the backup client 100.

The backup server 200 comprises a backup session management module 201, an application-aware module 202, a file routing decision module 203, and a load balancing module 204. The backup session management module 201 receives backup requests from the backup clients 100, groups files by backup session from the same user, and feeds the file routing decision back to the backup client 100. The application-aware module 202 classifies files by application type. The load balancing module 204 keeps the system load of the deduplication storage server cluster balanced. The file routing decision module 203 assigns application files of the same type to the same low-load deduplication storage server node, feeds the routing target node information back to the backup client 100, and records the mapping from application files to deduplication storage server nodes for use in file restoration.

The deduplication storage server 300 node deduplicates application files of different types independently, according to the differences in their data formats and contents.

The deduplication storage server 300 comprises a data deduplication engine 301, a file metadata management module 302, and a data block management module 303. The data deduplication engine 301 deduplicates backup files and, according to the characteristics of each application, deduplicates the files of each application type independently. The file metadata management module 302 manages the metadata and block fingerprint index information of the files stored on the node, and the data block management module 303 manages the unique, non-duplicate data blocks that remain after deduplication.
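The division of labor among the three modules of a deduplication storage server can be sketched roughly as follows. This is an illustrative model only: the class `DedupNode`, the fixed 4096-byte chunking, the SHA-1 fingerprints, and the in-memory dictionaries are all assumptions, since the patent does not specify the chunking or fingerprinting scheme. The point shown is that each application type has its own fingerprint index, so duplicates are detected only among files of the same type.

```python
import hashlib
from collections import defaultdict


class DedupNode:
    """Hypothetical model of one deduplication storage server node."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        # File metadata management (302): per-application fingerprint index
        # and per-file recipes (ordered fingerprint lists).
        self.fingerprints = defaultdict(set)   # app_type -> {fingerprint}
        self.recipes = {}                      # file name -> (app_type, [fingerprint])
        # Data block management (303): unique chunks only.
        self.chunks = {}                       # (app_type, fingerprint) -> bytes

    def store(self, name: str, app_type: str, data: bytes) -> int:
        """Deduplication engine (301): returns the physical bytes newly stored."""
        added = 0
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in self.fingerprints[app_type]:
                self.fingerprints[app_type].add(fp)
                self.chunks[(app_type, fp)] = chunk
                added += len(chunk)
            recipe.append(fp)
        self.recipes[name] = (app_type, recipe)
        return added

    def restore(self, name: str) -> bytes:
        """Rebuild a file from its recipe and the unique chunk store."""
        app_type, recipe = self.recipes[name]
        return b"".join(self.chunks[(app_type, fp)] for fp in recipe)
```

The value returned by `store` is the physical capacity actually added by the file, which is the quantity a node would report back so that load information stays current.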

As shown in Fig. 2, the application-aware data routing method for large-scale cluster deduplication of the present invention is based on a scalable cluster deduplication system architecture. The large-scale backup storage cluster system, shown in Fig. 1, comprises multiple backup clients 100, a backup server 200, and multiple deduplication storage servers 300.

The application-aware data routing method for large-scale cluster deduplication of the present invention comprises the following steps:

S10) Obtain backup file metadata: the backup client 100 sends to the backup server 200 a file backup request message containing file metadata such as file name, user, and size.

S20) Sense the file application type: the application-aware module 202 of the backup server 200 classifies the application type of the backup file according to the file metadata obtained by the backup session management module 201, queries the application index structure, and obtains a list of candidate deduplication storage server 300 nodes that can store application files of that type.

As shown in Fig. 3, the step of sensing the file application type (S20) comprises:

S21) Obtain file metadata: the backup server 200 obtains the file metadata in the backup request, including the file name 230, user 231, and size 232. The file name 230 consists of a prefix and a suffix, and the application type is determined by the suffix. For example, for Test.doc the prefix is Test and the suffix is doc, so the corresponding application type is a Word document in doc format.

S22) Query the application index structure: the application index structure is queried with the application type determined from the file name. The structure contains index entries such as application type 233, node number 234, and data volume 235.

Here, the application type 233 is the file name suffix of the backup file, the node number 234 is the number of the deduplication storage server node that stores application files of that type, and the data volume 235 is the physical amount of data of application files of that type stored on the same node. In the example application index structure, the entries matching the doc type are the first and third rows.

S23) Obtain candidate deduplication storage server node numbers: the numbers of the deduplication storage server nodes that store files of the same application type are looked up in the application index structure, and the result is saved in the candidate deduplication storage server node list 236 (LIST₁). As shown in Fig. 3, nodes 1 and 2 are both found to store application files of the doc type.
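Steps S21 to S23 can be sketched as below. The `IndexEntry` record and both function names are hypothetical, the data volumes and the middle row of the example index are invented for illustration, and lowercasing the suffix is an added assumption not stated in the text. The example index mirrors the description of Fig. 3 only insofar as its first and third rows match the doc type on nodes 1 and 2.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class IndexEntry:
    app_type: str     # application type 233: the file name suffix, e.g. "doc"
    node_id: int      # node number 234
    data_volume: int  # data volume 235, in physical bytes


def application_type(file_name: str) -> str:
    """S21: derive the application type from the file name suffix."""
    return PurePath(file_name).suffix.lstrip(".").lower()


def candidate_nodes(index: list[IndexEntry], app_type: str) -> list[int]:
    """S22 and S23: collect the nodes already storing this type (LIST1)."""
    return sorted({entry.node_id for entry in index if entry.app_type == app_type})


# Hypothetical index contents; only the doc rows follow the Fig. 3 description.
index = [
    IndexEntry("doc", 1, 500),
    IndexEntry("pdf", 3, 200),
    IndexEntry("doc", 2, 300),
]
list1 = candidate_nodes(index, application_type("Test.doc"))
```

An application type with no entry in the index yields an empty candidate list, which is the case the later node selection step has to handle by falling back to any low-load node.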

S30) Calculate deduplication storage node loads: the load balancing module 204 of the backup server 200 obtains the real-time dynamic load information of each deduplication storage server 300 node by querying the application-aware index structure, and from this node load information and the backup file metadata computes the list LIST₂ of low-load deduplication storage server 300 nodes that can preserve load balance.

The step of calculating deduplication storage node loads (S30) comprises:

S31) Calculate the physical capacity already used by each deduplication storage server node: the used physical capacity C_i of deduplication storage server node i can be expressed as:

$$C_i = \sum_{j=1}^{K} C_{ij}, \quad i = 1, 2, \ldots, N;$$

where N is the number of server nodes in the deduplication storage server cluster, K is the number of application file types stored on node i, and C_ij is the physical capacity occupied by application type j on deduplication storage server node i, obtained by querying the application index structure;

S32) Find the low-load deduplication storage server nodes: when C_i + S < T_i, node i is judged to be a low-load node and node number i is added to LIST₂,

where T_i is the load threshold of deduplication storage server node i, S is the size of the backup file, and LIST₂ is the low-load deduplication storage server node list.
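A minimal sketch of steps S31 and S32 follows. It assumes that C_i is the sum of the per-application capacities C_ij over the K application types on node i, which is what the definitions of K and C_ij suggest. The function names and the dictionary layout of the load information are hypothetical.

```python
def used_capacity(capacity_by_type: dict[str, int]) -> int:
    """S31: C_i, taken here as the sum of per-application capacities C_ij."""
    return sum(capacity_by_type.values())


def low_load_nodes(node_capacities: dict[int, dict[str, int]],
                   thresholds: dict[int, int],
                   file_size: int) -> list[int]:
    """S32: node i is low-load when C_i + S < T_i; the result is LIST2."""
    return [node for node, per_type in sorted(node_capacities.items())
            if used_capacity(per_type) + file_size < thresholds[node]]
```

For instance, with a threshold of 100 on every node and a file of size 10, a node already holding 90 is excluded because 90 + 10 is not strictly below 100, while a node holding 20 qualifies. Per-node thresholds also allow nodes of different physical capacity to be mixed in one cluster.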

S40) Select the file routing node: the file routing decision module 203 of the backup server 200 analyzes the candidate deduplication storage server node list LIST₁ and the low-load deduplication storage server node list LIST₂, chooses a low-load candidate node that stores application data of the same type as the file routing target node, and returns the result to the backup client 100.

As shown in Fig. 4, the step of selecting the file routing node (S40) comprises:

S41) Input the list LIST₁ of candidate deduplication storage server nodes that hold files of the same application, and the low-load deduplication storage server node list LIST₂;

S42) Determine whether the intersection LIST₁ ∩ LIST₂ of the two node lists is empty; if so, go to step S43; otherwise go to step S46;

S43) Determine whether the low-load deduplication storage server node list LIST₂ is empty; if so, go to step S44; otherwise go to step S45;

S44) Issue a warning that the load of the deduplication storage server cluster is too high, and end the process;

S45) Choose one node from the low-load deduplication storage server node list LIST₂;

S46) Choose one node from the low-load candidate subset LIST₁ ∩ LIST₂ and return it as the target node.
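The decision flow of steps S41 to S46 can be sketched as follows. The text only says that "one" node is chosen, so the tie-breaking policy here (the lowest node number, via a replaceable `choose` callable) is an assumption, as are the function name and the exception used to represent the overload warning of step S44.

```python
from typing import Callable, Iterable


class ClusterOverloadedError(RuntimeError):
    """Stands in for the S44 warning: no low-load node is available."""


def select_routing_node(list1: Iterable[int], list2: Iterable[int],
                        choose: Callable[[set[int]], int] = min) -> int:
    """S41: take LIST1 (candidates) and LIST2 (low-load nodes) as input."""
    candidates = set(list1)
    low_load = set(list2)
    both = candidates & low_load
    if both:                      # S42: intersection is not empty
        return choose(both)       # S46: low-load node already holding this type
    if not low_load:              # S43: no low-load node at all
        raise ClusterOverloadedError("deduplication storage cluster load too high")  # S44
    return choose(low_load)       # S45: fall back to any low-load node
```

Passing a different `choose`, for example one that prefers the node with the least used capacity, changes the placement policy without touching the decision flow itself.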

S50) Send files to the target node: according to the file routing decision returned by the backup server 200, the backup client 100 sends each file in the backup session to the corresponding target deduplication storage server 300 node.

S60) Deduplicate files within the node: the deduplication storage server 300 node deduplicates application files of different types independently, according to the differences in their data formats and contents.

The data deduplication engine 301 of the deduplication storage server node 300 performs optimized data deduplication on application files of different types independently, according to the differences in their data formats and contents, and reports the physical capacity added by storing the deduplicated file back as a message that is used to update the application index structure of the backup server 200. The file metadata management module 302 and the data block management module 303 respectively manage the metadata of the files stored on the node (including block fingerprint index information) and the unique, non-duplicate data blocks that remain after deduplication.
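The feedback path described above, in which a node reports the physical capacity added by a deduplicated file and the backup server updates its application index, might look like the following sketch. The class `ApplicationIndex`, its method names, and the in-memory dictionary are hypothetical; the patent does not describe the message format or how the index is stored.

```python
from collections import defaultdict


class ApplicationIndex:
    """Hypothetical in-memory form of the backup server's application index."""

    def __init__(self) -> None:
        # (application type, node number) -> physical bytes stored
        self._volume: dict[tuple[str, int], int] = defaultdict(int)

    def record_growth(self, app_type: str, node_id: int, added_bytes: int) -> None:
        """Apply a node's report of capacity added after deduplication."""
        if added_bytes < 0:
            raise ValueError("added_bytes must be non-negative")
        self._volume[(app_type, node_id)] += added_bytes

    def nodes_for(self, app_type: str) -> list[int]:
        """Nodes that already store files of this application type."""
        return sorted({node for (kind, node) in self._volume if kind == app_type})

    def used_capacity(self, node_id: int) -> int:
        """Total physical bytes recorded for a node across all application types."""
        return sum(size for (_, node), size in self._volume.items() if node == node_id)
```

Because the index records physical rather than logical size, a file that is mostly duplicate adds little to a node's recorded load, so later routing decisions reflect the space actually consumed.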

The present invention optimizes cluster deduplication by exploiting application awareness, and thereby provides a data routing technique that achieves both backup storage space savings and improved cluster system scalability. The invention can be applied in network backup software, distributed file systems, and cloud storage system software to readily achieve efficient parallel data deduplication.

The present invention may of course have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims

1. An application-aware data routing method for large-scale cluster deduplication, the method being implemented in a large-scale backup storage cluster system comprising multiple backup clients (100), one backup server (200), and multiple deduplication storage servers (300), characterized in that it comprises the following steps:

S40) Select the file routing node: the backup server (200) analyzes the candidate deduplication storage server node list and the low-load deduplication storage server node list, chooses a low-load candidate node that stores application data of the same type as the file routing target node, and returns the result to the backup client (100);

2. The application-aware data routing method according to claim 1, characterized in that the step of sensing the file application type (S20) comprises:

S21) Obtain file metadata: the backup server (200) obtains the file metadata in the backup request, including the name, user, and size of the file; the file name consists of a prefix and a suffix, and the application type is determined by the suffix;

S22) Query the application index structure: the application index structure is queried with the application type determined from the file name, the application index containing index entries such as application type, node number, and data volume;

S23) Obtain candidate deduplication storage server node numbers: the numbers of the deduplication storage server nodes that store files of the same application type are looked up in the application index structure, and the result is saved in the candidate deduplication storage server node list.

3. The application-aware data routing method according to claim 1, characterized in that the step of calculating deduplication storage node loads (S30) comprises:

S31) Calculate the physical capacity already used by each deduplication storage server node: the used physical capacity C_i of deduplication storage server node i can be expressed as

$$C_i = \sum_{j=1}^{K} C_{ij}, \quad i = 1, 2, \ldots, N;$$

S32) Find the low-load deduplication storage server nodes: when C_i + S < T_i, node i is judged to be a low-load node and node number i is added to the low-load deduplication storage server node list;

where T_i is the load threshold of deduplication storage server node i and S is the size of the backup file.

4. The application-aware data routing method according to claim 1, characterized in that the step of selecting the file routing node (S40) comprises:

S42) Determine whether the intersection LIST₁ ∩ LIST₂ of the two node lists is empty; if so, go to step S43; otherwise go to step S46;

5. A large-scale backup storage cluster system for implementing the application-aware data routing method according to claim 1, comprising multiple backup clients (100), one backup server (200), and multiple deduplication storage servers (300), characterized in that:

The backup client (100) is configured to send to the backup server (200) a file backup request message containing file metadata such as file name, user, and size;

6. The large-scale backup storage cluster system according to claim 5, characterized in that:

Each backup client (100) comprises a file I/O module (101) and a backup request module (102); the backup request module (102) is configured to conduct the file backup session with the backup server (200), and the file I/O module (101) is configured to back up each file to the corresponding deduplication storage server (300) according to the file routing decision returned by the backup server (200);

The backup server (200) comprises a backup session management module (201), an application-aware module (202), a file routing decision module (203), and a load balancing module (204); the backup session management module (201) is configured to receive backup requests from the backup clients (100), group files by backup session from the same user, and feed the file routing decision back to the backup client (100); the application-aware module (202) is configured to classify files by application type; the load balancing module (204) is configured to keep the system load of the deduplication storage server cluster balanced; the file routing decision module (203) is configured to assign application files of the same type to the same low-load deduplication storage server node, feed the routing target node information back to the backup client (100), and record the mapping from application files to deduplication storage server nodes for use in file restoration.

The deduplication storage server (300) comprises a data deduplication engine (301), a file metadata management module (302), and a data block management module (303); the data deduplication engine (301) is configured to deduplicate backup files and, according to the characteristics of each application, to deduplicate the files of each application type independently; the file metadata management module (302) is configured to manage the metadata and block fingerprint index information of the files stored on the node; and the data block management module (303) is configured to manage the unique, non-duplicate data blocks that remain after deduplication.