CN101183323A

CN101183323A - Data stand-by system based on finger print

Info

Publication number: CN101183323A
Application number: CNA2007101687158A
Authority: CN
Inventors: 冯丹; 刘景宁; 杨天明; 周可; 牛中盈; 张航; 刘高
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2007-12-10
Filing date: 2007-12-10
Publication date: 2008-05-21
Anticipated expiration: 2027-12-10
Also published as: CN100547555C

Abstract

The invention relates to a data backup system based on fingerprints, belonging to the technical field of computer storage backup, which aims at reducing management, storage and network overhead of data backup and improving backup performance. The invention comprises a backup server, a backup agent, a storage server and a Web server which complete data backup and recovery through mutual network communication. The invention is characterized in that redundant data of backup files is recognized with the file segmentation technology based on anchors, thus the invention has the advantages that the modification stability is good and computation cost is low; data segmentations with fingerprints as the index are stored on a plurality of disk arrays of the storage server so as to eliminate backup of redundant data and save disk storage space; the data segmentations are not erased once stored and can be continuously appended on the disk so as to eliminate disk storage fragments; since the effective backup buffer strategy is adopted, the invention also has the advantages of reduced network overhead of backup, increased data backup speed and lowered backup influence to application servers.

Description

A fingerprint-based data backup system

Technical field

The invention belongs to the field of computer storage backup, and specifically relates to a data backup system.

Background technology

In the current information age, data is a precious resource for enterprises and individuals alike. At best, data loss disrupts business continuity and costs an enterprise its competitive edge for a time; at worst, it forces the business to close. Data loss has many causes, including hardware and software faults, operator error or sabotage, and force majeure such as disasters and war. The traditional way to protect data against such accidents is to copy it periodically to removable media such as tape or optical disc, and then to transport the media offline to a comparatively safe place so that the data can be restored when necessary. This traditional method has significant disadvantages: (1) Removable storage media such as tapes and optical discs wear out or become damaged over time, which lowers their reliability and makes them unsuitable for long-term data storage. (2) Tape, the medium commonly used to back up large volumes of data, is slow to read and write. Because it is a sequential-access device, restoring data usually involves frequent mechanical rewinding, and if the backup data is spread across several tapes, loading and unloading them takes additional time. Backup and recovery with tape is therefore quite time-consuming. (3) Dedicated staff must be employed to transport the backup data to a remote site and to keep it secure during transport and storage. Traditional data backup thus requires manual intervention for many tasks, and is costly and tedious.

To improve the efficiency of backup and recovery and overcome the shortcomings of traditional data protection, well-known IT companies and research institutions have developed a wide variety of data backup systems over the past two decades. These include IBM's TotalStorage; HP's OpenView storage mirroring software, CASA, XPCA and EVACA; EMC's SRDF and MirrorView; and VERITAS's NetBackup. These commercial systems have no data deduplication function. To store the large amount of redundant data produced by backups, they often rely on disk-to-tape (D2T) technology: high-speed disks serve as a backup buffer to improve online backup efficiency, and the backup data in the disk buffer is then moved in the background to low-speed, high-capacity media such as tape libraries or optical disc servers. The back-end storage devices therefore still require considerable manpower and resources for routine maintenance.

Compared with tape storage, disk storage is easier to manage and faster to access. As disk storage technology has developed, backup systems that store data on disk have attracted increasing attention. Current disk storage technology makes it easy to build storage systems at the TB or even PB scale, and the falling price per bit has made permanent archiving on disk practical. For a disk-based backup system, storing backup data permanently without erasing it has several advantages. First, data can be written to disk continuously, and no disk fragmentation arises from space reclamation. Second, the user's data history is preserved completely, so the user can easily browse any historical version of a file. Third, the user's backup data is protected against accidental deletion of important data.

The greatest challenge for a permanent-storage, disk-based backup system is the user's ever-growing backup data. Enterprise data is usually highly redundant: the system holds large amounts of duplicate data and files, and the successive edited versions of a file share much of their content. The file-based techniques in wide use today cannot identify redundant data between files, so more and more duplicate data is backed up into the system. This not only lowers the disk space utilization of the backup system, but also sends large amounts of redundant data over the network needlessly, which increases the network overhead of backup and lengthens backup time.

It is therefore worthwhile to develop a permanent-storage, disk-based backup system that adopts new backup techniques to remove redundant data from backups and improve the storage efficiency of the system.

Summary of the invention

The present invention proposes a fingerprint-based data backup system. The system stores backup data permanently on disk and uses a fingerprint-based backup technique to eliminate redundant data from backups. Its purpose is to reduce the management, storage and network overhead of data backup and to improve backup performance.

The fingerprint-based data backup system of the present invention comprises a backup server, backup agents, a storage server and a Web server, which communicate with one another over a network to perform data backup and recovery, and is characterized in that:

The backup server holds a configuration file and a catalog database. The configuration file records user-defined job objects; a job object comprises attributes that specify how the system runs a job, and the backup server controls the whole backup and recovery process through job objects. The catalog database stores job records, which hold the management information of job runs;

A backup agent is installed on every host in the network whose data needs to be backed up. During backup, the backup agent reads the files to be backed up from the file system of its host, divides each file into chunks based on anchors, computes the fingerprint of each chunk, and sends the fingerprints and those data chunks that are needed to the storage server over the network. During recovery, the backup agent receives file data from the storage server over the network and writes it to a specified directory in the file system of its host;

The storage server is equipped with large-capacity disk arrays, which are the destination of backed-up data. During backup, it receives fingerprints or data chunks from the corresponding backup agent over the network, stores the data chunks on disk, and builds the file index. During recovery, it reconstructs files from the large-capacity disk arrays according to the file index and sends the file data to the corresponding backup agent over the network;

The Web server provides the browser/server (B/S) web management interface of the system. By logging in to the Web server, a user can instruct the system to perform interactive backup or restore jobs, monitor the running status of automatically scheduled jobs, modify the configuration file of the backup server, customize job objects, and manage devices.

The fingerprint-based data backup system is further characterized in that the backup server comprises a backup server initialization module, a command listening module, a command processing module, a job processing module and a network communication module;

The backup server initialization module performs initialization: it reads the configuration file, builds the in-memory resource linked lists, checks the state of the catalog database, ensures the consistency and integrity of the data in the configuration file and the catalog database, opens the command listening port to accept user commands from the Web server, initializes the job queue and the user command queue, loads job objects into the job queue, and starts the job and network listening services;

The command listening module is a network listening thread created by the system. It authenticates connection requests from the Web server, so that only a Web server authorized by the system can connect, and listens for command requests sent by the authenticated Web server. When a command request arrives, it appends the request to the user command queue to await processing;

The command processing module comprises a user command queue and N command worker threads. When the user command queue overflows, the command listening module goes to sleep. The command worker threads continually read commands from the user command queue and execute them, performing different functions according to the command. When the command listening module adds a command to the user command queue, a new command worker thread is created if no command worker thread is currently idle and the number of active command worker threads has not reached N. Each time a command worker thread reads a command from the user command queue, it checks the state of the command listening module and wakes it if it is asleep;

The job processing module comprises a job queue, L job worker threads and a job queue loading thread. When the job queue overflows, the job queue loading thread goes to sleep. The job worker threads continually take job objects from the job queue and execute them, invoking different resources and performing different functions according to the attributes of the job object. The job queue loading thread performs job scheduling: it checks the scheduling policy attribute of every job object in the job resource list and adds the job objects that are due to run to the job queue; a new job worker thread is created if no job worker thread is currently idle and the number of active job worker threads has not reached L. Each time a job worker thread reads a job object from the job queue, it checks the state of the job queue loading thread and wakes it if it is asleep;

The network communication module wraps the standard network communication application programming interface (API) and provides a network communication interface to the command worker threads and job worker threads. This interface implements the data transfer protocol among the backup server, the backup agents and the storage server.
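The queue-and-worker arrangement of the command processing and job processing modules can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class name `LazyWorkerPool`, its parameters and the handler are hypothetical, and the blocking behaviour of Python's `queue.Queue` stands in for the explicit sleep and wake-up of the listening thread described above.

```python
import queue
import threading


class LazyWorkerPool:
    """Bounded queue served by at most max_workers threads, created on demand."""

    def __init__(self, max_workers, queue_size, handler):
        self._queue = queue.Queue(maxsize=queue_size)
        self._max_workers = max_workers   # N (or L) in the text
        self._handler = handler
        self._lock = threading.Lock()
        self.worker_count = 0             # worker threads created so far
        self._idle = 0                    # workers currently waiting for work

    def submit(self, item):
        # Create a worker only if none is idle and the limit is not reached.
        with self._lock:
            if self._idle == 0 and self.worker_count < self._max_workers:
                self.worker_count += 1
                threading.Thread(target=self._run, daemon=True).start()
        # put() blocks while the queue is full, so the producer "sleeps"
        # and resumes as soon as a worker removes an item.
        self._queue.put(item)

    def _run(self):
        while True:
            with self._lock:
                self._idle += 1
            item = self._queue.get()
            with self._lock:
                self._idle -= 1
            try:
                self._handler(item)
            finally:
                self._queue.task_done()

    def wait(self):
        """Block until every submitted item has been handled."""
        self._queue.join()


results = []
results_lock = threading.Lock()


def handle(command):
    # Hypothetical command handler: just record the command.
    with results_lock:
        results.append(command)


pool = LazyWorkerPool(max_workers=3, queue_size=2, handler=handle)
for command in range(20):
    pool.submit(command)
pool.wait()
```

In this sketch the producer never creates more than `max_workers` threads, and a full queue throttles the producer instead of dropping commands, which mirrors the overflow behaviour described for the listening module.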

The fingerprint-based data backup system is further characterized in that the backup agent comprises a backup agent initialization module, a request listening module, a job processing module, a file chunking module and a network communication module;

The backup agent initialization module performs initialization: it reads the backup agent configuration file, builds the in-memory resource linked lists, initializes the job queue, and starts the request listening module that listens for backup server requests;

The request listening module listens on the network for connection requests from the backup server and authenticates each connecting backup server. After authentication succeeds, it creates a network connection socket for communicating with that backup server and adds the socket to the job queue;

The job processing module comprises a job queue and M job worker threads. When the job queue overflows, the request listening module goes to sleep. After a job worker thread takes a network connection socket from the job queue, it first creates a job control record for the job and links the socket into a member variable of the job control record. It then interacts with the backup server through this socket, converting the relevant attributes of the backup server's job object and assigning them to the corresponding member variables of the job control record. Next, it uses the job ticket obtained from the backup server to connect to the corresponding storage server, creating a network connection socket for communicating with the storage server and linking it into a member variable of the job control record. When the request listening module adds a network connection socket to the job queue, a new job worker thread is created if no job worker thread is currently idle and the number of active job worker threads has not reached M. Each time a job worker thread takes a network connection socket from the job queue, it checks the state of the request listening module and wakes it if it is asleep;

The file chunking module carries out the file chunking task of a backup job at the command of a job worker thread in the job processing module. It opens each file of the file set on the client file system, divides the file into chunks based on anchors, computes the chunk fingerprints, and cooperates with the corresponding storage server to run the backup algorithm of the first backup process;

The network communication module consists of the network connection sockets of the jobs. Each job on the backup agent has two network connection sockets, used to communicate with the corresponding backup server job and storage server job respectively.

The fingerprint-based data backup system is further characterized in that the storage server comprises a storage server initialization module, a connection listening module, a job ticket table, a job processing module and a network communication module, as well as an index buffer, a chunk buffer, a chunk hash table and a disk log;

The storage server initialization module performs initialization: it parses the storage server configuration file, builds the in-memory resource linked lists, and starts the related service threads;

The connection listening module listens for connection requests from the backup server and from backup agents. It authenticates a connecting backup server and, after authentication succeeds, creates a network connection socket for communicating with that backup server and adds the socket to the job queue. For a connecting backup agent, it authenticates the agent by checking the job ticket that the agent presents against the job ticket table; after authentication succeeds, it creates a network connection socket for communicating with that backup agent and links the socket into a member variable of the corresponding job control record;

The job ticket table stores the tickets with which jobs authenticate backup agents;

The job processing module comprises a job queue and W job worker threads. When the job queue overflows, the connection listening module enters the "refuse backup server connection requests" state. After a job worker thread takes a network connection socket from the job queue, it first creates a job control record for the job and links the socket into a member variable of the job control record. It then interacts with the backup server through this socket, converting the relevant attributes of the backup server's job object and assigning them to the corresponding member variables of the job control record. It also generates a random job ticket, registers the ticket in the job ticket table, and passes the ticket to the backup server's job object. When the connection listening module adds a network connection socket to the job queue, a new job worker thread is created if no job worker thread is currently idle and the number of active job worker threads has not reached W. Each time a job worker thread takes a network connection socket from the job queue, it checks the state of the connection listening module; if the module is in the "refuse backup server connection requests" state, the thread cancels that state so that the module accepts backup server connection requests again;

The network communication module consists of the network connection sockets of the jobs. Each job on the storage server has two network connection sockets, used to communicate with the corresponding backup server job and backup agent job respectively;

The index buffer is infrastructure used by storage server jobs in the first and second backup processes. It is implemented as an in-memory hash table and stores all fingerprints contained in Job_x(t_{n-1}), the job instance that precedes the current job instance Job_x(t_n) in its activity chain, together with the fingerprints newly generated while the current job runs;

The chunk buffer is infrastructure used by storage server jobs in the first and second backup processes. It is implemented on a separate disk array and temporarily stores the data chunks whose fingerprints were not found in the index buffer during the first backup process;

The chunk hash table is infrastructure used by storage server jobs in the second backup process. It is implemented on a separate disk array and maps each chunk fingerprint to the storage address of that chunk in the disk log;

The disk log is infrastructure used by storage server jobs in the second backup process. It is implemented on a separate disk array and stores the data chunks and the file indexes, which are also stored in chunk form.

The advantages of the present invention are:

1. The anchor-based file chunking technique divides a file into variable-size blocks in order to identify redundant data within a file or between files. It is stable under modification: a change to a file affects only the data blocks adjacent to the modified region, and the boundaries of the other data blocks do not move. When a file is backed up incrementally, only the few modified data blocks need to be backed up, and the other data blocks are shared with the previously backed-up file. Anchors are computed with a sliding window, so the computational cost is low.

2. Data chunks are stored on the disk arrays of the storage server indexed by their fingerprints. This ties the storage address of data to its content, departing from the traditional separation of storage address and content, and it eliminates the backup of redundant data and saves disk space.

3. Once stored, a data chunk is never erased, and chunks are appended to disk continuously, which eliminates disk fragmentation. The user's data history is preserved completely, so the user can easily browse any historical version of a file, and important data is protected against accidental deletion.

4. An effective backup buffering strategy reduces the network overhead of backup, increases backup speed, and reduces the impact of backup on application servers.
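The modification stability claimed for anchor-based chunking in the first advantage can be illustrated with a short Python sketch. The rolling hash, the window width, the anchor condition and the use of SHA-1 for fingerprints are all assumptions made for the illustration; this passage names only a sliding window and variable-size blocks, so the sketch should not be read as the exact algorithm of the invention.

```python
import hashlib
import random

WINDOW = 16            # sliding-window width in bytes (assumed)
DIVISOR = 64           # on average one anchor per DIVISOR bytes (assumed)
BASE = 257             # polynomial rolling-hash base (assumed)
MOD = (1 << 31) - 1    # rolling-hash modulus (assumed)


def anchor_chunks(data, window=WINDOW, divisor=DIVISOR):
    """Split data at anchors: positions where the hash of the last
    `window` bytes satisfies a fixed condition."""
    top = pow(BASE, window - 1, MOD)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= window and h % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fingerprints(chunks):
    # SHA-1 gives a 20-byte fingerprint; the choice of SHA-1 is an assumption.
    return [hashlib.sha1(chunk).digest() for chunk in chunks]


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = original[:100] + b"XYZ" + original[100:]   # small insertion

old = set(fingerprints(anchor_chunks(original)))
new = set(fingerprints(anchor_chunks(modified)))
shared = old & new   # chunks that need not be backed up again
```

Because an anchor depends only on the bytes inside the window, an insertion can disturb only the anchors whose window overlaps the edit. All chunks outside that neighbourhood keep their fingerprints, so `shared` covers nearly all of `old`.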

Description of drawings

Fig. 1 is a schematic diagram of the structure of the present invention;

Fig. 2 is a schematic diagram of the structure of the backup server;

Fig. 3 is a schematic diagram of the structure of the backup agent;

Fig. 4 is a schematic diagram of the structure of the storage server;

Fig. 5 is a schematic diagram of how a file is stored in the disk log;

Fig. 6 is a schematic diagram of multiple files sharing data blocks and index blocks in the disk log;

Fig. 7 is a diagram of the structure of the index buffer of the present invention;

Fig. 8 is a schematic diagram of file chunking in the anchor-based file chunking technique.

Detailed description of the embodiments

The present invention is described in more detail below with reference to the drawings and embodiments.

1. Overall system structure

Fig. 1 is a schematic diagram of the system architecture of the present invention. The invention comprises a backup server, backup agents, a storage server and a Web server, which communicate with one another over a network to perform data backup and recovery.

Fig. 2 is a schematic diagram of the structure of the backup server. The backup server comprises a backup server initialization module, a command listening module, a command processing module, a job processing module and a network communication module; it also holds a configuration file and a catalog database.

The backup server is the command center of the whole network backup system; it controls the entire backup and recovery process through job objects. A job object gives the user a way to customize a backup or restore job. It comprises many attributes that specify how the system runs the job. For example, the backup agent attribute specifies the host from which the job backs up data or to which it restores data; the file set attribute specifies the directories the job backs up or restores; and the scheduling policy attribute specifies the policy by which the system schedules the job. Let a job object be denoted Job_x. When the job object is scheduled to run at time t, it produces a running instance Job_x(t). The running instances of Job_x in chronological order, Job_x(t_0), Job_x(t_1), ..., Job_x(t_n) (t_0 < t_1 < ... < t_n), form the activity chain of the job object, denoted Job_x(t_0, t_1, ..., t_n). The backup server also maintains a catalog database that records the management information of Job_x(t). Specifically, the management information of Job_x(t) is stored in the job record Job_x(t).Record of that job in the catalog database.

Catalog database: stores the management information of job runs, namely Job_x(t).Record. Job_x(t).Record mainly stores the root blocks of the files contained in the job and the fingerprint file Job_x(t).FF of the job. For every completed job Job_x(t), the catalog database keeps a fingerprint file Job_x(t).FF, which stores all the fingerprints contained in Job_x(t). Job_x(t_n).FF is used to initialize the index buffer of job Job_x(t_{n+1}).

Fig. 3 is a schematic diagram of the structure of the backup agent. The backup agent comprises a backup agent initialization module, a request listening module, a job processing module, a file chunking module and a network communication module.

Fig. 4 is a schematic diagram of the structure of the storage server. The storage server comprises a storage server initialization module, a connection listening module, a job ticket table, a job processing module and a network communication module, as well as an index buffer, a chunk buffer, a chunk hash table and a disk log.

The storage server manages a large-capacity disk array (RAID) that stores data chunks. Chunks are stored on the disk array indexed by their fingerprints. Once a data chunk is written to disk it is never erased, so the whole disk array behaves like a log: data chunks are appended to disk without gaps, which eliminates disk fragmentation. The disks that store data chunks are called the disk log. The storage server uses a dedicated disk array to hold the chunk hash table, which maps each chunk fingerprint to the storage address of that chunk in the disk log. All data chunks of a backed-up file are indexed by index blocks, and all the index blocks of a file form an index tree. Each file also has a unique root block, which stores the index of the root of the file's index tree along with the file's metadata and some management information. The root block and index blocks of a file are stored in the disk log as data chunks too.

The storage server uses a backup buffering strategy to increase the backup speed of the system. Specifically: (1) An in-memory index buffer stores all fingerprints contained in Job_x(t_{n-1}), the job instance that precedes the current job instance Job_x(t_n) in its activity chain, together with the fingerprints newly generated while the current job runs. (2) A dedicated disk array serves as the chunk buffer, which temporarily stores the data chunks whose fingerprints are not found in the index buffer during backup. (3) The backup of a job is completed in two stages, called the first backup process and the second backup process.

In the first backup process, the backup agent and the storage server interact to back up the file chunks. The index buffer is used to look up chunk fingerprints, and the chunk buffer stores the data chunks whose fingerprints are not found in the index buffer. From the backup agent's point of view, the backup of the job is finished once the first backup process completes. Because this process performs fingerprint lookups in the in-memory index buffer and avoids time-consuming chunk hash table lookups, it is very fast.

The second backup process is run by the storage server when the system is relatively idle. It moves the data chunks held temporarily in the chunk buffer to the disk log, using the chunk hash table for fingerprint lookups, and at the same time builds the index tree of each file in the disk log. Because the second backup process is carried out by the storage server alone in the background, the application servers that run backup agents are not affected.

When a file is restored, the storage server reconstructs the file according to the file index and sends the file data to the corresponding backup agent over the network.
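The two-stage backup described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class `StorageServerSketch`, its method names and the in-memory containers are hypothetical stand-ins for the index buffer, chunk buffer, chunk hash table and disk log; SHA-1 is assumed as the fingerprint function; and index tree construction is left out.

```python
import hashlib


class StorageServerSketch:
    def __init__(self):
        self.index_buffer = set()      # in-memory fingerprints for the current job
        self.chunk_buffer = []         # staging area: (fingerprint, data)
        self.chunk_hash_table = {}     # fingerprint -> offset in the disk log
        self.disk_log = bytearray()    # append-only store

    def start_job(self, previous_job_fingerprints=()):
        # The index buffer is initialized from the previous job instance.
        self.index_buffer = set(previous_job_fingerprints)

    def first_phase_offer(self, fingerprint, read_chunk):
        """Return True if the agent had to send the chunk data."""
        if fingerprint in self.index_buffer:
            return False               # known fingerprint: no data transfer
        self.index_buffer.add(fingerprint)
        self.chunk_buffer.append((fingerprint, read_chunk()))
        return True

    def second_phase(self):
        """Background step: move staged chunks to the disk log."""
        for fingerprint, data in self.chunk_buffer:
            if fingerprint not in self.chunk_hash_table:
                self.chunk_hash_table[fingerprint] = len(self.disk_log)
                self.disk_log += data
        self.chunk_buffer.clear()


def backup(server, chunks):
    """Agent side of the first backup process; returns chunks sent."""
    sent = 0
    for chunk in chunks:
        fingerprint = hashlib.sha1(chunk).digest()
        if server.first_phase_offer(fingerprint, lambda c=chunk: c):
            sent += 1
    return sent


server = StorageServerSketch()
server.start_job()
sent_first = backup(server, [b"alpha", b"beta", b"alpha"])
first_job_fingerprints = set(server.index_buffer)
server.second_phase()

server.start_job(first_job_fingerprints)
sent_second = backup(server, [b"alpha", b"gamma"])
server.second_phase()

# A job with an empty index buffer resends a chunk, but the chunk hash
# table prevents it from being stored twice.
server.start_job()
sent_third = backup(server, [b"beta"])
server.second_phase()
```

The design point the sketch tries to show is the division of labour: the first stage touches only memory and the staging area, while the slower lookup in the chunk hash table is deferred to the background stage.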

Web server: the present invention uses the browser/server (B/S) model to provide a web user interface. From anywhere, the user can log in to the management interface of the system through a Web browser to instruct the system to perform interactive backup or restore jobs, monitor the running status of automatically scheduled jobs, customize jobs, configure the backup server, manage devices, and so on.

2. Storage server disk log

In the present invention, backup data chunks are stored in the disk log of the storage server indexed by their fingerprints. This guarantees that no two identical chunks are stored on disk at the same time, which eliminates the backup of redundant data. Once stored, a chunk is never erased, so chunks can be appended to the disk log continuously, which eliminates disk fragmentation. The data blocks of a backed-up file are indexed by index blocks, and the index blocks of a file are also stored in the disk log.

2.1 Chunk header

For ease of management, a header is prepended to every data chunk. The header provides the information needed for system management, including integrity checking and the reconstruction of file indexes and of the chunk hash table. The header is 39 bytes in total and consists of the following fields:

magic: a 6-character header identifier;

fingerprint: the fingerprint of this chunk, 20 bytes in total;

type: the type of this chunk. There are three chunk types, namely data blocks, index blocks and file root blocks, denoted dc, ic and rc respectively;

size: the size of this chunk, excluding the header. The system requires that an index block be no larger than 16 KB;

offset: the storage address of this chunk in the disk log.
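One possible byte layout for this header is sketched below in Python. The text fixes only the total of 39 bytes, the 6-character magic and the 20-byte fingerprint; the split of the remaining 13 bytes into a 1-byte type code, a 4-byte size and an 8-byte offset is an assumption, as are the magic value, the numeric type codes and the use of SHA-1.

```python
import hashlib
import struct

# Big-endian, no padding: 6 + 20 + 1 + 4 + 8 = 39 bytes (field widths for
# type, size and offset are assumed, not taken from the text).
HEADER = struct.Struct(">6s20sBIQ")

MAGIC = b"CHUNK\x00"                       # hypothetical magic value
TYPE_CODES = {"dc": 0, "ic": 1, "rc": 2}   # hypothetical numeric codes
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}


def pack_header(payload, chunk_type, offset):
    """Build the header for a chunk that will be stored at `offset`."""
    fingerprint = hashlib.sha1(payload).digest()
    return HEADER.pack(MAGIC, fingerprint, TYPE_CODES[chunk_type],
                       len(payload), offset)


def unpack_header(raw):
    """Parse a header and check its magic identifier."""
    magic, fingerprint, code, size, offset = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not a chunk header")
    return {"fingerprint": fingerprint, "type": TYPE_NAMES[code],
            "size": size, "offset": offset}


header = pack_header(b"hello", "dc", 4096)
parsed = unpack_header(header)
```

Because every header carries the fingerprint, type, size and offset of its chunk, a sequential scan of the log would be enough to rebuild the chunk hash table, which matches the stated purpose of the header.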

2.2 File index

Fig. 5 shows the storage structure of a file in the disk log. The data blocks of a file are indexed by index blocks, which are also stored in the disk log, and all the index blocks of a file form an index tree. Each file has a unique root block in the disk log; the root block stores the index of the root of the file's index tree, along with the file's metadata and some management information. After a file has been backed up, its root block is also stored in the job record of the catalog database as management information of the job. In Fig. 5, F_0 denotes a file, D_i denotes a data block and I_i denotes an index block. An index block is made up of index entries. P(X) denotes an index entry, which is a triple <H(X), offset, type>, where X is the indexed chunk, H(X) is the fingerprint of X, offset is the storage address of X in the disk log, and type is the type of X. X may be an index block I_i or a data block D_i. The arrows in the figure show the correspondence between an indexed chunk and its index entry. M(F_0) denotes the metadata and some management information of file F_0. Index blocks I_0, I_1 and I_2 form the index tree of F_0, and I_0 is the root of this index tree. R_0 denotes the root block of F_0; it consists of M(F_0) and the index entry P(I_0), which points to the root I_0 of the file's index tree. All data blocks and index blocks in the disk log can be shared by different files. Fig. 6 shows different files sharing data blocks and index blocks; the symbols in that figure have the same meaning as in Fig. 5.
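The index tree and the sharing of blocks between files can be illustrated with a small Python sketch. Everything here is a simplification for illustration: the log is an in-memory list whose offsets are list positions, index blocks are tuples of index entries, the fan-out of two entries per index block and the use of SHA-1 are assumptions, and `DiskLogSketch`, `backup_file` and `restore` are hypothetical names.

```python
import hashlib
from typing import NamedTuple


class IndexEntry(NamedTuple):
    fingerprint: bytes   # H(X)
    offset: int          # position of X in the log
    type: str            # "dc", "ic" or "rc"


class DiskLogSketch:
    """Append-only list standing in for the disk log."""

    def __init__(self):
        self.blocks = []
        self._known = {}   # fingerprint -> IndexEntry

    def append(self, block_type, payload):
        raw = payload if isinstance(payload, bytes) else repr(payload).encode()
        fingerprint = hashlib.sha1(raw).digest()
        if fingerprint in self._known:         # identical block: share it
            return self._known[fingerprint]
        entry = IndexEntry(fingerprint, len(self.blocks), block_type)
        self.blocks.append(payload)
        self._known[fingerprint] = entry
        return entry

    def read(self, entry):
        return self.blocks[entry.offset]


def backup_file(log, metadata, chunks, fanout=2):
    """Store data blocks, build the index tree bottom-up, return P(R)."""
    level = [log.append("dc", chunk) for chunk in chunks]
    while True:
        groups = [tuple(level[i:i + fanout])
                  for i in range(0, len(level), fanout)] or [()]
        level = [log.append("ic", group) for group in groups]
        if len(level) == 1:
            break
    # Root block: metadata plus the index entry of the tree root.
    return log.append("rc", (metadata, level[0]))


def read_tree(log, entry):
    payload = log.read(entry)
    if entry.type == "dc":
        return payload
    return b"".join(read_tree(log, child) for child in payload)


def restore(log, root_entry):
    metadata, tree_root = log.read(root_entry)
    return metadata, read_tree(log, tree_root)


log = DiskLogSketch()
root_a = backup_file(log, "F0", [b"aa", b"bb", b"cc"])
root_b = backup_file(log, "F1", [b"aa", b"bb", b"dd"])
```

In this run the second file reuses two data blocks and one index block of the first file, so the log grows by four blocks instead of seven.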

3. Storage server chunk hash table

The chunk hash table of the storage server maps each chunk fingerprint to the storage address of that chunk in the disk log. The chunk hash table is made up of buckets of equal size. The number of buckets is determined by the size of the disk log: the larger the capacity of the disk log, the more buckets the chunk hash table contains, which lowers the probability of bucket collisions. Depending on the number of buckets, the system takes the first n bits of a fingerprint as the bucket number and maps the fingerprint to the corresponding bucket. Each fingerprint is stored in its bucket as a triple <fingerprint, offset, type>, where fingerprint is the fingerprint of the chunk, offset is the storage address in the disk log of the chunk that corresponds to this fingerprint, and type is the type of that chunk. If a hash collision occurs in a bucket, the triple of the fingerprint is stored in an adjacent bucket.
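A minimal Python sketch of such a table follows. It reads "collision" as a bucket that is already full and "adjacent bucket" as the next bucket in sequence, wrapping around at the end; both readings, as well as the class name, the tiny bucket count and the bucket capacity, are assumptions made for the illustration. The real table resides on a disk array, whereas the sketch keeps it in memory.

```python
class ChunkHashTableSketch:
    def __init__(self, bucket_bits=4, bucket_capacity=2):
        self.bucket_bits = bucket_bits            # n in the text
        self.bucket_capacity = bucket_capacity    # triples per bucket (assumed)
        self.buckets = [[] for _ in range(1 << bucket_bits)]

    def bucket_no(self, fingerprint):
        """The first n bits of the fingerprint select the bucket."""
        value = int.from_bytes(fingerprint, "big")
        return value >> (len(fingerprint) * 8 - self.bucket_bits)

    def _probe(self, fingerprint):
        start = self.bucket_no(fingerprint)
        for step in range(len(self.buckets)):
            yield self.buckets[(start + step) % len(self.buckets)]

    def lookup(self, fingerprint):
        for bucket in self._probe(fingerprint):
            for triple in bucket:
                if triple[0] == fingerprint:
                    return triple
            if len(bucket) < self.bucket_capacity:
                return None    # entries are never erased, so the probe ends here
        return None

    def insert(self, fingerprint, offset, chunk_type):
        for bucket in self._probe(fingerprint):
            for triple in bucket:
                if triple[0] == fingerprint:
                    return triple              # already stored: keep old address
            if len(bucket) < self.bucket_capacity:
                triple = (fingerprint, offset, chunk_type)
                bucket.append(triple)
                return triple
        raise RuntimeError("chunk hash table is full")


table = ChunkHashTableSketch(bucket_bits=4, bucket_capacity=1)
fp_a = bytes([0x70]) + bytes(19)   # 20-byte fingerprints whose first 4 bits are 7
fp_b = bytes([0x71]) + bytes(19)
fp_c = bytes([0x72]) + bytes(19)
table.insert(fp_a, 0, "dc")
table.insert(fp_b, 100, "dc")      # bucket 7 is full, so this goes to bucket 8
```

Since chunks are never erased, a bucket that is not full marks the end of a probe sequence, which keeps lookups of unknown fingerprints short.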

4. Storage server index buffer

Figure 7 shows the structure of the index buffer. The index buffer is an in-memory hash table made up of a bucket array and a number of linked lists. The bucket array has 1024*1024 buckets in total, numbered from 00000H to FFFFFH. A bucket may be empty; a non-empty bucket holds a pointer to its linked list, and the entries of that list store the information of the fingerprints hashed to this bucket. When hashing, the first 20 bits of a fingerprint are used as the bucket number, and the fingerprint is placed in the linked list that the corresponding bucket points to.

The structure of a linked-list entry is:

Tag: identifier, 4 bits, indicating the state of this fingerprint in the first backup process and the second backup process;

FingerprintTail: the last 140 bits of the fingerprint of the chunk; because the first 20 bits are implied by the bucket number, only the last 140 bits need to be stored here;

Offset: storage address, 64 bits; if this field is non-null, it is the storage address on Disk Logs of the chunk corresponding to this fingerprint;

Next: 32 bits, a pointer to the next entry.

The part of Fig. 7 labeled "fingerprint" shows how a fingerprint 7E54F36A4EC62...3B is hashed into the index buffer. Step (1) takes the first 20 bits of the fingerprint, "7E54F", as the bucket number (bucketNo) and finds the bucket numbered 7E54FH. Step (2) searches the linked list of this bucket for an entry whose fingerprintTail is "36A4EC62...3B". If such an entry is found, fingerprint 7E54F36A4EC62...3B is already stored in the index buffer; if not, a new entry is created to store the information of this fingerprint.

The tag of a linked-list entry in the index buffer takes one of three values, with the following meanings:

0000: the fingerprint comes from the fingerprint file of the previous job and has not been hit in the current backup process;

1000: the fingerprint comes from the fingerprint file of the previous job and has been hit in the current backup process;

1100: the fingerprint was newly produced in the current backup process.
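
The bucket selection, the entry layout and the three tag values can be sketched together. The following Python fragment is an illustration under assumptions: a dictionary keyed by bucket number stands in for the array of 2^20 buckets, a Python list replaces the Next pointer chain, and the names `Item`, `split` and `IndexCache` are hypothetical. The fingerprint used in the example is made up; it only shares its first five hexadecimal digits with the one in Fig. 7.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TAG_OLD_UNHIT = 0b0000  # from the previous job, not hit in this backup
TAG_OLD_HIT = 0b1000    # from the previous job, hit in this backup
TAG_NEW = 0b1100        # newly produced in this backup

FP_BITS, BUCKET_BITS = 160, 20
TAIL_BITS = FP_BITS - BUCKET_BITS  # 140


@dataclass
class Item:
    tag: int
    tail: int              # last 140 bits of the fingerprint
    offset: Optional[int]  # None plays the role of a null offset


def split(fp: bytes) -> Tuple[int, int]:
    """Split a 160-bit fingerprint into (bucketNo, fingerprintTail)."""
    if len(fp) * 8 != FP_BITS:
        raise ValueError("expected a 160-bit fingerprint")
    value = int.from_bytes(fp, "big")
    return value >> TAIL_BITS, value & ((1 << TAIL_BITS) - 1)


class IndexCache:
    def __init__(self) -> None:
        # Sparse stand-in for the bucket array; a missing key is an empty bucket.
        self.buckets: Dict[int, List[Item]] = {}

    def find(self, fp: bytes) -> Optional[Item]:
        bucket_no, tail = split(fp)
        for item in self.buckets.get(bucket_no, []):
            if item.tail == tail:
                return item
        return None

    def add(self, fp: bytes, tag: int, offset: Optional[int] = None) -> Item:
        bucket_no, tail = split(fp)
        item = Item(tag, tail, offset)
        self.buckets.setdefault(bucket_no, []).append(item)
        return item


fp = bytes.fromhex("7e54f" + "0" * 34 + "1")  # made-up fingerprint in bucket 7E54FH
cache = IndexCache()
cache.add(fp, TAG_OLD_UNHIT, offset=4096)
```

Storing only the 140-bit tail saves 20 bits per entry, which matters when the buffer holds every fingerprint of a job in memory.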

After a backup job Job_x(t_{n-1}) finishes, all fingerprints contained in this job are stored as pairs <fingerprint, offset> (where fingerprint is the fingerprint of a chunk and offset is the storage address of the chunk on Disk Logs) in the file Job_x(t_{n-1}).FF, which is stored in the job record Job_x(t_{n-1}).Record of the catalog database. Job_x(t_{n-1}).FF is used to initialize the index buffer of job Job_x(t_n). Because adjacent jobs in the same job chain usually share a large number of files or data, initializing the index buffer of job Job_x(t_n) with Job_x(t_{n-1}).FF improves the fingerprint hit rate of the buffer.

5. Backup process

For simplicity, the following notation is defined:

BS: backup server job thread;

BA: backup agent job thread;

SS: storage server job thread;

F: a file;

H: a fingerprint;

M(F): the metadata of file F;

R(F): the root block of file F;

H(D): the fingerprint of chunk D;

D(H): the data block or index block corresponding to fingerprint H;

F.Index: the memory buffer in which the index tree of file F is built;

Index cache: the index buffer;

Chunk cache: the chunk buffer;

Hash table: the chunk hash table;

Job_x(t_n).FileSet: the file set of job object Job_x(t_n);

I(F, level): the set of index blocks in layer level of the index tree F.Index. The leaves of the index tree are defined as layer 0, the parents of the leaf nodes are layer 1 of the tree, and so on;

I_w(F, level): the working node in I(F, level) that is currently used to store triples <H, offset, type>;

<H, offset, type>: a triple, where H is a fingerprint, offset is the storage address of chunk D(H) on Disk Logs, and type is the type of chunk D(H);

5.1 First backup process

The first backup process is carried out mainly by the backup agent job thread and the storage server job thread working together. The steps are:

(1) SS: initialize the index cache with Job_x(t_{n-1}).FF;

(2) BA: if (Job_x(t_n).FileSet is empty) go to (19); else read a file F_i from Job_x(t_n).FileSet;

(3) BA: send M(F_i) to SS;

(4) SS: cache M(F_i) in the chunk cache;

(5) BA: perform anchor-based file chunking on F_i;

(6) BA: compute the fingerprint of each chunk and send the set formed by these fingerprints to SS;

(7) SS: if (the fingerprint set is empty) go to (17); else take a fingerprint H_j out of the fingerprint set and look it up in the index cache;

(8) SS: if (H_j is found in the index cache) {

(9) SS: if (tag == 0000) { tag = 1000; cache <H_j, offset> in the chunk cache; }

(10) SS: else if (tag == 1000) cache <H_j, offset> in the chunk cache;

(11) SS: else if (tag == 1100) cache <H_j, null> in the chunk cache; }

(12) SS: else { cache H_j in the index cache with tag = 1100 and offset = null;

(13) SS: request BA to send D(H_j);

(14) BA: send D(H_j) to SS;

(15) SS: cache <H_j, D(H_j)> in the chunk cache; }

(16) SS: return to step (7);

(17) SS: notify BA to back up the next file;

(18) BA: return to step (2);

(19) BA: report the end status of job Job_x(t_n) to BS and SS, then exit;

(20) SS: after receiving the end-of-job signal from BA, end the first backup process and enter the second backup process;

(21) BS: after receiving the end-of-job signal from BA, disconnect from BA and wait for SS to carry out the second backup process.
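
The storage-server side of steps (1) and (7) to (16) can be sketched as below. This is a simplified Python illustration under assumptions: a plain dictionary replaces the index cache, the network exchange of steps (13) and (14) is reduced to a callback `fetch_chunk`, and each cached tuple carries an explicit kind label ("ref", "data" or "dup") so that <H, offset>, <H, D(H)> and <H, null> can be told apart. None of these names come from the text above.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

OLD_UNHIT, OLD_HIT, NEW = 0b0000, 0b1000, 0b1100

IndexCache = Dict[bytes, list]  # fingerprint -> [tag, offset]


def init_index_cache(ff_pairs: Iterable[Tuple[bytes, int]]) -> IndexCache:
    # Step (1): every fingerprint of the previous job starts as "not yet hit".
    return {fp: [OLD_UNHIT, offset] for fp, offset in ff_pairs}


def first_pass(
    fingerprints: Iterable[bytes],
    index_cache: IndexCache,
    chunk_cache: List[Tuple[str, bytes, Optional[object]]],
    fetch_chunk: Callable[[bytes], bytes],
) -> None:
    """Steps (7) to (16) for the fingerprint set of one file."""
    for fp in fingerprints:
        entry = index_cache.get(fp)
        if entry is not None:
            tag, offset = entry
            if tag == OLD_UNHIT:
                entry[0] = OLD_HIT                       # step (9)
                chunk_cache.append(("ref", fp, offset))
            elif tag == OLD_HIT:
                chunk_cache.append(("ref", fp, offset))  # step (10)
            else:
                # Step (11): the data was already requested earlier in this job,
                # but its address is not known until the second backup process.
                chunk_cache.append(("dup", fp, None))
        else:
            index_cache[fp] = [NEW, None]                # step (12)
            chunk_cache.append(("data", fp, fetch_chunk(fp)))  # steps (13)-(15)


requested: List[bytes] = []


def fetch(fp: bytes) -> bytes:
    requested.append(fp)   # stands in for asking the backup agent for D(H)
    return b"data-" + fp


cache = init_index_cache([(b"A", 100)])
out: list = []
first_pass([b"A", b"B", b"B", b"A"], cache, out, fetch)
```

In this run only chunk `B` crosses the network, and only once, although it occurs twice in the file; chunk `A` is recognized from the previous job and never transferred.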

5.1.1 Anchor-based file chunking

In step (5) of the first backup process, anchor-based file chunking is performed by the backup agent job thread calling the file chunking module of the backup agent. The steps are:

(1) Take the first 48 bytes of the file, b₁, b₂, ..., b₄₈, as a window and compute the hash value of the first window of the file with the formula H₁ = (b₁*p⁴⁷ + b₂*p⁴⁶ + ... + b₄₈) mod M, where p is a prime number, for example 17, and M is a constant, for example 2³². The hash value is stored in the variable H₁.

(2) Slide the window one byte toward the end of the file and compute the hash value of the second window of the file, b₂, b₃, ..., b₄₉, with the formula H₂ = (p*H₁ + b₄₉ - b₁*p⁴⁸) mod M; the result is stored in the variable H₂.

(3) Continue in the same way to compute the hash values of all windows of the file.

(4) For the hash value of each window, take its low 13 bits to form a binary number; if this number equals a predetermined value (for example 61), the corresponding window is determined to be an anchor.

(5) Divide the file into data blocks of varying size, using the anchors as boundaries.
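
The recurrence of step (2) can be checked against the direct formula of step (1). The following Python sketch uses the example values p = 17 and M = 2³² from the text; the function names are illustrative.

```python
from typing import Iterator, Tuple

WINDOW = 48
P = 17
M = 1 << 32


def direct_hash(window: bytes) -> int:
    # H = (b1*p^47 + b2*p^46 + ... + b48) mod M, evaluated by Horner's rule.
    h = 0
    for b in window:
        h = (h * P + b) % M
    return h


def rolling_hashes(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (window start, hash) for every 48-byte window, O(1) work per slide."""
    if len(data) < WINDOW:
        return
    top = pow(P, WINDOW, M)  # p^48 mod M, the weight of the byte that leaves
    h = direct_hash(data[:WINDOW])
    yield 0, h
    for i in range(WINDOW, len(data)):
        # H_next = (p*H_prev + b_in - b_out*p^48) mod M
        h = (P * h + data[i] - data[i - WINDOW] * top) % M
        yield i - WINDOW + 1, h


data = bytes(range(256)) * 2
mismatches = [s for s, h in rolling_hashes(data) if h != direct_hash(data[s:s + WINDOW])]
```

Multiplying H₁ by p raises every term by one power, so the leaving byte b₁ ends up with weight p⁴⁸; subtracting b₁*p⁴⁸ and adding b₄₉ leaves exactly the polynomial of the next window.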

The anchor-based file chunking described above follows three conventions: a) if the file is smaller than 48 bytes, the anchor-based chunking algorithm is skipped and the whole file is one data block; b) if a stretch of the byte stream contains too many anchors, some anchors are discarded so that no chunk is smaller than 2KB (the chunk at the end of the file is the only one that may be smaller than 2KB); c) if 64KB of consecutive bytes contain no anchor, these 64KB are taken as one chunk.
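
The five steps and the three conventions can be combined into one pass over the file. The sketch below is a minimal Python illustration, assuming that a chunk boundary is placed immediately after the last byte of an anchor window and that the window keeps sliding across boundaries; the text does not fix either detail. The constants follow the example values given above.

```python
from typing import List

WINDOW, P, M = 48, 17, 1 << 32
MASK, MAGIC = (1 << 13) - 1, 61          # low 13 bits must equal 61
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024


def chunk_boundaries(data: bytes) -> List[int]:
    """Return the exclusive end offset of every chunk of data."""
    n = len(data)
    if n == 0:
        return []
    if n < WINDOW:
        return [n]                        # convention a): whole file is one block
    top = pow(P, WINDOW, M)
    h = 0
    for b in data[:WINDOW]:               # hash of the first window
        h = (h * P + b) % M
    cuts: List[int] = []
    start, end = 0, WINDOW                # current chunk start, exclusive window end
    while True:
        size = end - start
        is_anchor = (h & MASK) == MAGIC
        # Convention b): anchors that would give a chunk under 2KB are ignored.
        # Convention c): force a boundary once 64KB have passed without one.
        if (is_anchor and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            cuts.append(end)
            start = end
        if end == n:
            break
        h = (P * h + data[end] - data[end - WINDOW] * top) % M
        end += 1
    if start < n:
        cuts.append(n)                    # trailing chunk, may be under 2KB
    return cuts


print(chunk_boundaries(bytes(70000)))     # [65536, 70000]
```

An all-zero input never hashes to 61, so the example exercises convention c): one forced 64KB chunk followed by the remainder of the file.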

The anchor-based file chunking of the present invention has two characteristics. (1) It is stable under modification: a modification to a file affects only the data blocks adjacent to the modified region, and the boundaries of the other data blocks do not move. Thus, when a file is backed up incrementally, only the few modified data blocks need to be backed up, and the other data blocks can be shared with the earlier backup of the file. Modification stability also ensures that data similarity within a file and between files is not missed because of shifted offsets, so the duplicate data of a file is detected to the greatest possible extent. (2) The sliding window is cheap to compute: the hash value of the next window is easily derived from the hash value of the previous window, so anchor-based file chunking has low computational overhead. The time complexity of the whole algorithm is O(n), where n is the number of bytes in the file.

Figure 8 shows how the chunking of a file changes when the file is edited after it has been chunked. As the figure shows, anchor-based file chunking is stable under modification: a modification to a file affects only the data blocks adjacent to the modified region, and the boundaries of the other data blocks do not move. Row a shows a file divided by anchors into eight blocks of different sizes, B₁ to B₈; the serrated part at the boundary of each block is the 48-byte anchor. Rows b, c and d show how the chunking changes after the 1st, 2nd and 3rd modifications to the file; the shaded parts are the parts that were modified. Row b: the 1st modification falls inside block B₄. No new block is produced; block B₄ simply becomes block B₉, and no other block changes. Backing up the file now only requires backing up block B₉ in place of the original block B₄. Row c: the 2nd modification falls inside block B₅ and produces a new anchor, so block B₅ is split into two blocks, B₁₀ and B₁₁; no other block changes. Backing up the file now only requires backing up blocks B₁₀ and B₁₁ in place of the original block B₅. Row d: the 3rd modification falls on the boundary between blocks B₂ and B₃ and destroys the anchor between them, so the two blocks merge into one block B₁₂. Backing up the file now only requires backing up block B₁₂ in place of the original blocks B₂ and B₃.

5.2 Second backup process

The second backup process is carried out mainly by the storage server job thread when the system is relatively idle. The steps are:

(1) SS: if (Job_x(t_n).FileSet is empty) go to (19); else take a file name F_i from Job_x(t_n).FileSet;

(2) SS: create the memory buffer F_i.Index for file F_i, create R(F_i) in F_i.Index, then store M(F_i) from the chunk cache into R(F_i);

(3) SS: if (the chunk cache holds no tuple related to F_i) go to (14); else read one tuple related to F_i from the chunk cache;

(4) SS: if (it is <H_j, offset>) go to step (12);

(5) SS: else if (it is <H_j, D(H_j)>) {

(6) SS: look up H_j in the hash table;

(7) SS: if (found) { write the "offset" value into the entry corresponding to H_j in the index cache; go to step (12); }

(8) SS: else { append D(H_j) to Disk Logs and update the hash table at the same time;

(9) SS: write the "offset" value into the entry corresponding to H_j in the index cache; go to step (12); } }

(10) SS: else if (it is <H_j, null>)

(11) SS: read the "offset" value from the entry corresponding to H_j in the index cache;

(12) SS: insert(<H_j, offset, dc>, 0, F_i.Index);

(13) SS: return to step (3);

(14) SS: storeRemain(F_i.Index, R(F_i));

(15) SS: append R(F_i) to Disk Logs and update the hash table at the same time;

(16) SS: send R(F_i) to BS;

(17) BS: send R(F_i) to the catalog database, where it is stored in Job_x(t_n).Record;

(18) SS: return to step (1);

(19) SS: create the file Job_x(t_n).FF;

(20) SS: read the index cache and, for each entry that satisfies (tag == 1000 or tag == 1100), write <H, offset> to the file Job_x(t_n).FF;

(21) SS: send the file Job_x(t_n).FF to BS;

(22) BS: send the file Job_x(t_n).FF to the catalog database, where it is stored in Job_x(t_n).Record;

(23) SS: report the end status of job Job_x(t_n) to BS;

(24) BS: disconnect from SS, write the end status of job Job_x(t_n) into Job_x(t_n).Record in the catalog database, and end the run of job Job_x(t_n).
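
Steps (3) to (12) and steps (19) to (20) can be sketched as follows. This Python fragment is a simplified illustration under assumptions: a `bytearray` stands in for Disk Logs, a dictionary from fingerprint to offset stands in for the chunk hash table, the index cache maps a fingerprint to `[tag, offset]`, and each cached tuple carries a kind label ("ref" for <H, offset>, "data" for <H, D(H)>, "dup" for <H, null>). The index tree construction of steps (12) and (14) is left out and the resolved triples are simply returned.

```python
from typing import Dict, List, Optional, Tuple

OLD_UNHIT, OLD_HIT, NEW = 0b0000, 0b1000, 0b1100


def second_pass(
    chunk_cache: List[Tuple[str, bytes, Optional[object]]],
    index_cache: Dict[bytes, list],
    hash_table: Dict[bytes, int],
    disk_log: bytearray,
) -> List[Tuple[bytes, int, str]]:
    """Resolve every cached tuple of one file to a triple <H, offset, dc>."""
    resolved = []
    for kind, fp, payload in chunk_cache:
        if kind == "ref":                  # <H, offset>: address already known
            offset = payload
        elif kind == "data":               # <H, D(H)>
            if fp in hash_table:           # stored by an earlier job: do not store again
                offset = hash_table[fp]
            else:
                offset = len(disk_log)     # append-only: address is the current end
                disk_log.extend(payload)
                hash_table[fp] = offset
            index_cache[fp][1] = offset    # steps (7) and (9)
        else:                              # "dup": <H, null>, steps (10) and (11)
            offset = index_cache[fp][1]
        resolved.append((fp, offset, "dc"))
    return resolved


def build_ff(index_cache: Dict[bytes, list]) -> List[Tuple[bytes, int]]:
    # Steps (19) and (20): keep only fingerprints hit or created by this job.
    return [(fp, off) for fp, (tag, off) in index_cache.items() if tag in (OLD_HIT, NEW)]


index_cache = {b"A": [OLD_HIT, 100], b"B": [NEW, None], b"C": [OLD_UNHIT, 7]}
chunk_cache = [("ref", b"A", 100), ("data", b"B", b"hello"), ("dup", b"B", None)]
disk_log = bytearray(b"x" * 200)
hash_table: Dict[bytes, int] = {}
triples = second_pass(chunk_cache, index_cache, hash_table, disk_log)
ff = build_ff(index_cache)
```

Fingerprint `C` was inherited from the previous job but never hit, so it is dropped from the new fingerprint file; this keeps the file that seeds the next job limited to chunks the current job actually uses.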

In the above algorithm, the two functions used in steps (12) and (14) are as follows:

Algorithm of step (12)

insert(<H, offset, type>, level, F.Index)

// Store the triple <H, offset, type> into F.Index.

// level: the layer number, in the index tree F.Index, of the index node that stores the triple <H, offset, type>.

{

if (I(F, level) = ∅)

{ create I_w(F, level); store <H, offset, type> into I_w(F, level); return; }

else if (I_w(F, level) is not full)

{ store <H, offset, type> into I_w(F, level); return; }

else if (I_w(F, level) is full)

{ compute H(I_w(F, level));

look up H(I_w(F, level)) in the hash table;

if (not found)

append I_w(F, level) to Disk Logs and update the hash table at the same time;

insert(<H(I_w(F, level)), offset, ic>, level+1, F.Index);

create a new index node I_w(F, level);

store <H, offset, type> into I_w(F, level); return; }

}

Algorithm of step (14)

storeRemain(F.Index, R(F))

{ // Store the working index node of each layer of F.Index to Disk Logs.

int level := 0;

loop: compute H(I_w(F, level));

look up H(I_w(F, level)) in the hash table;

if (not found)

append I_w(F, level) to Disk Logs and update the hash table at the same time;

if (|I(F, level)| = 1)

{ store <H(I_w(F, level)), offset, ic> into R(F); return; }

else

{ insert(<H(I_w(F, level)), offset, ic>, level+1, F.Index);

level := level+1; goto loop; }

}
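
The two functions above can be exercised with a small in-memory model. The following Python sketch is an illustration under assumptions: the class name `IndexTreeBuilder`, the `fanout` limit that decides when an index node is full, SHA-1 over the node's textual representation as its fingerprint, and a Python list standing in for Disk Logs are all hypothetical choices. `store_remain` returns the root entry instead of writing it into R(F).

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Triple = Tuple[bytes, int, str]  # <H, offset, type>


class IndexTreeBuilder:
    def __init__(self, disk_log: list, hash_table: Dict[bytes, int], fanout: int = 4) -> None:
        self.disk_log = disk_log          # stored blocks; offset = list position
        self.hash_table = hash_table      # fingerprint -> offset
        self.fanout = fanout              # assumption: entries per index node
        self.work: List[List[Triple]] = []  # work[level] is I_w(F, level)
        self.count: List[int] = []          # count[level] is |I(F, level)|

    def _store(self, node: List[Triple]) -> Tuple[bytes, int]:
        """Append an index node unless an identical one is already stored."""
        fp = hashlib.sha1(repr(node).encode()).digest()
        if fp not in self.hash_table:
            self.disk_log.append(list(node))
            self.hash_table[fp] = len(self.disk_log) - 1
        return fp, self.hash_table[fp]

    def insert(self, triple: Triple, level: int = 0) -> None:
        if level == len(self.work):       # I(F, level) is empty
            self.work.append([triple])
            self.count.append(1)
            return
        node = self.work[level]
        if len(node) < self.fanout:       # working node is not full
            node.append(triple)
            return
        fp, offset = self._store(node)    # working node is full: flush it
        self.insert((fp, offset, "ic"), level + 1)
        self.work[level] = [triple]       # new working node for this layer
        self.count[level] += 1

    def store_remain(self) -> Optional[Triple]:
        """Flush the working node of each layer and return the root entry."""
        if not self.work:
            return None                   # a file with no chunks has no tree
        level = 0
        while True:
            fp, offset = self._store(self.work[level])
            if self.count[level] == 1:    # single node in this layer: it is the root
                return (fp, offset, "ic")
            self.insert((fp, offset, "ic"), level + 1)
            level += 1


disk_log: list = []
hash_table: Dict[bytes, int] = {}
builder = IndexTreeBuilder(disk_log, hash_table, fanout=2)
for i in range(5):
    builder.insert((bytes([i]), i, "dc"))
root = builder.store_remain()
```

With a fanout of 2 and five data entries, the sketch stores three leaf nodes, two nodes in layer 1 and one root in layer 2. Building the same file a second time against the same log adds nothing, because every node's fingerprint is already in the hash table.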

Claims

1. A fingerprint-based data backup system, comprising a backup server, a backup agent, a storage server and a Web server, which communicate with one another over a network to complete data backup and recovery, characterized in that:

2. The fingerprint-based data backup system according to claim 1, characterized in that the backup server comprises a backup server initialization module, a command monitoring module, a command processing module, a job processing module and a network communication module;

The backup server initialization module performs initialization, including reading the configuration file, building the resource linked list in memory, checking the state of the catalog database, ensuring the consistency and integrity of the data in the configuration file and the catalog database, starting the command monitoring port to accept user commands from the Web server, initializing the job queue and the user command queue, loading job objects into the job queue, and starting the job and network monitoring services;

3. The fingerprint-based data backup system according to claim 1, characterized in that the backup agent comprises a backup agent initialization module, a request monitoring module, a job processing module, a file chunking module and a network communication module;

4. The fingerprint-based data backup system according to claim 1, characterized in that the storage server comprises a storage server initialization module, a connection monitoring module, a job list table, a job processing module and a network communication module, as well as an index buffer, a chunk buffer, a chunk hash table and Disk Logs;