CN108804253A

CN108804253A - Parallel job backup method for mass data backup

Info

Publication number: CN108804253A
Application number: CN201710301054.5A
Authority: CN
Inventors: 姚秋玲; 陈德清
Original assignee: Institute of High Energy Physics of CAS
Current assignee: Institute of High Energy Physics of CAS
Priority date: 2017-05-02
Filing date: 2017-05-02
Publication date: 2018-11-13
Anticipated expiration: 2037-05-02
Also published as: CN108804253B

Abstract

The invention discloses a parallel job backup method for mass data backup. The method is as follows: 1) multiple backup nodes are selected to form a backup cluster, and every backup node has a unified configuration; 2) a terminal selects one backup node as the archive management server and starts the backup policy of the object to be backed up; 3) the archive management server selects one backup node as the job scheduler, then obtains the directory structure of the backup object level by level, generating one scan job for each directory obtained; 4) the archive management server submits each scan job and its working path to the job scheduler, which sends the job to a backup node that scans the target directory named in the scan job; 5) the archive management server selects the files to be backed up and generates several file sublists, then generates one copy job for each sublist and sends it to the job scheduler; 6) the job scheduler sends different copy jobs to different backup nodes, which copy the files to be backed up to the corresponding location.

Description

Parallel job backup method for mass data backup

Technical field

The present invention relates to a data backup method, and more particularly to a parallel job backup method for mass data backup.

Background art

Data is vital to any enterprise, department, organization, or individual. Data can be lost or corrupted for many reasons, such as equipment failure, viruses and hacker attacks, or human error, and the resulting loss can be incalculable, which makes data backup extremely important. Data backup is a data security technique: a copy of critical data is made so that, when a failure occurs, the data can be restored by the backup software and the loss is avoided.

With the continuous development of information technology, emerging technologies such as cloud computing, the Internet of Things, and social networks have caused the variety and scale of data worldwide to grow explosively. By the end of 2012, data volumes had risen from the TB level (1 TB = 1024 GB) to the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB), and even ZB (1 ZB = 1024 EB) level. The arrival of the big data era has also driven rapid growth in backup demand, and data volumes of TB scale and larger bring new challenges to data backup.

In addition, data storage has become more diverse: there are traditional structured relational databases, unstructured non-relational databases, and distributed file systems represented by GFS and HDFS. As data volumes and data types increase, backing up this data becomes increasingly complex and time-consuming.

Faced with mass data, the main goal in designing and studying backup systems is to make full use of software and hardware resources, satisfy different backup requirements, and complete data backup and recovery quickly and effectively. Existing backup software has several problems:

1. It is not designed for backing up mass data. The most important step in a backup is copying the backup object to another machine. Many backup programs copy and transfer data as a single data stream, so they are limited by server or network bandwidth and cannot increase backup speed or capacity. They perform well when backing up thousands or tens of thousands of files, but for mass data containing tens or hundreds of millions of files they need days or even weeks and cannot complete the backup task within an acceptable time.

2. The backup system may have a single point of failure. Some backup systems deploy several backup servers, but each server is responsible for different backup services. Once a server fails, the backup and recovery services defined on it can no longer continue.

3. For security reasons, backup software often uses a custom storage format, so the backup files depend on the backup software. When the software fails, the backup files cannot be used, and having a backup is then no better than having none.

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a parallel job backup method for mass data backup. A backup cluster is built, and parallel jobs are submitted to obtain the list of mass data that needs to be backed up; parallel backup jobs are then submitted according to a customized backup policy, and the backup files are saved in a standard Linux file format.

The present invention comprises the following layers:

1. Equipment layer: all hardware resources, including the backup nodes (a backup cluster formed from multiple computers), the storage resources (a distributed file system built on multiple disk arrays), network resources, and so on. To eliminate single points of failure, every node in the backup cluster has the same operating system, backup system, and job scheduling and checking software installed, the same configuration parameters, and the same definitions of all backup policies and server lists. Any backup node can serve as the archive management server or the job scheduling server. The disks distributed across multiple disk arrays are virtualized by a distributed file system into one logical storage space, which is mounted as a shared directory on every backup node. Each backup node in the cluster uses the underlying disk arrays by accessing this shared directory. All data that must be saved during backup, including the MySQL database, jobs and job results, backup files, and logs, is stored in the shared directory. One backup node serves as the archive management server and is responsible for the overall backup management task; when it fails, another backup node is immediately enabled as the management server. Another backup node serves as the job scheduling server, which receives and schedules jobs. Because all backup nodes in the cluster hold the same information and share resources, the failure of a single machine does not affect normal operation of the whole system.

2. Management layer: this layer comprises all software and applications, covering backup policies, database management, job scheduling, job checking, access authorization, and so on. Every backup node includes this software and these applications.

1) Customized backup policies. Different data has different backup requirements. The backup capacity, backup frequency, backup mode (full, incremental, or differential backup), retention time, access level, and so on can be customized according to specific needs.

2) A MySQL database service is installed and dedicated MySQL tables are created for backup, recording the directory information and file information of each backup object. Besides supporting the backup process, these tables can also be used for file queries and recovery.

3) For mass data backup, to increase the backup rate, the extraction of the file list and the copying of the backup object are completed in parallel in the backup cluster in the form of jobs. Jobs with the same function are scripts containing the same program and different objects; they are generated in a fixed format and sent to the job scheduler. The job scheduler selects the backup node that executes each job according to the job execution time and the state of the backup nodes.

4) Jobs running on backup nodes are checked regularly by a job-checking script; failed jobs must be resubmitted to the job scheduler and executed again.

5) In the backup system a backup object is defined by directory, and one backup object can consist of multiple directories. A list of authorized read-only machines is defined on the backup nodes, with each entry in the format: directory + the machine name or IP address where the backup object resides. An authorized machine can query the backup file information related to itself on the backup nodes, but it cannot delete backup files on the backup nodes and cannot query or restore other directories.

3. Service layer: all services and API interfaces that the backup system provides externally, including requests for data backup, requests for data recovery, and checking and retrieval of backed-up data. Both a Linux command-line form and an API interface are available for querying the directory information and file information that has been backed up.
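
The read-only authorization list described for the management layer can be illustrated with a short sketch. This is not taken from the patent: only the entry format (directory plus machine name or IP address) follows the text, while the names `AuthEntry`, `can_query`, and the sample entries are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AuthEntry:
    """One authorized read-only entry: directory + machine name or IP address."""
    directory: str
    machine: str


def can_query(entries, machine, path):
    """Return True if `machine` may query or restore `path`.

    A machine can only reach paths inside a directory it is listed for.
    """
    target = PurePosixPath(path)
    for entry in entries:
        if entry.machine != machine:
            continue
        root = PurePosixPath(entry.directory)
        if target == root or root in target.parents:
            return True
    return False


# Hypothetical sample list, not taken from the patent.
AUTHORIZED = [AuthEntry("/home", "login"), AuthEntry("/data", "192.168.0.12")]
```

With this list, the machine `login` can look up files under `/home` but not under `/data`, which matches the rule that an authorized machine cannot query or restore other directories.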

To achieve the above object, the present invention adopts the following technical solution:

A parallel backup method for handling mass data comprises the following steps (as shown in Fig. 1):

1. Check the backup policy. A corresponding backup policy can be defined for each backup object before backup. After the backup plan starts, the definitions for this backup run are obtained from the backup policy and checked, including: whether the backup object exists and read permission has been granted; whether the directories involved in the backup are in a normal state; which backup form (full, incremental, or differential backup) this backup uses; which directory or medium the backup files are stored to; whether the archive management server and the job scheduling server are running normally without downtime; which threshold (data capacity or number of files) is used to split the file list; and which logs and information need to be recorded and submitted after the backup ends. The MySQL database is also checked for the directory information table and file information table of the backup object; if this is the first backup of the object, these two database tables are created first.

2. Generate parallel directory-scan jobs (as shown in Fig. 2). Because each backup object is defined in directory form, the directory structure is obtained level by level with a depth-first traversal algorithm. For each directory obtained, one directory record is inserted into the directory information table of the MySQL database and one scan job is generated, that is, an executable script whose content is scan + target directory name. The scan job and its working path are then submitted to the job scheduler.

3. Schedule and run the scan jobs (as shown in Fig. 3). The job scheduler evaluates the overall state of each node in the backup cluster, including the amount of free memory, the load value, total CPU usage, and the number of running job processes, selects an idle backup node as the execution node, and sends the scan job name and working path to that node. The node executes the scan job under the working path, scans the target directory defined in the job script, records the information of all files in that directory, and inserts one file record per file into the MySQL data table. After all scan jobs complete, the complete directory information and file information of the backup object is stored in the MySQL database.

4. Run the file copy as parallel jobs (as shown in Fig. 4). Depending on whether this backup is a full, incremental, or differential backup, the files that need to be backed up are extracted from the directory information table and file information table in the MySQL database and recorded in a text file. For a full backup, every file in the file information table is written to the file list one by one, with its full path name and file size. For an incremental or differential backup, the required files are extracted according to file modification time and written to the file list in the same way, with full path name and file size. The file list is split according to the splitting threshold (data capacity or number of files) and multiple copy jobs are generated, each containing tar + the file names with paths + the backup file name with path. The copy jobs are submitted to the job scheduler, which assigns them to idle backup nodes to run. A backup node executes the tar program and generates the backup file at the path defined in the job (the backup file name with path). Different jobs are responsible for copying different files, generate different backup files, and run in parallel on different backup nodes.

5. Check and summarize the backup task. Job checking: every scan job and copy job has a set time threshold; once a job times out, failure information is returned to the job scheduler. The job-checking program checks all failed jobs, generates jobs with the same content, and resubmits them to the job scheduler. A job with the same content is resubmitted at most twice; once the limit is exceeded, error information is generated and written to the backup log. Resource checking: the load value, memory usage, CPU usage, disk array usage ratio, and so on of every backup node in the backup cluster are checked and recorded. When a machine is found to be down because of a fault, the job scheduler is notified to delete that machine name; when a new machine joins, the job scheduler is notified to add its machine name. When part of the storage is found to have no remaining space or to have input/output failures, a warning message is generated and written to the backup log. Log summary: according to the definition in the backup policy, the useful information from this backup run is recorded, including the backup server name, backup object, total backup capacity, backup time, backup mode, and the errors and alarms that occurred during backup.

6. Query and recovery services for backups. If a user needs to restore lost data, the backup system first verifies whether the machine is an authorized machine and checks the corresponding directory name. The recovery service is implemented mainly as an interactive script: the user starts the script with the recovery command and, following the prompts of the text interface, enters the directory name or file name to restore, the point in time of the data to restore, the restore destination path, and so on. The system uses these parameters to find the backup file that contains the files to be restored, unpacks that tar archive, and copies the required files to the path the user specified. Users can also call the API, in the form of command + parameters, to view backup file information.
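
The file selection in step 4 can be sketched as follows. The patent only states that incremental and differential backups select files by modification time; the reference times used here (the most recent backup of any kind for incremental, the most recent full backup for differential) are the conventional definitions and are an assumption, as are the names `FileRecord`, `select_files`, and `write_file_list` and the tab-separated list format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    path: str     # file name with full path
    size: int     # size in bytes
    mtime: float  # last modification time, seconds since the epoch


def select_files(records, mode, last_full=None, last_backup=None):
    """Pick the records to back up for 'full', 'incremental' or 'differential'."""
    if mode == "full":
        return list(records)
    if mode == "incremental":
        reference = last_backup  # assumed: time of the most recent backup of any kind
    elif mode == "differential":
        reference = last_full    # assumed: time of the most recent full backup
    else:
        raise ValueError(f"unknown backup mode: {mode}")
    if reference is None:
        raise ValueError(f"{mode} backup needs a reference time")
    return [r for r in records if r.mtime > reference]


def write_file_list(records):
    """Render the file list text: one 'path<TAB>size' line per file."""
    return "".join(f"{r.path}\t{r.size}\n" for r in records)
```

In the described system the records would come from the file information table in MySQL; here they are plain objects so the selection logic can be read on its own.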

Beneficial effects of the present invention:

The present invention performs the scanning and data copying of mass data as parallel jobs, making full use of the performance of the computer cluster to complete backup tasks with huge data volumes quickly. The backup result is saved in multiple backup files, and when restoring data only the incremental backup files need to be extracted, which speeds up file recovery. All backup nodes in the backup system are configured identically and share resources, which eliminates system outages caused by a single machine failure. Backup and data copying are performed with the standard Linux tar command, and the resulting backup files are saved in tar format; they can be read with the tar programs that come with Linux or Windows operating systems, so data can be restored without relying on the backup system, which improves the availability of the backup system. The compression and encryption options selected for the tar command also reduce network consumption and the risk of data theft. After each backup, tasks are checked automatically and logs are submitted automatically, which improves the robustness of the backup system and reduces the administrative burden.

Brief description of the drawings

Fig. 1 is a flow chart of the parallel backup method of the present invention;

Fig. 2 is a flow chart of the method for generating scan jobs;

Fig. 3 is a flow chart of the method for scheduling and running scan jobs;

Fig. 4 is a flow chart of the method for generating copy jobs.

Detailed description of the embodiments

The present invention is further described below with reference to a specific example.

Take the backup of the /home directory on a machine named login as an example. The machine named bak01 is the server responsible for the backup. It first starts the selfserver service to run a self-check, confirming that the state of the local machine is normal, that the configuration related to /home can be obtained, and that every backup process exists. It then starts backup_agent, the backup daemon for /home. backup_agent starts the checkconf script, which reads the backup policy already defined for the /home directory and returns the required parameters:

1. Backup source directory: login:/home;

2. Backup frequency: one backup run per day;

3. Backup level: 0 (this backup is a full backup);

4. Access level: private (not public; the root user on the login machine can restore data);

5. Storage directory: bak01:/gluster/daily/login_home;

6. Log directory: bak01:/gluster/$date;

7. Task server: bak01;

8. Scheduling server: bak06;

9. Splitting threshold: default (by default, one copy job is split off for every 20,000 files);

10. Encryption: yes;

11. Retention time: one month;

12. Log mail recipient address: heguans@hotmail.com;

13. Log record selection: all information, including the backup summary, job errors, and warning messages;

The system then checks that the source directory exists and is readable, that the shared storage directory exists and is writable, that the log directory exists and is writable, and that the job scheduling server is available.
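
A minimal sketch of the policy parameters and the directory checks is given below for illustration. The `BackupPolicy` class, its field names, and the `preflight` function are hypothetical; only the checked conditions and the default threshold of 20,000 files come from the example. The scheduler availability check is left out because the text does not say how it is performed.

```python
import os
from dataclasses import dataclass


@dataclass
class BackupPolicy:
    source_dir: str               # directory to back up
    storage_dir: str              # shared directory that receives backup files
    log_dir: str                  # directory for backup logs
    level: int = 0                # 0 = full backup, as in the example
    split_threshold: int = 20000  # files per copy job (the stated default)


def preflight(policy):
    """Return a list of problems; an empty list means the backup may start."""
    problems = []
    if not os.path.isdir(policy.source_dir):
        problems.append(f"source directory missing: {policy.source_dir}")
    elif not os.access(policy.source_dir, os.R_OK | os.X_OK):
        problems.append(f"source directory not readable: {policy.source_dir}")
    for label, path in (("storage", policy.storage_dir), ("log", policy.log_dir)):
        if not os.path.isdir(path):
            problems.append(f"{label} directory missing: {path}")
        elif not os.access(path, os.W_OK | os.X_OK):
            problems.append(f"{label} directory not writable: {path}")
    return problems
```

Returning a list of problems rather than stopping at the first one lets a single run report every misconfigured directory to the backup log.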

Next, the backup_agent process starts the baklog process and the finddir process on bak01. baklog creates a log file named /gluster/20170218/log_20170218000300, which stores the known information such as the backup source directory, the server running the task, the backup level, and the storage directory, and continuously records the information the following processes will produce, such as job check results, job resubmission results, data copy progress, and alarms. The finddir process filters out files and records only the information of each subdirectory under /home, including the unique ID of the directory, its absolute path, its relative depth (relative to the /home directory; for example, /home/a is recorded with depth 2 and /home/a/b with depth 3), the parent directory name, and the directory creation time, and inserts a record for the directory into the MySQL database. For example, for the /home/a directory, the ID in the database is 46821131, the absolute path is /home/a, the depth is 2, the parent directory name is /home, and the directory creation time is 2015/08/10. At the same time, a scan job for scanning the /home/a directory is generated, named scanjob + a random number, for example scanjob021; the job is written to the shared storage directory /gluster/tmp/job/20170218/login_home and sent to the job scheduler bak06.
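
The directory traversal performed by finddir can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: `DirRecord`, `find_dirs`, and `make_scan_job` are hypothetical names, the IDs are sequential rather than database-generated, the job name uses the ID instead of a random number so the output is reproducible, the backup root itself is recorded with depth 1 so that files directly under it also get a scan job, and `scan` stands in for the scan program, whose real name is not given.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DirRecord:
    dir_id: int
    path: str
    depth: int   # the backup root has depth 1, its children depth 2, ...
    parent: str


def find_dirs(root):
    """Depth-first walk yielding one DirRecord per directory at or below `root`.

    Files are skipped here; each directory's files are left to its scan job.
    """
    next_id = 1
    stack = [(root, 1)]
    while stack:
        current, depth = stack.pop()
        yield DirRecord(next_id, current, depth, os.path.dirname(current))
        next_id += 1
        with os.scandir(current) as entries:
            children = sorted(
                e.path for e in entries if e.is_dir(follow_symlinks=False)
            )
        # Push in reverse so the alphabetically first child is visited first.
        stack.extend((child, depth + 1) for child in reversed(children))


def make_scan_job(record):
    """Return (job name, script text) for scanning one directory."""
    return f"scanjob{record.dir_id:03d}", f"scan {record.path}\n"
```

An explicit stack is used instead of recursion so that very deep directory trees do not hit Python's recursion limit.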

After receiving the job request, the bak06 machine starts the job_handle process, which launches jobs and checks them periodically. Using an algorithm, it obtains the state evaluation values of all machines in the current backup cluster and selects the bak03 machine to execute the script /gluster/tmp/job/20170218/login_home/scanjob21, which scans the files in /home/a and filters out the subdirectories of /home/a. Each time a file is scanned, one file record is inserted into the MySQL database, including the unique file ID, file name, absolute file path, user name of the file owner, group name, file size, last modification time, and last status change time. After the scanjob21 job on bak03 completes, it notifies the job scheduler bak06 and at the same time writes the files scanjob21.e and scanjob21.o to the /gluster/tmp/job/20170218/login_home directory. If the size of scanjob21.e is greater than 0, the scanjob21 job ran into a problem and did not return a correct value.
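
The patent does not disclose the algorithm behind the state evaluation value. The sketch below shows one possible scoring over the metrics the description names (free memory, load, CPU usage, number of running jobs); the weights, the `NodeState` class, and the function names are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    name: str
    free_mem_gb: float
    load: float         # for example the 1-minute load average
    cpu_percent: float  # 0..100
    running_jobs: int


def score(node):
    """Higher is better. The weights are illustrative, not from the patent."""
    return (node.free_mem_gb
            - 2.0 * node.load
            - 0.1 * node.cpu_percent
            - 1.0 * node.running_jobs)


def pick_node(nodes):
    """Return the name of the best-scoring node; ties go to the smaller name."""
    if not nodes:
        raise ValueError("no backup nodes available")
    best = min(nodes, key=lambda n: (-score(n), n.name))
    return best.name
```

For example, a node with 32 GB free, a load of 0.2, 5% CPU usage, and no running jobs scores higher than a busy node with 8 GB free, so the idle node is chosen.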

The periodically executed job_handle process checks the .e result files of all jobs, resubmits the failed jobs, and records how many times each job has been resubmitted, until all job checks and resubmissions have finished. Once the number of jobs in the job scheduler is 0, the job_handle process stops running and returns a value to the backup_agent process on bak01.
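
One checking pass of this kind could look like the sketch below. The rule that a non-empty `.e` file marks a failed job comes from the example; the function names, the `resubmit` callback (a hypothetical hook into the scheduler), the default limit of two resubmissions, and the removal of the old error file before resubmitting are assumptions made for illustration.

```python
import os


def failed_jobs(job_dir, job_names):
    """Return the jobs whose '<name>.e' error file exists and is non-empty."""
    failed = []
    for name in job_names:
        err_path = os.path.join(job_dir, name + ".e")
        if os.path.exists(err_path) and os.path.getsize(err_path) > 0:
            failed.append(name)
    return failed


def check_and_resubmit(job_dir, job_names, attempts, resubmit, max_retries=2):
    """Run one checking pass and return the jobs that were given up on.

    `attempts` maps job name -> resubmissions so far and is updated in place.
    `resubmit` is a callable supplied by the caller.
    """
    abandoned = []
    for name in failed_jobs(job_dir, job_names):
        if attempts.get(name, 0) >= max_retries:
            abandoned.append(name)  # caller writes these to the backup log
            continue
        attempts[name] = attempts.get(name, 0) + 1
        # Clear the old error file so the next pass only sees the new run.
        os.remove(os.path.join(job_dir, name + ".e"))
        resubmit(name)
    return abandoned
```

A periodic caller would repeat this pass until no jobs remain in the scheduler, then write the abandoned job names to the backup log.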

The backup grade 0 that backup_agent processes are obtained according to the first step, extraction/home from the library of database All Files list under catalogue, and according to cutting threshold value, every 20,000 files just generate a copy operation.Submit all copy In shellfish operation, such as dumpjob155 to job scheduler bak06 machines.Second of startup job_handle process of Bak06, and Bak19 is selected to start execution/gluster/tmp/job/20170218/login_home/ according to current state evaluation of estimate File is copied using the linux tar programs carried in dumpjob155, dumpjob155, and the backup of generation is literary Part is preserved into/gluster/daily/login_home/20170218 catalogues.After the completion of all copy operations execute, together Sample generates dumpjob155.e files and dumpjob155.o files, so that job_handle process checks are handled.In addition ,/ An index155 file is generated in gluster/daily/login_home/20170218/index, for recording this 20,000 The title and file size of file.After job_handle processes are out of service, the backup_agent that is equally notified that on bak01 Process.

Finally, the backup_agent process starts the mailman process, which calls the Linux sendmail program to send the content of /gluster/20170218/log_20170218000300 by mail to the relevant administrator at heguans@hotmail.com. At the same time, the backup_agent process starts the garbage process, which cleans up all directories and files under /gluster/tmp/job to avoid leaving junk files behind. At this point, this backup of the /home directory on login is complete and the backup_agent process stops running.

If a user needs to restore certain files under /home, the user must log in to the login machine as root and run the recovery command, then follow the prompts step by step to submit the directory name or file name to restore, the date of the files to restore (for example 20170218), and the location to restore the files to (for example /tmp). bak01 searches the index files under /gluster/daily/login_home/20170218/index, extracts the required files, and copies them to the login:/tmp directory.
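
The index lookup used during recovery can be illustrated with the sketch below. The patent states only that each index file records the names and sizes of the files in one copy job; the line format (`path<TAB>size`), the function name `find_in_indexes`, and the idea of returning matches grouped by index file are assumptions.

```python
import os


def find_in_indexes(index_dir, wanted):
    """Map each index file name to the listed paths equal to or under `wanted`.

    Assumed index format: one 'path<TAB>size' line per backed-up file.
    """
    prefix = wanted.rstrip("/") + "/"
    hits = {}
    for name in sorted(os.listdir(index_dir)):
        matches = []
        with open(os.path.join(index_dir, name), encoding="utf-8") as fh:
            for line in fh:
                path = line.rstrip("\n").split("\t", 1)[0]
                if path == wanted or path.startswith(prefix):
                    matches.append(path)
        if matches:
            hits[name] = matches
    return hits
```

Each index file that reports a match identifies the backup file produced by the same copy job, so only those tar archives need to be unpacked to recover the requested files.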

Claims

1. A parallel job backup method for mass data backup, comprising the steps of:

1) selecting multiple computers as backup nodes to form a backup cluster, each backup node having a unified configuration; connecting each disk array to each backup node as a logical volume, and building a backup database on the logical volume;

2) a terminal that needs a backup selecting one backup node as the archive management server and starting the backup policy of the object to be backed up on the archive management server; wherein the backup object is defined in directory form, that is, each backup object corresponds to a directory;

3) the archive management server selecting one backup node as the job scheduling server according to the backup policy, and checking whether the directory information table and file information table of the backup object exist in the backup database and, if not, creating the directory information table and file information table of the backup object; then obtaining the directory structure of the backup object level by level and, for each directory obtained, inserting one directory record into the directory information table and generating one scan job; the scan job comprising a scan program name and a target directory name;

4) the archive management server submitting each scan job and its working path to the job scheduling server; the job scheduling server selecting several backup nodes as execution nodes and sending each scan job and its working path to one execution node; each execution node scanning the target directory in the scan job it receives, recording the information of all files in the target directory, and inserting one file record into the file information table for each file scanned;

5) the archive management server selecting the files to be backed up according to the backup policy, the directory information table, and the file information table to generate a file list, and splitting the file list according to the splitting threshold to obtain several sublists;

6) the archive management server generating one copy job from each sublist and sending each copy job to the job scheduling server; the copy job comprising a file copy program name, the file names with paths, and the backup file name with path;

7) the job scheduling server sending different copy jobs to different backup nodes, each backup node copying the corresponding files to be backed up to the corresponding location in the logical volume according to the copy job it receives.

2. The method of claim 1, wherein the information in the backup policy includes the read permission of the backup object, the backup form, the directory or medium to which the backup files are stored, the backup node serving as the job scheduling server, the splitting threshold, and the logs and information that need to be recorded and submitted after the backup ends.

3. The method of claim 2, wherein when the backup cluster receives a request from a terminal to restore a backup object, the backup cluster first verifies whether the terminal is a terminal authorized in the backup policy; if it is an authorized terminal, the terminal is prompted to enter the directory name or file name to restore, the point in time of the data to restore, and the restore destination path; the backup file containing the files to be restored is then located according to this input and copied to the specified path.

4. The method of claim 1, 2, or 3, wherein the scan job and the copy job each have a set time threshold, and if the execution time exceeds the time threshold, failure information is returned to the job scheduling server; the job scheduling server selects a backup node to re-execute the failed scan job or copy job; and if the number of executions of the same scan job or copy job exceeds a set threshold, execution of that job stops and error information is generated and written to the backup log.

5. The method of claim 1, 2, or 3, wherein the job scheduling server selects the backup node that executes a job according to the execution time of the job and the state of the backup nodes; wherein the job is the scan job or the copy job.

6. The method of claim 1, 2, or 3, wherein the disk arrays are mounted by means of a distributed file system and are provided with a unified name.

7. The method of claim 1, 2, or 3, wherein the disks distributed across multiple disk arrays are virtualized by a distributed file system into one logical storage space, and the logical storage space is mounted as a shared directory on each backup node; each backup node in the backup cluster uses the underlying disk arrays by accessing the shared directory.