CN108804253B

CN108804253B - Parallel operation backup method for mass data backup

Info

Publication number: CN108804253B
Application number: CN201710301054.5A
Authority: CN
Inventors: 姚秋玲; 陈德清
Original assignee: Institute of High Energy Physics of CAS
Current assignee: Institute of High Energy Physics of CAS
Priority date: 2017-05-02
Filing date: 2017-05-02
Publication date: 2021-08-06
Anticipated expiration: 2037-05-02
Also published as: CN108804253A

Abstract

The invention discloses a parallel operation backup method for mass data backup. The method comprises the following steps: 1) selecting a plurality of backup nodes to form a backup cluster, wherein each backup node has a uniform configuration; 2) a terminal selects a backup node as the backup management server and starts the backup strategy of the object to be backed up; 3) the backup management server selects a backup node as the job scheduler, acquires the directory structure corresponding to the backup object layer by layer, and generates a scanning job each time a directory is acquired; 4) the backup management server submits each scanning job and the corresponding job path to the job scheduler, and the job scheduler sends each scanning job to a backup node, which scans the target directory named in the scanning job; 5) the backup management server selects the files to be backed up and generates a plurality of file sub-tables, generates a copy job from each sub-table, and sends the copy jobs to the job scheduler; 6) the job scheduler sends different copy jobs to different backup nodes, which copy the files to be backed up to the corresponding positions.

Description

Parallel operation backup method for mass data backup

Technical Field

The present invention relates to a data backup method, and more particularly, to a parallel job backup method for mass data backup.

Background

Data is vital to an enterprise, department, organization, or individual. Data can be lost or damaged for many reasons, such as equipment failure, hacker attacks, viruses, and human error, so data backup is very important. Data backup is a data security policy: a copy of key data is made so that the data can be restored through backup software when a fault occurs, thereby avoiding the losses caused by data loss.

With the continuous development of information technology, emerging technologies such as cloud computing, the internet of things, and social networks have caused the types and scale of data in human society to grow explosively worldwide. By 2012, the amount of data had risen from the TB (1 TB = 1024 GB) level to the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB), and even ZB (1 ZB = 1024 EB) level. The advent of the big data era has also driven a rapid increase in backup demand, and mass data at the TB level and above brings new challenges to data backup.

In addition, data storage is becoming more diverse: there are structured traditional relational databases, unstructured non-relational databases, and distributed file systems typified by GFS and HDFS. As the amount and variety of data increase, backing up such data becomes more complex and time-consuming.

In the face of mass data, how to fully utilize software and hardware resources, meet different backup requirements, and quickly and effectively complete data backup and recovery is a main purpose of designing and researching a backup system. The existing backup software has several problems:

1. Existing software is not designed for the backup of mass data. The most important step in the backup process is copying the backup object to another machine. In this step, much backup software copies and transmits data as a single data stream, so backup speed and capacity cannot be increased because of the limits of a single server or of network bandwidth. Performance is good when several thousand or tens of thousands of files are backed up. However, for massive data containing tens of millions or even hundreds of millions of files, several days or even weeks are needed, and the backup task cannot be completed within an acceptable time.

2. The backup system has a potential single point of failure. Some backup systems deploy multiple backup servers, but different backup servers are responsible for different backup services. Once a server fails, the backup and restore services defined on that server cannot proceed.

3. For security reasons, backup software often uses a proprietary storage format, so the backup files depend on the backup software. When the software fails, the backup files cannot be used, which is equivalent to having no backup at all.

Disclosure of Invention

To address the defects in the prior art, the invention aims to provide a parallel job backup method for mass data backup. The method constructs a backup cluster and submits parallel jobs to obtain the list of mass data to be backed up, submits parallel backup jobs according to a customized backup strategy, and stores the backup files in a standard linux file format.

The invention comprises several structural blocks:

1. Equipment layer: all hardware resources, including the backup cluster (backup nodes, i.e. multiple computers), storage resources (a distributed file system composed of multiple disk arrays), network resources, and so on. To remove any single point of failure, every node in the backup cluster has the same operating system, backup system, and job scheduling and checking software installed, uses the same parameter configuration, and holds the same definitions of all backup strategies and server lists. Each backup node can therefore act as a backup management server or a job scheduling server. Disks distributed across multiple disk arrays are virtualized into one logical storage space by means of a distributed file system, and this logical storage space is mounted on each backup node as a shared directory. Each backup node in the backup cluster uses the underlying disk arrays by accessing the shared directory. All data that must be saved during the backup process, including the mysql database, jobs and job results, backup files, and logs, is stored in the shared directory. One backup node serves as the backup management server and is responsible for the whole backup management task; when it fails, another backup node is started to take over as the management server. Another backup node serves as the job scheduling server, which receives and schedules jobs. Because all backup nodes in the cluster hold the same information and share resources, the failure of a single machine does not affect the normal operation of the whole system.

2. Management layer: this layer comprises all software and applications, including backup strategies, database management, job scheduling, job checking, access authorization, and so on. Every backup node contains this software and these applications.

1) Customizable backup strategies. Different data have different backup requirements, and the backup capacity, the backup frequency, the backup mode (full backup, incremental backup or differential backup), the retention time, the access level and the like can be customized according to specific requirements.

2) A mysql database service is installed, and dedicated mysql tables are created for backup. These tables record the directory information and file information of each backup object and support file query and recovery.

3) For the backup of mass data, to improve backup speed, the extraction of the file list and the copying of the backup object are completed in parallel on the backup cluster in the form of jobs. Jobs with the same function are scripts that contain the same program but different objects; they are generated in a fixed format and sent to the job scheduler. The job scheduler selects the backup node that executes each job according to the job execution time and the state of the backup nodes.

4) Jobs running on the backup nodes are checked by periodic job check scripts, and failed jobs are returned to the job scheduler to be executed again.

5) In the backup system, backup objects are defined by directories, and one backup object can consist of several directories. A list of authorized read-only machines is defined on the backup node in the following format: directory + the machine name or IP address where the backup object is located. An authorized machine can query the backup node for information about the backup files related to it, but cannot delete backup files on the backup node and cannot query or restore other directories.

3. Service layer: all services and API interfaces that the backup system provides externally, including applications for data backup, requests for data recovery, and viewing and retrieving backed-up data. The system also provides a linux command line interface and an API (application programming interface) for querying the backed-up directory information and file information.
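The authorized read-only machine list described in the management layer can be illustrated with a short Python sketch. The text only states that an entry is "directory + machine name or IP address"; the whitespace-separated line format and the names `AuthEntry`, `parse_auth_line`, and `may_query` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthEntry:
    directory: str  # backup object directory, e.g. "/home"
    machine: str    # machine name or IP address allowed to query it


def parse_auth_line(line):
    # Assumed line format: "<directory> <machine-name-or-ip>"
    directory, machine = line.split()
    return AuthEntry(directory.rstrip("/") or "/", machine)


def may_query(entries, machine, path):
    """Return True if `machine` may query backups of `path`."""
    path = path.rstrip("/") or "/"
    for entry in entries:
        if entry.machine != machine:
            continue
        # The path must be the authorized directory itself or lie below it.
        if (entry.directory == "/" or path == entry.directory
                or path.startswith(entry.directory + "/")):
            return True
    return False
```

A machine that is not listed, or a path outside the listed directory, is refused, which matches the statement that an authorized machine cannot query or restore other directories.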

In order to achieve the purpose, the invention adopts the following technical scheme:

A parallel backup method for processing mass data comprises the following steps (as shown in FIG. 1):

1. Check the backup strategy. A backup strategy is defined for each backup object before backup. After the backup plan starts, all definitions for the backup process are read from the backup strategy, including: whether the backup object exists and read permission has been granted; whether the directories involved in the backup process are in a normal state; which backup form is used (full backup, incremental backup, or differential backup); in which directory or medium the backup files are stored; whether the backup management server and the job scheduling server are working normally; which threshold is used to split the file list (data capacity or number of files); and which logs and information must be recorded and submitted after the backup finishes. The method then checks whether the mysql database already contains the directory information table and the file information table of the backup object; if the object is being backed up for the first time, these two tables are created first.

2. Generate parallel directory scanning jobs (as shown in FIG. 2). Because each backup object is defined as a directory, the directory structure is acquired layer by layer with a depth-first traversal algorithm. Each time a directory is acquired, a directory record is inserted into the directory information table of the mysql database, and a scanning job, i.e. an executable script, is generated at the same time. The scanning job contains the scanning program name and the target directory name. The scanning job and its job path are then submitted to the job scheduler.

3. Schedule and run the scanning jobs (as shown in FIG. 3). The job scheduler evaluates the overall state of each node in the backup cluster, including remaining memory, load value, total CPU usage, and number of running processes, selects idle backup nodes as execution nodes, and sends the scanning job names and job paths to those nodes. Each execution node runs the scanning job under the job path, scans the target directory defined in the job script, and inserts one file record into the mysql data table for each file under that directory. When all scanning jobs are complete, the complete directory information and file information of the backup object are stored in the mysql database.

4. Run the file copy as parallel jobs (as shown in FIG. 4). Depending on whether the backup is a full, incremental, or differential backup, the files to be backed up are extracted from the directory information table and the file information table in the mysql database and recorded in a text file. For a full backup, all files in the file information table are written to the file list text one by one, each with its path-qualified file name and file size. For an incremental or differential backup, the required files are selected by file modification time and written to the file list text in the same way. The file list is then split according to the segmentation threshold (data capacity or number of files), and one copy job is generated for each part. A copy job consists of tar + the path-qualified file names + the path-qualified backup file name. The copy jobs are submitted to the job scheduler, which allocates them to idle backup nodes. Each backup node executes the tar program and generates a backup file at the path defined by the job (i.e. the path-qualified backup file name). Different jobs copy different files, generate different backup files, and run in parallel on different backup nodes.

5. Check and summarize the backup task. Job check: each scanning job or copy job has a set time threshold, and once it times out, failure information is returned to the job scheduler. The job checker checks all failed jobs and generates jobs with the same content to resubmit to the job scheduler. A job with the same content is resubmitted at most twice; once this limit is exceeded, error information is generated and written to the backup log. Resource check: the load value, memory usage, CPU usage, disk array usage ratio, and so on are checked and recorded for each backup node in the backup cluster. When a machine is found to be down, the job scheduler is notified to remove that machine name. When a new machine is added, the job scheduler is notified to add its machine name. When part of the storage has no remaining space or has an input/output fault, alarm information is generated and written to the backup log. Log summary: according to the definition in the backup strategy, useful information about the backup job is recorded, including the backup server name, backup object, total backup capacity, backup time, backup mode, and any errors and alarms raised during the backup.

6. Provide backup query and restore services. If a user needs to recover lost data, the backup system first checks whether the requesting machine is an authorized machine and checks the corresponding directory name. The recovery service is implemented mainly by an interactive script. The user starts the script with the recovery command and, following the prompts of the character interface, enters the directory name or file name to be recovered, the point in time to recover from, the recovery destination path, and so on. Based on these parameters the system finds the backup file that contains the required file, unpacks it with tar, and copies the required file to the path specified by the user. The user may also call the API, in the form of a command plus parameters, to view backup file information.
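The node selection in step 3 can be sketched as follows. The description names the inputs of the state evaluation (remaining memory, load value, CPU usage, number of running processes) but not the formula, so the weights in `score` are purely hypothetical, as are the names `NodeState` and `pick_node`.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    free_mem_mb: float   # remaining memory
    load: float          # load value
    cpu_percent: float   # total CPU usage
    processes: int       # number of running processes


def score(node):
    # Hypothetical weighting, lower is better: busy nodes score high,
    # free memory lowers the score.
    return (node.load * 10 + node.cpu_percent
            + node.processes * 0.1 - node.free_mem_mb / 1024)


def pick_node(nodes, down=()):
    """Pick the least busy node, skipping machines reported as down."""
    candidates = [n for n in nodes if n.name not in down]
    if not candidates:
        raise RuntimeError("no backup node available")
    return min(candidates, key=score)
```

Passing the set of failed machines in `down` mirrors the resource check in step 5, where the job scheduler is told to remove a machine name when that machine is found to be down.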

The invention has the following positive effects:

The invention uses parallel jobs to complete the scanning and data copying tasks for mass data, makes full use of the performance of a computer cluster, and quickly completes backup tasks involving huge data volumes. The backup result is stored in multiple backup files, so only some of the backup files need to be extracted when data is restored, which speeds up file recovery. All backup nodes in the backup system are configured identically and share resources, which prevents the failure of a single machine from paralyzing the system. Backup and data copying are executed by the standard linux tar command, and the generated backup files are stored in tar format, which can be read by the tar program that comes with a linux or windows operating system; data can therefore be recovered without depending on the backup system, which improves the availability of the backups. The compression and encryption parameters selected for the tar command also reduce network consumption and the risk of data theft. After each backup, jobs are checked automatically and logs are submitted automatically, which improves the robustness of the backup system and reduces the management burden.

Drawings

FIG. 1 is a flow chart of a parallel backup method of the present invention;

FIG. 2 is a flow chart of a method of generating a scan job;

FIG. 3 is a flowchart of a scan job scheduling and running method;

FIG. 4 is a flow chart of a method of generating a copy job.

Detailed Description

The present invention is further described below with reference to specific examples.

Take the backup of the /home directory on a machine named login as an example. First, the machine named bak01 is the server responsible for the backup. It starts its service self-check process, which checks that the state of the machine is normal, that the configuration related to /home can be obtained, and that every backup process exists. Then backup_agent, the backup daemon for /home, is started. backup_agent starts a checkconf script, which reads the backup strategy defined for the /home directory and returns the required parameters:

1. Backup source directory: login:/home;

2. Backup frequency: one backup per day;

3. Backup level: 0 (a full backup);

4. Access level: private (non-public; only the root user on the login machine can retrieve data);

5. Storage directory: bak01:/cluster/day/login_home;

6. Log directory: bak01:/$date/cluster;

7. Task running server: bak01;

8. Scheduling server: bak06;

9. Segmentation threshold: default (by default, one copy job is split off for every 20000 files);

10. Encryption: yes;

11. Retention time: one month;

12. Log mail receiving address: heguans@hotmail.com;

13. Log record selection: all information, including the backup summary, job error reports, and alarm information;

The script then checks whether the source directory exists and is readable, whether the shared storage directory exists and is writable, whether the log directory exists and is writable, and whether the job scheduling server is available.
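The directory checks listed above might look like the following minimal Python sketch. The function name `check_strategy_paths` is hypothetical, and the availability check for the job scheduling server is left out because the text does not say how it is performed.

```python
import os


def check_strategy_paths(source_dir, store_dir, log_dir):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # The source directory must exist and be readable (and traversable).
    if not os.path.isdir(source_dir):
        problems.append("source directory does not exist: " + source_dir)
    elif not os.access(source_dir, os.R_OK | os.X_OK):
        problems.append("source directory is not readable: " + source_dir)
    # The storage and log directories must exist and be writable.
    for label, path in (("storage", store_dir), ("log", log_dir)):
        if not os.path.isdir(path):
            problems.append(label + " directory does not exist: " + path)
        elif not os.access(path, os.W_OK | os.X_OK):
            problems.append(label + " directory is not writable: " + path)
    return problems
```

Returning a list of problems rather than stopping at the first one lets all findings be written to the backup log in one pass.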

Thereafter, the backup_agent process starts the baklog process and the finddir process on bak01. baklog generates a log file named /cluster/20170218/log_20170218000300, which stores the information already known, such as the backup source directory, the running server, the backup level, and the storage directory, and then continuously records the information produced by the processes described below, such as job check results, job resubmission results, the data copying process, and alarms. The finddir process filters out files and records only the information about each subdirectory under /home: the unique ID of the directory, the absolute path of the directory, the relative depth of the directory (relative to the /home directory; for example, /home/a is recorded with depth 2 and /home/a/b with depth 3), the parent directory name, and the directory creation time. One record is inserted into the mysql database for each directory. For example, for the /home/a directory, the ID number in the database is 46821131, the absolute path is /home/a, the directory depth is 2, the parent directory name is /home, and the directory creation time is 2015/08/10. At the same time, a scanning job for the /home/a directory is generated, named scanjob + a random number, such as scanjob021, written to the shared storage directory /cluster/tmp/job/20170218/login_home, and sent to the job scheduler bak06.
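The behaviour of the finddir process can be approximated by the sketch below, which walks directories depth-first, ignores files, and reports each directory with its relative depth and parent. It assumes the backup root itself has depth 1, so that a first-level subdirectory gets depth 2 as in the /home/a example. The function names and the dictionary layout of a scan job are illustrative only; the real jobs are executable scripts.

```python
import os
import random


def walk_directories(root):
    """Yield (path, depth, parent) for root and its subdirectories, depth-first."""
    stack = [(root, 1)]
    while stack:
        path, depth = stack.pop()
        yield path, depth, os.path.dirname(path)
        try:
            # Files are filtered out; only real subdirectories are followed.
            children = sorted(
                entry.path for entry in os.scandir(path)
                if entry.is_dir(follow_symlinks=False)
            )
        except OSError:
            continue  # unreadable directory: record it, but do not descend
        # Push in reverse so the alphabetically first child is visited first.
        for child in reversed(children):
            stack.append((child, depth + 1))


def make_scan_job(target, rng=random):
    # Job name is "scanjob" plus a random number, as in the description.
    return {"name": "scanjob{:03d}".format(rng.randint(0, 999)), "target": target}
```

Each yielded tuple corresponds to one row of the directory information table, and one scan job would be generated per row.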

After receiving the job submission, the bak06 machine starts job_handle, the periodic job check process. It obtains the state evaluation values of all machines in the current backup cluster according to the algorithm and selects the bak03 machine to execute the /cluster/tmp/job/20170218/login_home/scanjob21 script, which scans the files in /home/a and filters out the subdirectories of /home/a. A file record is inserted into the mysql database for each file scanned, containing the unique file ID number, file name, absolute file path, user name of the file owner, group name, file size, last modification time, and last status change time. After the scanjob21 job on bak03 is completed, the job scheduler bak06 is notified, and a scanjob21.e file and a scanjob21.o file are generated in the /cluster/tmp/job/20170218/login_home directory. If the size of the scanjob21.e file is greater than 0, the scanjob21 job ran into a problem and did not return a correct value.
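What a single scan job collects can be sketched in Python as follows. This is an illustration rather than the actual scan script: it returns the records instead of inserting them into mysql, and it stores numeric owner and group IDs where the description stores user and group names. The function name `scan_directory` is hypothetical.

```python
import os


def scan_directory(target):
    """Return one record per regular file directly inside `target`."""
    records = []
    with os.scandir(target) as entries:
        for entry in entries:
            # Subdirectories are skipped: each has its own scan job.
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            records.append({
                "name": entry.name,
                "path": entry.path,
                "uid": st.st_uid,      # numeric owner ID
                "gid": st.st_gid,      # numeric group ID
                "size": st.st_size,
                "mtime": st.st_mtime,  # last modification time
                "ctime": st.st_ctime,  # last status change time on Unix
            })
    return sorted(records, key=lambda r: r["name"])
```

Because every job touches exactly one directory level, jobs for different directories are independent and can run on different backup nodes at the same time.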

The periodically executed job_handle process checks all job result .e files, resubmits the failed jobs, and records the number of resubmissions of each job, until all job checks and job resubmissions are finished. Once the number of jobs in the job scheduler reaches 0, the job_handle process stops running and returns a value to the backup_agent process on bak01.
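The check performed by job_handle can be sketched as below, assuming that a non-empty `<job>.e` file marks a failed job and that a job is resubmitted at most twice, as stated in the job check step of the method. The function names and the `counts` dictionary are hypothetical.

```python
import os

MAX_RESUBMITS = 2  # a job with the same content is resubmitted at most twice


def failed_jobs(job_dir):
    """Return the names of jobs whose .e (error) file is non-empty."""
    failed = []
    for entry in os.scandir(job_dir):
        if (entry.name.endswith(".e") and entry.is_file()
                and entry.stat().st_size > 0):
            failed.append(entry.name[:-2])  # strip the ".e" suffix
    return sorted(failed)


def plan_resubmits(failed, counts):
    """Split failed jobs into those to resubmit and those to give up on.

    `counts` maps job name to resubmissions so far and is updated in place.
    """
    resubmit, give_up = [], []
    for job in failed:
        if counts.get(job, 0) < MAX_RESUBMITS:
            counts[job] = counts.get(job, 0) + 1
            resubmit.append(job)
        else:
            give_up.append(job)  # caller writes an error to the backup log
    return resubmit, give_up
```

Keeping the counter outside the function lets the periodic check run repeatedly while still enforcing the limit across runs.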

According to the backup level 0 obtained in the first step, the backup_agent process extracts the list of all files under the /home directory from the file information table of the database and, according to the segmentation threshold, generates one copy job for every 20000 files. All copy jobs, such as dumpjob155, are submitted to the job scheduler on the bak06 machine. bak06 starts the job_handle process a second time and, according to the current state evaluation values, selects bak19 to execute /cluster/tmp/job/20170218/login_home/dumpjob155, which copies the files listed in dumpjob155 with the linux tar program and stores the generated backup file in the /cluster/file/login_home/20170218 directory. After the copy job has been executed, a dumpjob155.e file and a dumpjob155.o file are likewise generated for the job_handle process to check. In addition, an index155 file is generated in /cluster/file/login_home/20170218/index to record the names and sizes of the 20000 files. After the job_handle process stops running, the backup_agent process on bak01 is again notified.
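Splitting the file list and building the copy command can be sketched as follows. Both threshold types from the method are shown (number of files and data capacity). Passing the file names through a list file with GNU tar's `-T` option is an assumption; the description only says a copy job consists of tar, the path-qualified file names, and the path-qualified backup file name.

```python
import shlex


def split_by_count(files, max_files=20000):
    """Split a file list into chunks of at most `max_files` entries."""
    return [files[i:i + max_files] for i in range(0, len(files), max_files)]


def split_by_size(files, max_bytes):
    """Split (path, size) pairs so each chunk stays within `max_bytes`.

    A single file larger than the threshold gets a chunk of its own.
    """
    chunks, current, total = [], [], 0
    for path, size in files:
        if current and total + size > max_bytes:
            chunks.append(current)
            current, total = [], 0
        current.append((path, size))
        total += size
    if current:
        chunks.append(current)
    return chunks


def tar_command(archive, list_file):
    # GNU tar: -c create, -f archive name, -T read member names from a file.
    return "tar -cf {} -T {}".format(shlex.quote(archive), shlex.quote(list_file))
```

With the default threshold, a list of 45000 files yields three copy jobs of 20000, 20000, and 5000 files, each of which can run on a different backup node.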

Finally, the backup_agent process starts the main process, which calls the linux sendmail program and mails the content of /cluster/20170218/log_20170218000300 to the responsible administrator at heguans@hotmail.com. At the same time, the backup_agent process starts a garpage process to clean up all directories and files under /cluster/tmp/job, so that no garbage files are left behind. At this point, the backup of the /home directory on login is complete, and the backup_agent process stops running.

If a user needs to recover some files under /home, the user must log in to the login machine as root, execute the recovery command, and, following the prompts step by step, submit the directory name or file name to be recovered, the date of the files to be recovered (such as 20170218), and the location to recover them to (such as /tmp). bak01 then searches the index files under /cluster/file/login_home/20170218/index, extracts the files the user needs, and copies them to the login:/tmp directory.
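The lookup and extraction in this recovery step can be sketched with Python's standard `tarfile` module. The on-disk format of the index files is not given in the text, so the index is modeled here as a dictionary from archive path to the set of member names; `find_archive` and `restore` are hypothetical names. Member names are relative because tar strips the leading slash when archiving.

```python
import os
import tarfile


def find_archive(index, wanted):
    """Return the archive whose index lists `wanted`, or None.

    `index` maps archive path -> collection of member names.
    """
    for archive, members in index.items():
        if wanted in members:
            return archive
    return None


def restore(archive, member, dest):
    """Extract one member from a tar archive below `dest`."""
    with tarfile.open(archive) as tar:
        tar.extract(member, path=dest)
    return os.path.join(dest, member)
```

Only the archive that contains the requested file is opened, which is why storing the backup as many smaller tar files speeds up recovery.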

Claims

1. A parallel operation backup method for mass data backup comprises the following steps:

1) selecting a plurality of computers as backup nodes to form a backup cluster, wherein each backup node has uniform configuration; each disk array is connected with each backup node in a logical volume mode, and a backup database is constructed on the logical volume;

2) selecting a backup node as a backup management server by a terminal needing backup, and starting a backup strategy of an object to be backed up on the backup management server; the backup objects are defined in a directory form, namely each backup object corresponds to a directory;

3) the backup management server selects a backup node as a job scheduling server according to the backup strategy, checks whether a directory information table and a file information table of the backup object exist in the backup database, and if not, establishes the directory information table and the file information table of the backup object; then acquires the directory structure corresponding to the backup object layer by layer, inserts a piece of directory information into the directory information table each time a directory is acquired, and generates a scanning job; the scanning job comprises a scanning program name and a target directory name;

4) the backup management server submits each scanning job and the corresponding job path to the job scheduling server; the job scheduling server selects a plurality of backup nodes as execution nodes and sends each scanning job and the job path corresponding to the scanning job to one execution node; each execution node scans the target directory in the received scanning job, records all file information under the target directory, and inserts a piece of file information into the file information table for each scanned file;

5) the backup management server selects files to be backed up according to the backup strategy, the directory information table and the file information table to generate a file list, and segments the file list according to a segmentation threshold value to obtain a plurality of sub-tables;

6) the backup management server generates a copy job according to each sub-table and sends each copy job to the job scheduling server; the copy job includes a copy program name, a file name with a path, and a backup file name with a path;

7) the job scheduling server sends different copy jobs to different backup nodes, and the backup nodes copy the corresponding files to be backed up to the corresponding positions in the logical volume according to the received copy jobs.

2. The method of claim 1, wherein the information in the backup strategy comprises read authority of a backup object, backup form, directory or medium in which a backup file is stored, backup node as a job scheduling server, a split threshold, and log and information to be recorded and submitted after the backup is finished.

3. The method according to claim 2, wherein when the backup cluster receives a request for restoring the backup object from a terminal, the backup cluster first checks whether the terminal is an authorized terminal in the backup policy; if the terminal is an authorized terminal, prompting the terminal to input a directory name or a file name to be recovered, a time point of recovering data and a recovery destination path; and then searching a backup file where the file needing to be restored is located according to the input information, and copying the backup file to a specified path.

4. The method according to claim 1, 2 or 3, wherein the scanning job and the copy job both have a set time threshold, and if the execution time exceeds the time threshold, failure information is returned to the job scheduling server; the job scheduling server selects a backup node to execute the failed scanning job or the failed copy job again; and if the number of executions of the same scanning job or the same copy job exceeds a set threshold, execution of the corresponding job is stopped and error information is generated and written into a backup log.

5. The method according to claim 1, 2 or 3, wherein the job scheduling server selects the backup node to execute according to the execution time of the job and the state of the backup node; wherein the job is the scan job or the copy job.

6. A method according to claim 1, 2 or 3, wherein the disk array is mounted in a distributed file system and provides a uniform name.

7. The method of claim 1, 2 or 3, wherein the disks distributed on a plurality of disk arrays are virtualized into a logical storage space in a distributed file system manner, and the logical storage space is mounted to each backup node in a shared directory; each backup node in the backup cluster uses the underlying disk array by accessing the shared directory.