CN112214357A

CN112214357A - HDFS data backup and recovery system and backup and recovery method

Info

Publication number: CN112214357A
Application number: CN202011188471.1A
Authority: CN
Inventors: 朱拓之
Original assignee: Shanghai Eisoo Information Technology Co Ltd
Current assignee: Shanghai Eisoo Information Technology Co Ltd
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-01-12
Anticipated expiration: 2040-10-30
Also published as: CN112214357B

Abstract

The invention relates to an HDFS data backup and recovery system and a backup and recovery method. In the system, each HDFS client in an HDFS unit is connected to a corresponding agent client (also called a proxy client); the agent clients are jointly connected to a virtual client; the virtual client is connected to a backup server; and the agent clients are also connected to the backup server, in which a storage medium is arranged. The backup server creates backup and recovery tasks, exchanges data with the agent clients, and manages the data on the storage medium. The virtual client locates the agent clients bound to it and dispatches the backup and recovery tasks to them. Each agent client executes a backup or recovery task by reading a backup object or writing a recovery object. The HDFS client receives and responds to the read and write operations issued by the agent client. Compared with the prior art, the invention supports a variety of backup and recovery requirements, manages backup data effectively, and improves backup and recovery efficiency by executing tasks concurrently.

Description

HDFS data backup and recovery system and backup and recovery method

Technical Field

The invention relates to the technical field of data backup and recovery, and in particular to an HDFS (Hadoop Distributed File System) data backup and recovery system and a backup and recovery method.

Background

FusionInsight HD is a distributed data processing system that provides large-capacity data storage, query, and analysis capabilities to external users. HDFS (Hadoop Distributed File System) is the underlying storage of FusionInsight HD and provides fault-tolerant, high-throughput storage to upper-layer applications. A pressing task for current HDFS deployments is therefore to protect the day-to-day data of FusionInsight HD efficiently, so that data can be recovered promptly when the system becomes abnormal or when a major operation does not produce the expected result, and the impact on services is kept to a minimum.

Existing HDFS backup schemes are based on the snapshot technology provided by HDFS, with backup data either retained in the HDFS file system or stored in external storage. This approach has the following defects:

1. backup data cannot be managed and utilized effectively;

2. in some scenarios, only full backup is supported, and selective recovery according to user requirements is not possible;

3. when there are multiple backup or restore objects, backup-restore efficiency is low.

Disclosure of Invention

The present invention aims to overcome the defects of the prior art by providing an HDFS data backup and recovery system and a backup and recovery method that manage backup data effectively, support a variety of backup and recovery requirements, and improve backup and recovery efficiency.

The purpose of the invention can be achieved by the following technical solution: an HDFS data backup and recovery system comprising an HDFS unit provided with a plurality of HDFS clients, wherein the HDFS clients are each connected to a corresponding one of a plurality of agent clients, the agent clients are jointly connected to a virtual client, the virtual client is connected to a backup server, and the agent clients are also each connected to the backup server; a storage medium for storing backup data is arranged in the backup server, and the backup server is used for creating backup and recovery tasks, exchanging data with the agent clients, and managing the data on the storage medium;

the virtual client is used for locating the plurality of agent clients connected to it and dispatching the backup and recovery tasks to them;

the proxy client is used for executing a backup recovery task so as to read an HDFS backup object or write an HDFS recovery object;

the HDFS client is used for receiving and responding to read or write operations provided by the proxy client.

Further, the HDFS client and the proxy client are both located on the same device.
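The connection topology described above can be illustrated with a minimal sketch. The class names, the `dispatch` and `start_task` methods, the host names, and the instruction string are all hypothetical and chosen only for illustration; the patent does not specify an API, and the storage medium is reduced to a dictionary here.

```python
from dataclasses import dataclass, field


@dataclass
class AgentClient:
    """Agent (proxy) client co-located with one HDFS client (hypothetical model)."""
    host: str
    received: list = field(default_factory=list)

    def execute(self, instruction: str) -> None:
        # A real agent would read from or write to HDFS here; the sketch
        # only records the instruction it was given.
        self.received.append(instruction)


@dataclass
class VirtualClient:
    """Logical grouping of the agent clients bound to one HDFS cluster."""
    agents: list

    def dispatch(self, instruction: str) -> int:
        # Locate every bound agent and hand it the same instruction.
        for agent in self.agents:
            agent.execute(instruction)
        return len(self.agents)


@dataclass
class BackupServer:
    """Creates tasks and owns the storage medium (a dict in this sketch)."""
    virtual_client: VirtualClient
    storage: dict = field(default_factory=dict)

    def start_task(self, instruction: str) -> int:
        return self.virtual_client.dispatch(instruction)


agents = [AgentClient("node-1"), AgentClient("node-2")]
server = BackupServer(VirtualClient(agents))
dispatched = server.start_task("full-backup:/data")
```

In this sketch the backup server never addresses an agent directly when starting a task; it goes through the virtual client, which mirrors the role the text assigns to it.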

An HDFS data backup method comprises the following steps:

A1, according to the data source, advanced backup parameters, and backup type selected by the user, the backup server initiates a backup task and sends the corresponding backup instruction to the plurality of agent clients connected to the virtual client;

A2, the agent clients each obtain the current time of the HDFS from their corresponding HDFS clients;

A3, the agent clients confirm the backup mode according to the received backup instruction;

A4, according to the backup mode, the agent clients each back up the time object through their corresponding HDFS clients, transmitting the current time point information of the HDFS to the backup server, where it is written into the storage medium;

A5, each agent client generates a backup object list by parsing the data source in the backup task;

A6, according to the backup object list, each agent client determines in turn, for each backup object, whether it has already been backed up, whether it is filtered out, and whether it is incremental data;

A7, each agent client passes each backup object determined to be incremental data to its corresponding HDFS client to read the file blocks of the backup object, transmits the file blocks to the backup server, where they are written into the storage medium, and then stores the corresponding HDFS connection information and the time point copy completeness flag to complete the backup task.

Further, the data source to be protected in step A1 is specifically an HDFS file or directory.

Further, step A4 specifically includes the following steps:

A41, if a full backup task is initiated, step A44 is executed;

A42, if an incremental backup task or a permanent incremental backup task is initiated, the agent client queries the backup server for the types of the existing time points according to the backup task; if a full backup time point is found and the time point copies between that full backup time point and the current time of the HDFS are complete, step A44 is executed; otherwise, the backup type is converted to full backup and step A44 is then executed;

A43, if a differential backup task is initiated, the agent client queries the backup server for the types of the existing time points according to the backup task; if the latest time point is a full backup time point and its time point copy is complete, step A44 is executed; otherwise, the backup type is converted to full backup and step A44 is then executed;

A44, the time object is backed up: the current time point information of the HDFS is transmitted to the backup server and written into the storage medium.
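The decision logic of steps A41 to A43 can be sketched as a small function. This is an illustration under stated assumptions, not the patented implementation: the `TimePoint` record, the type constants, and the choice of the most recent full backup as the reference for step A42 are all hypothetical, since the text does not say which full backup time point is meant when several exist.

```python
from dataclasses import dataclass

FULL = "full"
INCREMENTAL = "incremental"
DIFFERENTIAL = "differential"
PERMANENT = "permanent_incremental"


@dataclass
class TimePoint:
    kind: str       # backup type that produced this time point
    complete: bool  # whether the time point copy is marked complete


def resolve_backup_type(requested: str, history: list) -> str:
    """Return the backup type to run; `history` is ordered oldest first."""
    if requested == FULL:
        return FULL  # A41: a full backup needs no prior time point
    if requested in (INCREMENTAL, PERMANENT):
        # A42: look for a full backup time point (assumed: the latest one).
        full_idx = max(
            (i for i, tp in enumerate(history) if tp.kind == FULL),
            default=None,
        )
        if full_idx is None:
            return FULL
        # Every copy from that full backup onwards must be complete.
        if all(tp.complete for tp in history[full_idx:]):
            return requested
        return FULL
    if requested == DIFFERENTIAL:
        # A43: the latest time point must be a complete full backup.
        if history and history[-1].kind == FULL and history[-1].complete:
            return DIFFERENTIAL
        return FULL
    raise ValueError(f"unknown backup type: {requested}")
```

In every branch the function either keeps the requested type or falls back to a full backup, which matches the "convert to full backup, then execute A44" wording of the steps.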

Further, step A6 specifically passes the backup object through a load balancer to determine whether the backup object has already been backed up;

the backup object is passed through a file filter to determine if the backup object is filtered.
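The text does not describe how the load balancer or the file filter work internally. One plausible sketch, offered purely as an assumption, assigns each object to exactly one agent with a stable hash so that concurrent agents never back up the same object twice, and filters objects with glob-style exclusion patterns. The function names, the hashing scheme, and the pattern syntax are all hypothetical.

```python
import fnmatch
import hashlib


def owner_index(path: str, agent_count: int) -> int:
    """Map a path to exactly one agent using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % agent_count


def should_process(path, agent_index, agent_count, exclude_patterns=()):
    """Return True if this agent should handle `path`."""
    # Load balancer: skip objects that belong to another agent.
    if owner_index(path, agent_count) != agent_index:
        return False
    # File filter: skip objects matching any exclusion pattern.
    return not any(fnmatch.fnmatch(path, pat) for pat in exclude_patterns)


paths = ["/data/a.csv", "/data/b.tmp", "/data/c.csv"]
# For each path, list the agents (out of three) that would process it.
assignments = {
    p: [i for i in range(3) if should_process(p, i, 3)] for p in paths
}
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would make different agents disagree about ownership.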

Further, step A7 specifically includes the following steps:

A71, the agent client passes the backup object to the HDFS client, reads the file blocks of the backup object, transmits the file blocks to the backup server, and writes them into the storage medium;

A72, if the backup object is backed up successfully, the time point copy is marked as complete; otherwise, the time point copy is marked as incomplete;

A73, after the backup and the time point copy marking have been carried out for all backup objects in the backup object list, the HDFS connection information and the time point copy completeness flag corresponding to the backup objects are stored together to complete the backup task.
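Steps A71 to A73 can be sketched as a loop that copies each object's blocks and tracks a single completeness flag. The function and parameter names are hypothetical, the storage medium is modelled as a dictionary, and `fake_read_blocks` stands in for the HDFS client; treating any `OSError` as a failed backup is also an assumption.

```python
def run_backup(objects, read_blocks, storage, connection_info):
    """Back up each object and record whether the time point copy is complete."""
    copy_complete = True
    for name in objects:
        try:
            # A71: read the file blocks and write them to the storage medium.
            storage["data"][name] = b"".join(read_blocks(name))
        except OSError:
            # A72: any failed object makes the time point copy incomplete.
            copy_complete = False
    # A73: persist the HDFS connection information and the completeness flag.
    storage["meta"] = {
        "hdfs_connection": connection_info,
        "copy_complete": copy_complete,
    }
    return copy_complete


def fake_read_blocks(name):
    """Stand-in for the HDFS client; fails for one specific file."""
    if name == "broken.bin":
        raise OSError("simulated read failure")
    yield b"block-1"
    yield b"block-2"


storage = {"data": {}, "meta": {}}
result = run_backup(
    ["a.txt", "broken.bin"], fake_read_blocks, storage,
    {"namenode": "192.0.2.10"},
)
```

The completeness flag stored here is what a later incremental or differential backup would consult when deciding whether it must fall back to a full backup.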

An HDFS data recovery method comprises the following steps:

B1, according to the recovery time, recovery data, and recovery location selected by the user, the backup server initiates a recovery task and sends the corresponding recovery instruction to the plurality of agent clients connected to the virtual client;

B2, the agent clients parse the parameters of the received recovery instruction to determine the availability of the requested time and the data to be recovered;

B3, each agent client generates a recovery object list by parsing the data source in the recovery task;

B4, according to the recovery object list, each agent client determines in turn, for each recovery object, whether it has already been recovered and whether it is filtered out;

B5, the recovery data is obtained through data parsing and new path synthesis, and each agent client transmits the recovery data to the HDFS client to complete the recovery task.

Further, the step B4 is specifically to pass the recovery object through a load balancer to determine whether the recovery object has been backed up;

the restoration object is passed through a file filter to determine whether the restoration object is filtered.

Further, the step B5 specifically includes the following steps:

B51, the name of the data file to be recovered is obtained through data parsing;

B52, according to the configured recovery task, if the task requires recovery to a new path, the new path and the name of the data file to be recovered are spliced together as the recovery data; otherwise, the name of the data file to be recovered is used as the recovery data;

B53, each agent client transmits the corresponding recovery data to the HDFS client, which writes it into the HDFS according to the overwrite rule to complete the recovery task.
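The path synthesis of step B52 and the rule-governed write of step B53 can be sketched as follows. The function names are hypothetical, the target file system is modelled as a dictionary, and the rule is assumed to be a simple overwrite-or-skip switch; the patent does not enumerate the possible rules.

```python
import posixpath


def recovery_target(file_name, new_path=None):
    """B52: splice the new path and the file name, or keep the name as is."""
    if new_path:
        # Strip a leading slash so the name is joined under the new path.
        return posixpath.join(new_path, file_name.lstrip("/"))
    return file_name


def write_with_rule(fs, target, data, overwrite):
    """B53: write `data` to `target`, honouring an assumed overwrite flag."""
    if target in fs and not overwrite:
        return False  # existing file kept, nothing written
    fs[target] = data
    return True


fs = {"/data/a.txt": b"old"}
relocated = recovery_target("/data/a.txt", "/restore")
in_place = recovery_target("/data/a.txt")
skipped = write_with_rule(fs, in_place, b"new", overwrite=False)
written = write_with_rule(fs, relocated, b"new", overwrite=False)
```

`posixpath` is used rather than `os.path` because HDFS paths use forward slashes regardless of the operating system the agent runs on.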

Compared with the prior art, the invention has the following advantages:

First, the backup server is responsible for creating backup tasks, issuing backup/recovery instructions to the agent clients, receiving the data returned by the agent clients, interacting with the storage medium to read and write data, and managing the data in the storage medium, so that backup tasks and backup data are managed effectively and on a regular basis.

Second, the virtual client is connected to the plurality of agent clients, and the backup and recovery tasks created by the backup server are located and distributed to the agent clients through the virtual client. Concurrent backup/recovery by multiple clients is thereby supported, the backup/recovery window is shortened, and backup/recovery efficiency is effectively improved.

Third, by obtaining the last modification time of each backup object and the current time of the HDFS and combining them with the backup time points, the invention can provide not only full backup but also incremental backup, differential backup, and permanent incremental backup, and thus supports a variety of backup requirements.

Fourth, the invention removes the file path when backing up data and backs up only the file name and the attributes of the backup object; when recovering data, it only needs to splice the recovery path and the file name in order to recover the file content and file attributes, which keeps the recovered file attributes consistent with those at backup time. Recovery to the original location, a different location, the original machine, a different machine, or even a different file system is thereby supported.
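How modification times and backup time points might combine to select incremental data can be sketched as below. This is an assumption about the comparison, not a description taken from the patent: an incremental backup is assumed to compare against the latest time point of any type, a differential backup against the latest full backup time point, and a file is assumed to need backup when it was modified after that reference. All names are hypothetical.

```python
def reference_time(backup_type, time_points):
    """Pick the time point to compare against.

    `time_points` is a list of (timestamp, kind) tuples, oldest first.
    Returns None when everything must be backed up.
    """
    if backup_type == "full" or not time_points:
        return None
    if backup_type == "differential":
        fulls = [ts for ts, kind in time_points if kind == "full"]
        return fulls[-1] if fulls else None
    # Incremental and permanent incremental: latest time point of any type.
    return time_points[-1][0]


def needs_backup(last_modified, ref):
    """A file is incremental data if it changed after the reference time."""
    return ref is None or last_modified > ref


history = [(100, "full"), (200, "incremental")]
```

With the example history, a file last modified at time 150 would be picked up by a differential backup (reference 100) but skipped by an incremental backup (reference 200).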

Drawings

FIG. 1 is a schematic diagram of the system of the present invention;

FIG. 2 is a schematic diagram of a data backup process according to the present invention;

FIG. 3 is a schematic diagram of a data recovery process according to the present invention;

FIG. 4 is a flow diagram illustrating the installation of a proxy client in an embodiment;

FIG. 5 is a flow diagram illustrating the creation of a virtual client in an embodiment;

FIG. 6 is a schematic diagram of an embodiment of a data backup process;

FIG. 7 is a diagram illustrating a data recovery process according to an embodiment;

Reference numerals in the figures: 1, HDFS unit; 11, HDFS client; 2, proxy client; 3, virtual client; 4, backup server; 41, storage medium.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments.

Examples

As shown in fig. 1, an HDFS data backup and recovery system includes an HDFS unit 1 having a plurality of HDFS clients 11, the plurality of HDFS clients 11 are respectively and correspondingly interconnected with a plurality of agent clients 2, the plurality of agent clients 2 are commonly interconnected with a virtual client 3, the virtual client 3 is interconnected with a backup server 4, the plurality of agent clients 2 are also respectively interconnected with the backup server 4, a storage medium 41 for storing backup data is disposed in the backup server 4, the backup server 4 is configured to create a backup and recovery task, perform data interaction with the agent clients 2, and perform data management on the storage medium 41;

the virtual client 3 is used for positioning the backup and recovery tasks to a plurality of agent clients 2 connected with the virtual client;

the agent client 2 is used for executing a backup recovery task to read an HDFS backup object or write an HDFS recovery object;

the HDFS client 11 is used for receiving and responding to the read or write operation provided by the proxy client 2;

the HDFS client 11 and the proxy client 2 are both located on the same device.

Specifically, the backup server 4 serves as the management console of the backup software and manages all resources, including the virtual client 3, the agent clients 2, and the storage medium 41. It is responsible for creating backup tasks, issuing backup/recovery instructions to the agent clients 2, receiving the data returned by the agent clients 2, interacting with the storage medium 41 to read and write data, and clearing expired data from the storage medium 41. By setting a copy retention policy in the backup server, such as a copy count and a retention time, copies exceeding the set values are cleared automatically, which improves the utilization of the backup storage space; unneeded copies can also be deleted manually;

the storage medium 41 is a data storage unit of backup software for storing backup data;

the virtual client 3 is a set of physical agent clients and ensures that a backup/recovery task can be executed concurrently by multiple clients. The virtual client 3 is used for initiating backup and recovery tasks: it is the virtual client with which a backup or recovery task is associated, and it manages the initiation of the whole task and its execution result. The corresponding agent clients 2 are located through the virtual client 3 to start the task, while the data interaction of the task takes place between the agent clients 2 and the backup server 4;

the agent client 2 is used as an agent of backup software on the client, is responsible for interacting with the backup server 4, receiving and responding to commands issued by the backup server 4, and returns execution results to the backup server 4; interacting with an HDFS client 11, reading an HDFS backup object, and writing an HDFS recovery object;

the HDFS client 11 is implemented based on the Hadoop client provided by FusionInsight HD, is located on the same device as the proxy client 2, receives and responds to the read/write operations issued by the proxy client 2, forwards the corresponding operations to the HDFS, and returns the responses of the HDFS to the proxy client 2.
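The copy retention policy mentioned for the backup server (a copy count and a retention time) can be sketched as a function that lists the copies to clear. The function name, the tuple layout, and the use of plain numeric timestamps are assumptions for illustration. The sketch also ignores dependencies between copies, such as an incremental copy that relies on an older full copy, which a real policy would have to respect.

```python
def expired_copies(copies, now, max_copies=None, max_age=None):
    """Return the ids of copies that exceed the retention policy.

    `copies` is a list of (copy_id, created_at) tuples; `now`, `created_at`
    and `max_age` share the same time unit.
    """
    # Rank copies from newest to oldest so the newest ones are kept.
    ordered = sorted(copies, key=lambda c: c[1], reverse=True)
    expired = []
    for rank, (copy_id, created_at) in enumerate(ordered):
        too_many = max_copies is not None and rank >= max_copies
        too_old = max_age is not None and now - created_at > max_age
        if too_many or too_old:
            expired.append(copy_id)
    return expired


copies = [("c1", 10), ("c2", 20), ("c3", 30)]
by_count = expired_copies(copies, now=35, max_copies=2)
by_age = expired_copies(copies, now=35, max_age=10)
```

Either limit can be omitted, and when both are given a copy is cleared as soon as it violates one of them.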

The above system is applied to practice, and the data backup process is shown in fig. 2, and includes the following steps:

A1, according to the data source to be protected (an HDFS file or directory), the advanced backup parameters, and the backup type selected by the user, the backup server initiates a backup task and sends the corresponding backup instruction to the plurality of agent clients connected to the virtual client;

A4, according to the backup mode, the agent clients each back up the time object through their corresponding HDFS clients, transmitting the current time point information of the HDFS to the backup server, where it is written into the storage medium, specifically:

A41, if a full backup task is initiated, step A44 is executed;

A44, the time object is backed up: the current time point information of the HDFS is transmitted to the backup server and written into the storage medium;

A6, according to the backup object list, each agent client determines in turn, for each backup object, whether it has already been backed up, whether it is filtered out, and whether it is incremental data, wherein the backup object is passed through a load balancer to determine whether the backup object has already been backed up;

the backup object is passed through a file filter to determine whether the backup object is filtered out;

A7, each agent client passes each backup object determined to be incremental data to its corresponding HDFS client to read the file blocks of the backup object, transmits the file blocks to the backup server, where they are written into the storage medium, and then stores the corresponding HDFS connection information and the time point copy completeness flag to complete the backup task, specifically:

The above system is applied to practice, and the data recovery process is shown in fig. 3, and includes the following steps:

B4, according to the recovery object list, each agent client determines in turn, for each recovery object, whether it has already been recovered and whether it is filtered out; similarly, the recovery object is passed through a load balancer to determine whether the recovery object has been backed up;

passing the restored object through a file filter to determine whether the restored object is filtered;

B5, the recovery data is obtained through data parsing and new path synthesis, and each agent client transmits the recovery data to the HDFS client to complete the recovery task, specifically:

B51, the name of the data file to be recovered is obtained through data parsing;

In the invention, the whole backup and recovery system consists of the agent clients, the storage medium, the backup server, and the HDFS clients, and the HDFS unit and the backup service communicate over the TCP/IP protocol.

The backup/recovery task execution result is determined by all the agent clients associated with the virtual client, and when all the agent clients fail, the task fails, otherwise, the task is successful or partially successful.
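The aggregation rule stated above can be written as a short function. The function name and the result labels are hypothetical; the rule itself (all agents fail means failure, otherwise success or partial success) follows the text.

```python
def task_result(agent_succeeded):
    """Combine per-agent outcomes into one task result.

    `agent_succeeded` is a list of booleans, one per agent client bound to
    the virtual client.
    """
    if not any(agent_succeeded):
        return "failed"  # every agent failed (or none was bound)
    if all(agent_succeeded):
        return "success"
    return "partial_success"
```

Treating an empty list as a failure is a choice made for this sketch; the text does not cover a virtual client with no bound agents.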

To construct the backup and recovery system, an agent client must be installed and a virtual client must be created. The installation of the agent client is shown in FIG. 4, the creation of the virtual client is shown in FIG. 5, and the specific backup and recovery flows are shown in FIG. 6 and FIG. 7, respectively. The HDFS client and the agent client are on the same machine, and the Hadoop client provided by FusionInsight HD must be installed in advance. The NameNode IP configured for the HDFS cluster must be the IP of the primary NameNode, and that NameNode must be in active mode. The user must hold management permission for the agent client and usage permission for the corresponding storage medium, and must configure correct information for the HDFS to be backed up, such as the NameNode IP, user name, and Kerberos settings.

As shown in FIG. 4, the FusionInsight HD option must be selected when the agent client is installed:

1. the user starts the client installation program;

2. the installation option that supports FusionInsight HD is selected;

3. the locations of the Hadoop Client and the component _ env _ C _ example script are entered;

4. an environment variable file is generated from the parameters provided in step 3, and step 5 is executed;

5. the installation completes; if it succeeds, backup of FusionInsight HD HDFS is supported, and if it fails, backup of FusionInsight HD HDFS is not supported.

As shown in fig. 5, the virtual client creation flow is as follows:

1. creating a virtual client, inputting a NameNode ip and a user name, and executing the step 2;

2. selecting physical clients needing to be bound, setting the kerbtickcachepath of each client, and executing the step 3;

3. submitting parameters and executing the step 4;

4. the validity of the parameters is checked and a connection to the HDFS is attempted; if the check passes, step 5 is executed, otherwise step 6 is executed;

5. the creation is successful;

6. the creation fails, and an error is reported.
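The creation flow above can be sketched as a validation function. The function name, the error messages, the shape of the `ticket_cache_paths` mapping (client name to its kerbtickcachepath value), and the `connect` callback standing in for the HDFS connection attempt are all hypothetical; the patent does not list the individual checks performed in step 4.

```python
import ipaddress


def create_virtual_client(namenode_ip, user_name, ticket_cache_paths, connect):
    """Validate the submitted parameters and try to connect to HDFS.

    Returns (created, message). `connect(ip, user)` is a caller-supplied
    callable that returns True when the HDFS connection succeeds.
    """
    try:
        ipaddress.ip_address(namenode_ip)
    except ValueError:
        return False, "invalid NameNode IP"
    if not user_name:
        return False, "missing user name"
    # Every bound physical client needs a ticket cache path.
    if not ticket_cache_paths or not all(ticket_cache_paths.values()):
        return False, "missing ticket cache path"
    if not connect(namenode_ip, user_name):
        return False, "cannot connect to HDFS"
    return True, "created"
```

A call such as `create_virtual_client("192.0.2.10", "hdfs", {"node-1": "/tmp/krb5cc_0"}, connect)` would take the success branch (step 5) when `connect` returns True; the address and path here are placeholder values.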

As shown in fig. 6, the backup process is as follows:

1. a user selects a data source (HDFS file or directory) needing protection, selects a backup high-level parameter and a backup type, initiates backup, and sends a backup instruction to a backup agent client bound with a virtual client;

2. each backup agent acquires the current time of the HDFS and executes the step 3;

3. each backup agent receives the backup instruction and confirms the backup type:

3.1 if a full backup is initiated, step 4 is executed;

3.2 if an incremental backup is initiated, the backup agent queries the backup service for the types of the existing time points according to the task parameters; if a full backup time point is found and the time point copies between that full backup time point and the current time are complete, step 4 is executed; otherwise, the backup type is converted to full backup and step 4 is executed;

3.3 if a differential backup is initiated, the backup agent queries the backup service for the types of the existing time points according to the task parameters; if the latest time point is a full backup time point and its time point copy is complete, step 4 is executed; otherwise, the backup type is converted to full backup and step 4 is executed;

3.4 if a permanent incremental backup is initiated, step 3.2 is executed;

4. backing up the time object, writing the time point information into the storage medium, and executing the step 5;

5. analyzing a data source, generating a backup object list, and executing the step 6;

6. the backup object is checked by the load balancer (when multiple clients run concurrently) and by the file filter (when file filtering is enabled); if it passes, step 7 is executed, otherwise the flow returns to step 5;

7. for the backup object, if the backup is incremental, whether the object has incremental data is determined; if it does, step 8 is executed, otherwise the flow returns to step 5;

8. the backup object is passed to the HDFS Client, its file blocks are read through the HDFS and written into the backup storage, and step 9 is executed;

9. if the backup object is backed up successfully, the copy is marked as complete and step 5 is executed; otherwise, the copy is marked as incomplete and step 11 is executed;

10. if all backup objects have been backed up, step 11 is executed;

11. the special metadata (HDFS connection information) and the copy completeness are stored, and the task of the current backup agent client ends.

As shown in fig. 7, the recovery flow is as follows:

1. selecting time needing to be recovered and a recovery file or a directory by a user, selecting a recovery position, and initiating recovery;

2. the agent client receives the recovery instruction, analyzes the parameters, determines the availability of time and the data information needing to be recovered, and executes the step 3;

3. starting a data source reader, analyzing a data source, generating a recovery object, and executing the step 4;

4. generating a recovery object list, sequentially taking out backup objects, and executing the step 5;

5. the recovery object is checked by the load balancer (when multiple clients run concurrently) and by the file filter (when file filtering is enabled); if it passes, step 6 is executed, otherwise the flow returns to step 4;

6. according to the recovery destination category, the new path is synthesized and the data is recovered, and step 7 is executed; in this embodiment, the recovery destination categories include the HDFS file system and extX file systems under Linux;

7. the agent client sends the data to the HDFS client, which writes it into the HDFS according to the overwrite rule; if the write succeeds, step 5 is executed, and if it fails, step 8 is executed;

8. the current agent client ends the recovery task.

In summary, based on the data obtained through the JNI interface provided by HDFS, the present invention can provide full backup, incremental backup, differential backup, and permanent incremental backup, and, because snapshot technology is not used, backup objects can be configured flexibly.

Claims

1. The HDFS data backup and recovery system is characterized by comprising an HDFS unit (1) provided with a plurality of HDFS clients (11), wherein the HDFS clients (11) are respectively and correspondingly connected with a plurality of agent clients (2), the agent clients (2) are commonly connected with a virtual client (3), the virtual client (3) is connected with a backup server (4), the agent clients (2) are also respectively connected with the backup server (4), a storage medium (41) for storing backup data is arranged in the backup server (4), and the backup server (4) is used for creating a backup and recovery task, performing data interaction with the agent clients (2) and performing data management on the storage medium (41);

the virtual client (3) is used for positioning backup and recovery tasks to a plurality of agent clients (2) connected with the virtual client;

the proxy client (2) is used for executing a backup recovery task to read an HDFS backup object or write an HDFS recovery object;

the HDFS client (11) is used for receiving and responding to read or write operations provided by the proxy client (2).

2. The HDFS data backup-restore system according to claim 1, wherein the HDFS client (11) and the proxy client (2) are located on the same device.

3. An HDFS data backup method using the system of claim 1, comprising the steps of:

4. The HDFS data backup method according to claim 3, wherein the data source to be protected in step A1 is specifically an HDFS file or directory.

5. The HDFS data backup method according to claim 3, wherein step A4 specifically includes the following steps:

A41, if a full backup task is initiated, step A44 is executed;

6. The HDFS data backup method according to claim 3, wherein step A6 specifically passes the backup object through a load balancer to determine whether the backup object has already been backed up;

7. The HDFS data backup method according to claim 5, wherein step A7 specifically includes the following steps:

8. An HDFS data recovery method using the system of claim 1, comprising the steps of:

9. The HDFS data recovery method according to claim 8, wherein the step B4 is specifically to pass the recovery object through a load balancer to determine whether the recovery object has been backed up;

10. The HDFS data recovery method according to claim 8, wherein the step B5 specifically includes the following steps:

B51, the name of the data file to be recovered is obtained through data parsing;