CN117370080B - Data backup and data recovery method, system, equipment and medium for Hive - Google Patents


Info

Publication number: CN117370080B
Application number: CN202311647374.8A
Authority: CN (China)
Other versions: CN117370080A
Language: Chinese (zh)
Inventors: 薛鹏, 蔡涛
Assignee: Shenzhen Mulangyun Technology Co., Ltd.
Legal status: Active (application granted)
Prior art keywords: data, backup, metadata, service data, snapshot

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1448: Management of the data involved in backup or backup restore
    • G06F 11/1453: Management of the data involved in backup or backup restore using de-duplication of the data
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G06F 21/64: Protecting data integrity, e.g. using checksums, certificates or signatures


Abstract

The application discloses a data backup and data recovery method, system, device, and medium for Hive. The method comprises the following steps: based on a first storage path of the service data in the Hadoop Distributed File System (HDFS), obtaining a second storage path of snapshot data generated from the service data in the HDFS, and obtaining the snapshot data of the service data from the second storage path; querying the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data, and generating a metadata backup file recording that HQL statement; and backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system. The method supports database-level and table-level backup, requires no direct access to the relational database storing the metadata for backup or recovery (ensuring the metadata cannot be tampered with), supports incremental backup and incremental recovery, reduces storage-space usage, and improves the efficiency of data backup and recovery.

Description

Data backup and data recovery method, system, equipment and medium for Hive
Technical Field
The present invention relates to the fields of Internet technologies and data storage technologies, and in particular to a data backup and data recovery method, system, apparatus, and medium for Hive.
Background
Apache Hive is a Hadoop-based data warehouse tool that can map structured data files stored in HDFS to database tables and provide simple SQL-like query functions. At present, most service-data backup is done with DistCp, which splits the copy work into Map tasks and performs a distributed copy. Metadata is backed up mainly by using the backup and restore functions of the underlying relational database.
However, this scheme requires the metadata and the service data to be backed up separately. For service-data backup, the following problems exist: first, DistCp currently supports only HDFS and S3 and has poor compatibility with NFS; second, the DistCp task runs inside the Hadoop cluster and consumes cluster resources. For metadata backup, the scheme also has disadvantages: first, the backup is strongly coupled to the specific relational database storing the metadata and is difficult to make compatible with multiple relational databases; second, only whole-database backup and recovery is supported, with no fine-grained backup and recovery at the database or table level; third, for security reasons, direct access to the relational database storing the metadata is generally not allowed.
Therefore, how to improve the compatibility of data backup and reduce the amount of backed-up data, so as to improve the efficiency of data backup and recovery, is a technical problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a data backup and data recovery method, system, device, and medium for Hive, which solve or partially solve the problems of improving the compatibility of data backup and reducing the backed-up data volume, so as to improve the efficiency of data backup and recovery.
A data backup method for Hive, comprising:
based on a first storage path of the service data in the Hadoop Distributed File System (HDFS), obtaining a second storage path of snapshot data generated from the service data in the HDFS, and obtaining the snapshot data of the service data from the second storage path;
querying the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data, and generating a metadata backup file recording that HQL statement;
and backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system.
The present application may be further configured in a preferred example to: backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system, comprising:
acquiring a backup record;
and carrying out incremental backup on the service data based on the backup record.
The present application may be further configured in a preferred example to: the backup record comprises a first data path, a snapshot file size and a snapshot last modification time;
based on the backup record, performing incremental backup on the service data, including:
if the first data path, the snapshot file size, and the snapshot last-modification time are all the same, recording only the metadata of the service-data file in which the service data is located;
if, for the same first data path, either the snapshot file size or the snapshot last-modification time differs, backing up the service-data file in which the service data is located and recording the metadata of that file.
The present application may be further configured in a preferred example to: backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system, comprising:
and performing concurrent, multithreaded backup of the snapshot data through the RESTful interface provided by WebHDFS, so as to back up the service data to the backup system.
A data recovery method for Hive, comprising:
acquiring a service data recovery request or a metadata recovery request;
based on the service data recovery request, acquiring a data path of the service data stored in the HDFS, and recovering the service data in the backup system into the HDFS through a data interface provided by the WebHDFS;
based on the metadata recovery request, obtaining the HQL file in the backup system and processing the metadata in the HQL file through Hive, so as to recover the metadata.
The present application may be further configured in a preferred example to: the data processing is carried out on the metadata in the HQL file through the Hive algorithm, and the method is used for realizing the recovery of the metadata and comprises the following steps:
and sequentially executing a delete table recorded in the HQL file, a create table and HQL sentences of the correction partition by the Hive algorithm, and associating the table where the metadata is positioned with the service data in the HDFS by the correction partition for realizing the recovery of the metadata.
The second purpose of the application is to provide a data backup and data recovery system for Hive.
The second object of the present application is achieved by the following technical solutions:
a data backup system for Hive, comprising:
the snapshot data module, used to obtain, based on a first storage path of the service data in the Hadoop Distributed File System (HDFS), a second storage path of the snapshot data generated from the service data in the HDFS, and to obtain the snapshot data of the service data from the second storage path;
the metadata backup file generation module, used to query the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data and to generate a metadata backup file recording that statement;
and the service data and metadata backup module, used to back up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system.
A data recovery system for Hive, comprising:
the acquisition recovery request module is used for acquiring a service data recovery request or a metadata recovery request;
the service data recovery module is used for acquiring a data path of the service data stored in the HDFS based on the service data recovery request and recovering the service data in the backup system into the HDFS through a data interface provided by the WebHDFS;
the metadata recovery module, used to obtain the HQL file in the backup system based on the metadata recovery request and to process the metadata in the HQL file through Hive, so as to recover the metadata.
The third object of the present application is to provide an electronic device.
The third object of the present application is achieved by the following technical solutions:
an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data backup method for Hive described above and/or the data recovery method for Hive described above.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the data backup method for Hive described above and/or the data recovery method for Hive described above.
In summary, the present application includes the following beneficial technical effects:
According to the data backup and data recovery method for Hive: based on the first storage path of the service data in the Hadoop Distributed File System (HDFS), the second storage path of the snapshot data generated from the service data is obtained, and the snapshot data of the service data is obtained from the second storage path; the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data is queried, and a metadata backup file recording that statement is generated; and the snapshot data and the metadata backup file are backed up to the backup system by calling an interface provided by the backup system. For recovery, a service data recovery request or a metadata recovery request is obtained; based on the service data recovery request, the data path of the service data stored in HDFS is obtained, and the service data in the backup system is restored into HDFS through the data interface provided by WebHDFS; based on the metadata recovery request, the HQL file in the backup system is obtained and the metadata is processed through Hive to recover it. Backup and recovery require no extra HDFS storage space and no direct access to the relational database storing the metadata, ensuring the metadata cannot be tampered with. In addition, the method supports incremental backup and incremental recovery, reduces storage-space usage, supports database-level and table-level backup and recovery, and improves the efficiency of data backup and recovery.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of a data backup method for Hive according to an embodiment of the present application;
FIG. 2 is a process diagram of a data backup method for Hive according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an overall data backup method for Hive according to an embodiment of the present application;
FIG. 4 is a flowchart of a data recovery method for Hive according to an embodiment of the present application;
FIG. 5 is a process diagram of a data recovery method for Hive according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating an overall data recovery method for Hive according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data backup system for Hive according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an overall data backup system for Hive according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a data recovery system for Hive according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an overall data recovery system for Hive according to an embodiment of the present disclosure;
fig. 11 is a schematic diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is, however, susceptible of embodiment in many other forms than those described herein and similar modifications can be made by those skilled in the art without departing from the spirit of the application, and therefore the application is not to be limited to the specific embodiments disclosed below.
The terminology used in the following embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any and all possible combinations of one or more of the listed items.
The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the drawings above are intended to cover a non-exclusive inclusion; the terms "first", "second", and the like in the description, the claims, and the figures are used to distinguish between different objects and not necessarily to describe a sequential or chronological order.
For ease of understanding, related terms and related concepts related to the embodiments of the present application are described below.
Hadoop cluster resources refer to a generic term for all hardware and software resources that make up a Hadoop cluster, including network bandwidth, CPU, server nodes, disk space, and the like. The Hadoop cluster resource is mainly responsible for running application programs, processing data, carrying out communication, coordination and other works.
The HDFS resources refer to data blocks and metadata stored in the HDFS distributed file system. The HDFS resource is mainly responsible for managing and storing a large amount of distributed data, and guaranteeing the reliability and safety of the data.
The embodiment of the application provides a data backup method for Hive, and the main flow of the method is described as follows:
referring to fig. 1, S10, based on a first storage path of service data in a Hadoop distributed file system HDFS, a second storage path of snapshot data generated based on the service data in the HDFS is acquired, and snapshot data of the service data is acquired from the second storage path.
Apache Hadoop is an open-source software framework supporting data-intensive distributed applications, released under the Apache 2.0 license; it helps solve data- and computation-intensive problems using a network of many computers. Its core modules are divided into storage and computation: the former is the Hadoop Distributed File System (HDFS) and the latter is the MapReduce computation model.
MapReduce is a software architecture proposed by Google for parallel computation over large-scale datasets. A user-specified "Map" function transforms a set of key-value pairs into a new set of intermediate key-value pairs, and a concurrent "Reduce" function merges all intermediate values that share the same key.
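The Map/Reduce split described above can be sketched in miniature. This is a hedged illustration in plain Python, not the Hadoop Java API; the word-count example and all names are illustrative only:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit an intermediate (word, 1) key-value pair for each word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group intermediate values by key; Reduce: sum counts per word
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(map_phase(["hive backup", "hive restore"]))
```

In a real Hadoop job the Map tasks run in parallel across the cluster and the framework performs the shuffle; here the same data flow is shown sequentially.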
Wherein, an HDFS snapshot is a read-only copy of the file system at a certain moment in time. A snapshot may cover a subtree of the file system or the entire file system. Snapshots are mainly used for data backup, protection against user error, and disaster recovery.
A snapshot in this application builds an index over the file system; each time a file is updated, the file is not changed in place; instead, new space is allocated to store the changed file.
Snapshot creation is instantaneous: its cost is O(1), excluding the time to look up the snapshotted directory. Additional memory is used if and only if files are updated under the directory for which the snapshot was made; the memory used is O(M), where M is the number of changed files or directories. Creating a snapshot does not copy the blocks in the DataNodes; only the list and size information of the file blocks are recorded, so snapshots do not affect normal HDFS operation. Changes made after the snapshot are recorded in reverse chronological order: the user always accesses the current, latest data, and the content of a snapshot is the file's content at the moment the snapshot was created, i.e. the current content minus the changes made since.
Specifically, this embodiment creates an HDFS snapshot for the service-data path of the Hive table and queries the data path of the service data within that snapshot. Assuming /foo is a snapshottable directory, /foo/bar is a file or directory under /foo, and /foo has a snapshot s0, then the path /foo/.snapshot/s0/bar corresponds to the snapshot copy of /foo/bar; that is, the path inserts /.snapshot/<snapshot-name> after the snapshottable directory.
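The path rule above (inserting /.snapshot/<snapshot-name> after the snapshottable directory) can be sketched as a small helper; the function name is hypothetical:

```python
def snapshot_path(snapshot_root, snapshot_name, path):
    """Map a live HDFS path to its read-only copy inside a snapshot.

    snapshot_root must be a snapshottable directory that is a prefix of path.
    """
    if not path.startswith(snapshot_root):
        raise ValueError("path is not under the snapshottable directory")
    relative = path[len(snapshot_root):]
    # Insert /.snapshot/<snapshot-name> between the root and the relative part
    return f"{snapshot_root}/.snapshot/{snapshot_name}{relative}"

# /foo has snapshot s0; /foo/bar maps to its snapshot copy
second_path = snapshot_path("/foo", "s0", "/foo/bar")
```

This is the "second storage path" of step S10: the backup reads from the snapshot copy, so it sees a consistent point-in-time view even while the live data keeps changing.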
Step S10 serves to enable recovery of important data, to prevent user error, and to ensure the consistency of the backup data.
S20, querying the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data, and generating a metadata backup file recording that HQL statement.
Wherein, the HQL (Hive Query Language) CREATE TABLE statement of a Hive table is closely related to its metadata. In Hive, metadata describes the table structure and column information, while HQL statements are the main means of creating table structures and inserting data. That is, metadata is the descriptive information of Hive tables, and HQL statements are an important means of manipulating that metadata.
Specifically, this embodiment may create a table with a specified name through CREATE TABLE, throwing an exception if a table of the same name already exists; the user may ignore this exception with the IF NOT EXISTS option. The embodiment also allows the user to create an external table through the EXTERNAL keyword (the default is an internal table). An external table must specify a path (LOCATION) pointing to the actual data when it is created. When Hive creates an internal table, the data is moved into the path pointed to by the data warehouse; when an external table is created, only the path where the data resides is recorded, and the location of the service data is not changed. When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted and the service data is retained.
This embodiment may also make a table a partitioned table through PARTITIONED BY. For each table or partition, Hive can further organize the data into buckets, i.e. a finer-grained division of the data range: Hive hashes the bucketing column's value and takes it modulo the number of buckets to determine which bucket a record is stored in.
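The bucket-assignment rule (hash of the column value, modulo the number of buckets) can be illustrated as below. Hive uses its own hash function internally, so CRC32 here is only a deterministic stand-in for illustration:

```python
import zlib

def bucket_for(column_value, num_buckets):
    # Hash the bucketing column's value, then take it modulo the bucket count.
    # CRC32 is an illustrative, deterministic stand-in for Hive's own hash.
    return zlib.crc32(str(column_value).encode("utf-8")) % num_buckets

b = bucket_for("user_42", 8)
```

The same value always lands in the same bucket, which is what makes bucketed joins and sampling possible.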
In this embodiment, the HQL statement is saved into an HQL file (that is, the metadata of the Hive table is saved in the HQL file), the HQL file is then saved into the backup system, and the metadata is stored under the metadata subdirectory.
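The layout described here (HQL files under a metadata subdirectory, separate from the data subdirectory for service data) might be sketched as follows. The directory names and the function are assumptions for illustration, not the patent's exact layout:

```python
import os
import tempfile

def save_metadata_backup(backup_root, db, table, create_table_hql):
    # Metadata (the table's CREATE TABLE HQL) goes under the metadata
    # subdirectory; service-data files would go under a separate data
    # subdirectory. Paths here are illustrative assumptions.
    metadata_dir = os.path.join(backup_root, "metadata", db)
    os.makedirs(metadata_dir, exist_ok=True)
    hql_path = os.path.join(metadata_dir, f"{table}.hql")
    with open(hql_path, "w", encoding="utf-8") as f:
        f.write(create_table_hql)
    return hql_path

root = tempfile.mkdtemp()
path = save_metadata_backup(
    root, "sales", "orders",
    "CREATE TABLE sales.orders (id INT) PARTITIONED BY (dt STRING);")
```

Because the backup is just the CREATE TABLE text, it is independent of whichever relational database backs the Hive metastore.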
The function of step S20 is that, during backup, the metadata backup is independent of the relational database storing the metadata, which improves the compatibility of the data backup.
S30, backing up the snapshot data and the metadata backup files to the backup system by calling an interface provided by the backup system.
Specifically, in this embodiment the service data under the second storage path in the HDFS snapshot is backed up to the backup system through the RESTful interface provided by WebHDFS, and the service data is stored under the data subdirectory. The service data is the data in the Hive table, that is, the actual data stored in HDFS or HBase (a distributed storage system). The service data may also carry user-facing information, such as metadata recording the service description of a data item, which helps users make use of the data. WebHDFS provides a RESTful interface for accessing HDFS; it is a built-in component and is enabled by default. WebHDFS allows clients outside the cluster to access HDFS without installing Hadoop or a Java environment, and clients are not limited to any particular language.
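A WebHDFS read is an ordinary HTTP call against the NameNode. The URL shape below follows the public WebHDFS REST API (op=OPEN reads a file); the host name and the Hadoop 3 default WebHDFS port 9870 are assumptions:

```python
def webhdfs_url(namenode_host, port, hdfs_path, op, **params):
    # Build a WebHDFS RESTful URL, e.g. op=OPEN to read a file's contents.
    # An external backup client can call this URL with any HTTP library,
    # with no Hadoop or Java installation required.
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{namenode_host}:{port}/webhdfs/v1{hdfs_path}?{query}"

url = webhdfs_url("namenode", 9870,
                  "/warehouse/sales/.snapshot/s0/part-00000", "OPEN")
```

Note the path points into the .snapshot directory, so the backup reads the frozen point-in-time copy rather than the live file.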
Specifically, as shown in fig. 2, in this embodiment, service data in the HDFS is backed up to the backup system through the WebHDFS interface.
The function of step S30 is to store the service data under the data subdirectory according to the second storage path, without requiring additional HDFS storage space, and without running in the Hadoop cluster, and without consuming cluster resources.
According to the data backup method for Hive: based on the first storage path of the service data in the Hadoop Distributed File System (HDFS), the second storage path of the snapshot data generated from the service data is obtained, and the snapshot data of the service data is obtained from that path; the HQL (CREATE TABLE) statement of the Hive table corresponding to the service data is queried, and a metadata backup file recording that statement is generated; and the snapshot data and the metadata backup file are backed up to the backup system by calling an interface provided by the backup system, thereby completing the data backup. The method needs no extra HDFS storage space and no direct access to the relational database storing the metadata, ensuring the metadata cannot be tampered with. It also supports incremental backup, reducing storage-space usage and improving backup efficiency.
Referring to fig. 3, in some possible embodiments, step S20, namely querying the HQL statement of the Hive table corresponding to the service data and generating a metadata backup file recording that statement, includes:
S21, backing up the metadata in the relational database to the backup system through the SHOW CREATE TABLE command provided by Hive.
Metadata can be understood simply as data about data. Here, the metadata mainly stores the table name, the table's columns and partitions and their attributes, the table's own attributes (such as whether it is an external table), and the directory in which the table's data resides; it is usually stored in a relational database. More generally, metadata may describe an element's attributes (name, size, data type, etc.), its structure (length, fields, data columns, etc.), or related information (location, how to access it, owner, etc.).
In some possible embodiments, step S30, that is, backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system, includes:
S31, performing concurrent, multithreaded backup of the snapshot data using the RESTful interface provided by WebHDFS, so as to back up the service data to the backup system.
Wherein, multithreading means that several executing parts of one application can run simultaneously, completing multiple tasks at the same time and improving resource utilization. A thread has a life cycle of creation, runnable, blocked, running, and terminated; during execution a thread will inevitably block while waiting for some condition to be met. Rather than wasting CPU resources during that wait, another thread can be started to do other work, much like preparing the tea set and tea leaves while waiting for water to boil: the tea can be made as soon as the water boils, whereas with a single-threaded sequence the water would have cooled by the time the preparations were finished.
Specifically, this embodiment obtains the backup system's RESTful interface and the parameter options related to data backup; this information can be found in the backup system's documentation or developer guide. The embodiment can then export the service data from HDFS by calling the GET /webhdfs/v1/<path>?op=OPEN interface to the local file system, and upload the exported service data to the backup system. To confirm that the service data has been backed up, the embodiment can query the backup status of the service data through the backup system's RESTful interface.
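The multithreaded concurrent backup can be sketched with a thread pool. The transfer itself is stubbed out here; a real implementation would perform the WebHDFS op=OPEN download and the upload to the backup system inside backup_file, and the backup-system API is not specified by this sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def backup_file(path):
    # Stub: a real version would GET /webhdfs/v1/<path>?op=OPEN and then
    # upload the bytes to the backup system, returning its reported status.
    return (path, "ok")

def concurrent_backup(paths, max_workers=4):
    # Back up snapshot files concurrently so that one blocked transfer
    # (network wait, slow disk) does not stall the remaining files.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(backup_file, paths))

results = concurrent_backup(["/s0/a.orc", "/s0/b.orc", "/s0/c.orc"])
```

Because each transfer is I/O-bound, threads overlap their waiting time, which is exactly the performance benefit step S31 claims.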
The function of step S31 is that multithreaded concurrent backup effectively prevents the performance loss caused by thread blocking and improves the efficiency of data backup.
In some possible embodiments, step S30, that is, backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system, includes:
s32, acquiring a backup record.
S33, based on the backup record, incremental backup is carried out on the service data.
The backup record includes a first data path, a snapshot file size, a snapshot last modification time, and the like.
Specifically, the client determines whether a service-data file needs to be backed up by comparing it against the metadata of all backed-up files recorded in the last backup: the first data path, the snapshot file size, and the snapshot's last-modification time. If all three are the same, the service-data file has not been modified and, having already been backed up, does not need to be backed up again. If, for the same first data path, either the snapshot file size or the last-modification time differs, the service-data file needs to be backed up, and its metadata is recorded; this is the incremental backup.
The function of step S32 and step S33 is to support incremental backup, reduce the occupation of storage space, and improve the backup efficiency.
In some possible embodiments, the backup record includes the first data path, the snapshot file size, and the snapshot last-modification time, and step S33, performing incremental backup of the service data based on the backup record, includes:
S331, if the first data path, the snapshot file size, and the snapshot last-modification time are all the same, recording only the metadata of the service-data file in which the service data is located.
S332, if, for the same first data path, either the snapshot file size or the snapshot last-modification time differs, backing up the service-data file in which the service data is located and recording the metadata of that file.
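The incremental-backup decision in steps S331 and S332 reduces to a comparison against the previous backup's record. A minimal sketch, with the record structure assumed to map file path to a (size, mtime) pair:

```python
def needs_backup(prev_record, path, size, mtime):
    """Decide whether a file must be copied in this incremental backup.

    prev_record maps file path -> (size, last-modification time) from the
    last backup. A file is re-copied only if it is new or has changed.
    """
    if path in prev_record and prev_record[path] == (size, mtime):
        return False   # S331: unchanged, record its metadata only
    return True        # S332: new or modified, copy it and record metadata

record = {"/s0/a.orc": (1024, 1700000000)}
unchanged = needs_backup(record, "/s0/a.orc", 1024, 1700000000)
modified = needs_backup(record, "/s0/a.orc", 2048, 1700000000)
```

Skipping unchanged files is what keeps the incremental backup's storage usage and transfer time proportional to the amount of changed data.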
The embodiment of the application provides a data recovery method for Hive, and the main flow of the method is described as follows:
referring to fig. 4, S40, a service data restoration request or a metadata restoration request is acquired.
S50, based on the service data recovery request, acquiring a data path of the service data stored in the HDFS, and recovering the service data in the backup system into the HDFS through a data interface provided by the WebHDFS.
In this embodiment, the path of the service data stored in HDFS is queried through a Hive command, and the service data in the backup system's data subdirectory is then restored to HDFS, multithreaded, through the data interface provided by WebHDFS; see fig. 5.
Specifically, the client determines whether a service-data file needs to be restored by comparing the metadata of all backed-up files recorded in the backup selected for restoration, including, for example, the file path, the file size, and the last-modification time. If, for the same file path, the file size and last-modification time are the same, the file has not been modified and is not restored; if either differs, the file has been modified and needs to be restored. This is the incremental restoration.
The function of step S50 is that the data recovery does not need extra HDFS storage space, and supports incremental recovery, reducing storage space occupation, and improving recovery efficiency.
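A minimal sketch of the multithreaded restore through WebHDFS follows. The namenode host, port, and the `upload` hook are illustrative assumptions; the URL shape is the standard WebHDFS `op=CREATE` request, whose first response redirects to a datanode that then receives the file content:

```python
# Hypothetical sketch of restoring backed-up files to HDFS through the
# WebHDFS REST interface using a thread pool, as in step S50.
from concurrent.futures import ThreadPoolExecutor

WEBHDFS_PREFIX = "/webhdfs/v1"

def create_url(host, port, hdfs_path):
    # WebHDFS two-step CREATE: this request is redirected to a datanode,
    # to which the file content is then PUT by the upload hook.
    return f"http://{host}:{port}{WEBHDFS_PREFIX}{hdfs_path}?op=CREATE&overwrite=true"

def restore_all(files, upload, workers=8):
    """Restore `files` (local backup path -> HDFS path) concurrently.

    `upload(url, local_path)` performs the actual HTTP PUT; it is an
    assumed hook so the sketch stays transport-agnostic.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload, create_url("nn-host", 9870, dst), src)
                   for src, dst in files.items()]
        return [f.result() for f in futures]
```

Only files that fail the metadata comparison described above would be passed into `restore_all`, which is what makes the restoration incremental.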
S60, based on the metadata recovery request, acquiring the HQL file in the backup system, and performing data processing on the metadata in the HQL file through a Hive algorithm to recover the metadata.
Specifically, referring to fig. 5, in this embodiment, the HQL file in the metadata subdirectory of the backup system is obtained, and the HQL statements in the file are executed in Hive to process the metadata. For example, the HQL statements for deleting the table, creating the table, and correcting the partition are executed in sequence through the Hive algorithm, and the partition correction associates the table where the metadata is located with the service data in the HDFS, thereby achieving metadata recovery.
The effect of step S60 is that the relational database storing the metadata is not accessed directly, ensuring that the metadata cannot be tampered with and improving the reliability and security of data recovery.
Referring to fig. 6, in some possible embodiments, step S60, i.e., performing data processing on the metadata in the HQL file through the Hive algorithm to achieve metadata recovery, includes:
S61, sequentially executing, through the Hive algorithm, the HQL statements recorded in the HQL file for deleting the table, creating the table, and correcting the partition, and associating the table where the metadata is located with the service data in the HDFS through the partition correction, so as to recover the metadata.
In this embodiment, the Hive algorithm sequentially executes the HQL statements for deleting the table, creating the table, and correcting the partition. Specifically, the delete-table statement first clears the metadata of the table, the create-table statement then restores the metadata of the table, and finally the correct-partition statement associates the table with the service data in the HDFS.
The function of step S61 is to support recovery at the database level and at the table level, and to improve compatibility, applicability, and reliability of data recovery.
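The S61 statement sequence can be sketched as follows. The `CREATE` statement is the one captured at backup time by `SHOW CREATE TABLE`; using `MSCK REPAIR TABLE` for the partition-correction step is an assumption (one common Hive idiom for re-associating partition metadata with data already present in HDFS), not a statement the patent names:

```python
# Hypothetical sketch of the delete-table / create-table / correct-partition
# sequence of step S61. `table` and `create_stmt` come from the backup.

def recovery_statements(table, create_stmt):
    return [
        f"DROP TABLE IF EXISTS {table}",  # clear the table's metadata
        create_stmt,                      # restore the table's metadata
        f"MSCK REPAIR TABLE {table}",     # correct partitions -> HDFS data
    ]

stmts = recovery_statements("db1.sales",
                            "CREATE EXTERNAL TABLE db1.sales (id INT) "
                            "PARTITIONED BY (dt STRING)")
```

Each statement would then be executed in order through a Hive session; because only HQL is executed, the relational metastore is never touched directly.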
According to the data backup and data recovery method for Hive, multithreaded concurrent backup effectively prevents the performance loss caused by thread blocking and improves data backup efficiency; incremental backup is supported, reducing the occupied storage space and improving backup efficiency. In addition, the data recovery in the present application requires no extra HDFS storage space and supports incremental recovery, reducing storage space occupation and improving recovery efficiency. The relational database storing the metadata does not need to be accessed directly, ensuring that the metadata cannot be tampered with and improving the reliability and security of data recovery. Furthermore, recovery at the database level and at the table level is supported, improving the compatibility, applicability, and reliability of data recovery.
In another embodiment of the present application, a data backup system for Hive is disclosed.
Referring to fig. 7, a data backup system for Hive, comprising:
the snapshot data module 10 is configured to acquire, based on a first storage path of the service data in the Hadoop distributed file system HDFS, a second storage path in the HDFS of the snapshot data generated from the service data, and to acquire the snapshot data of the service data from the second storage path;
the metadata backup file generating module 20 is configured to query the HQL statement of the Hive table creation corresponding to the service data and to generate a metadata backup file recording the HQL statement;
the service data and metadata backup module 30 is configured to backup the snapshot data and metadata backup file to the backup system by calling an interface provided by the backup system.
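The relation between the first and second storage paths handled by module 10 can be sketched using the HDFS snapshot convention, under which a snapshot of directory `<dir>` named `<name>` is exposed read-only at `<dir>/.snapshot/<name>`; the snapshot name below is an illustrative assumption:

```python
# Sketch of deriving the second storage path (the snapshot path) from the
# first storage path of the service data, per the HDFS .snapshot layout.

def snapshot_path(first_storage_path, snapshot_name):
    return f"{first_storage_path.rstrip('/')}/.snapshot/{snapshot_name}"

assert snapshot_path("/user/hive/warehouse/db1.db/t1/", "bak-2023-12-04") == \
    "/user/hive/warehouse/db1.db/t1/.snapshot/bak-2023-12-04"
```

Reading from the snapshot path rather than the live path gives the backup a consistent point-in-time view of the service data.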
Further, as shown in fig. 8, the metadata backup file generating module 20 includes:
the metadata backup sub-module 21, configured to back up the metadata in the relational database to the backup system through the SHOW CREATE TABLE command provided by Hive.
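A minimal sketch of sub-module 21 is given below. For each table, the HQL returned by `SHOW CREATE TABLE` is appended to the metadata backup file; `run_hql` is an assumed hook (e.g. a beeline invocation or a Hive client), so only the assembly of the file content is shown:

```python
# Hypothetical sketch of building the metadata backup file from the
# SHOW CREATE TABLE output of each table to be protected.

def build_metadata_backup(tables, run_hql):
    """Return the metadata backup file content as one HQL script.

    `run_hql(statement)` is an assumed hook returning the statement's
    textual result (here, the CREATE TABLE DDL).
    """
    lines = []
    for table in tables:
        create_stmt = run_hql(f"SHOW CREATE TABLE {table}")
        lines.append(create_stmt.rstrip(";") + ";")  # normalize terminator
    return "\n".join(lines)
```

Because the DDL is captured as plain HQL, restoration later needs only a Hive session, never direct access to the metastore database.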
Further, as shown in fig. 8, the service data and metadata backup module 30 includes:
the concurrency backup sub-module 31 is configured to perform concurrency backup on the snapshot data through a RESTful interface provided by WebHDFS and in a multithreading manner, and is configured to backup the service data to a backup system.
Further, as shown in fig. 8, the service data and metadata backup module 30 includes:
the backup record obtaining module 32 is configured to obtain a backup record.
The incremental backup module 33 is configured to perform incremental backup on the service data based on the backup record.
Further, as shown in fig. 8, the incremental backup module 33 includes:
the data recording sub-module 331, configured to record the metadata of the service data file where the service data is located if the first data path is the same and both the snapshot file size and the snapshot last modification time are the same.
The file backup sub-module 332, configured to back up the service data file where the service data is located and record the metadata of the service data file if the first data path is the same but either the snapshot file size or the snapshot last modification time is different.
In another embodiment of the present application, a data recovery system for Hive is disclosed.
Referring to fig. 9, a data recovery system for Hive, comprising:
an acquisition recovery request module 40, configured to acquire a service data recovery request or a metadata recovery request;
the service data recovery module 50 is configured to obtain a data path of the service data stored in the HDFS based on the service data recovery request, and recover the service data in the backup system to the HDFS through a data interface provided by the WebHDFS;
the metadata recovery module 60 is configured to obtain the HQL file in the backup system based on the metadata recovery request, and perform data processing on metadata in the HQL file by using Hive algorithm, so as to implement metadata recovery.
Further, as shown in fig. 10, the metadata recovery module 60 includes:
the association sub-module 61 is configured to sequentially execute the deletion table, the creation table, and the HQL statement of the correction partition recorded in the HQL file by using the Hive algorithm, and associate the table where the metadata is located with the service data in the HDFS by using the correction partition, so as to implement metadata recovery.
The data backup and data recovery system for Hive according to the present embodiment may implement the steps of the foregoing embodiments due to the functions of each module and the logic connection between each module, so that the same technical effects as those of the foregoing embodiments may be achieved, and the relevant description of the steps of the foregoing data backup and data recovery method for Hive may be found in principle analysis, which is not repeated herein.
For specific limitations on the data backup and data recovery system for Hive, reference may be made to the above limitations on the data backup and data recovery method for Hive, and no further description is given here. The various modules in the data backup and data recovery system for Hive described above may be implemented in whole or in part in software, hardware, and combinations thereof. The above modules may be embedded in hardware or independent of a processor in the device, or may be stored in software in a memory in the device, so that the processor may call and execute operations corresponding to the above modules.
In an embodiment, an electronic device is provided, which may be a monitoring terminal, and an internal structure diagram thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the electronic device is used to store the data involved in the data backup and data recovery method for Hive. The network interface of the electronic device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements the data backup and data recovery method for Hive.
In an embodiment, an electronic device is provided, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the computer program to implement the data backup and data recovery method for Hive of the above embodiment, for example, step S10 to step S30 shown in fig. 1 and step S40 to step S60 shown in fig. 4. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the data backup and data recovery system for Hive in the above embodiments, such as the functions of the modules 10 to 30 shown in fig. 7 and the functions of the modules 40 to 60 shown in fig. 9. To avoid repetition, no further description is provided here.
In an embodiment, a computer readable storage medium is provided, on which a computer program is stored, where the computer program when executed by a processor implements the data backup and data recovery method for Hive of the above embodiment, or where the computer program when executed by a processor implements the functions of each module/unit in the data backup and data recovery system for Hive of the above system embodiment. To avoid repetition, no further description is provided here.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program stored on a non-transitory computer-readable medium, the execution of which includes the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the system is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. A data backup method for Hive, comprising:
based on a first storage path, in a Hadoop Distributed File System (HDFS), of service data in a Hive table, acquiring a second storage path in the HDFS of snapshot data generated from the service data, and acquiring the snapshot data of the service data from the second storage path, wherein the Hive table comprises an internal table and an external table; when the internal table is created, the service data is moved to the path pointed to by the data warehouse, and when the internal table is deleted, the service data is deleted together with the metadata of the internal table; when the external table is created, only the path of the service data is recorded and the storage location of the service data is not changed, and when the external table is deleted, only the metadata of the external table is deleted and the service data is not deleted;
inquiring an HQL statement of a Hive build table corresponding to the service data through a SHOW CREATE TABLE command, and generating a metadata backup file recorded with the HQL statement;
backing up the snapshot data and the metadata backup file to a backup system by calling an interface provided by the backup system;
the backup of the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system comprises the following steps:
obtaining a backup record, wherein the backup record comprises a first data path, a snapshot file size and snapshot last modification time;
if the first data path is the same and both the snapshot file size and the snapshot last modification time are the same, recording the metadata of the service data file where the service data is located;
if the first data path is the same but either the snapshot file size or the snapshot last modification time is different, backing up the service data file where the service data is located and recording the metadata of the service data file.
2. The data backup method for Hive according to claim 1, wherein the backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system comprises:
and carrying out concurrent backup on the snapshot data by using a RESTful interface provided by the WebHDFS and adopting a multithreading mode, wherein the concurrent backup is used for backing up the business data into the backup system.
3. A data recovery method for Hive, applied to recovery of service data and metadata backed up by the data backup method for Hive according to claim 1 or 2, comprising:
acquiring a service data recovery request, acquiring a data path of the service data stored in an HDFS based on the service data recovery request, and recovering the service data in a backup system to the HDFS through a data interface provided by the WebHDFS, or
acquiring a metadata recovery request, acquiring an HQL file in the backup system based on the metadata recovery request, and sequentially executing, through a Hive algorithm, the HQL statements recorded in the HQL file for deleting a table, creating a table, and correcting a partition, thereby clearing the metadata of the Hive table through the delete-table HQL statement, recovering the metadata of the Hive table through the create-table HQL statement, and associating the table where the metadata is located with the service data in the HDFS through the correct-partition HQL statement, to achieve metadata recovery.
4. A data backup system for Hive, comprising:
the system comprises a snapshot data module for acquiring service data, a snapshot data module and a snapshot data processing module, wherein the snapshot data module is used for acquiring a second storage path of the snapshot data generated based on the service data in a Hadoop Distributed File System (HDFS) based on a first storage path of the service data in a Hive table, and acquiring the snapshot data of the service data from the second storage path, wherein the Hive table comprises an internal table and an external table; moving service data to a path pointed by a data warehouse when the internal table is created, and deleting the service data and metadata of the internal table together when the internal table is deleted; when the external table is created, only the path of the service data is recorded, and the storage position of the service data is not changed, and when the external table is deleted, only the metadata of the external table is deleted, and the service data is not deleted;
the metadata backup file generation module is used for inquiring the HQL statement of the Hive build table corresponding to the service data through a SHOW CREATE TABLE command and generating a metadata backup file recorded with the HQL statement;
the business data and metadata backup module is used for backing up the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system;
the backup of the snapshot data and the metadata backup file to the backup system by calling an interface provided by the backup system comprises the following steps:
obtaining a backup record, wherein the backup record comprises a first data path, a snapshot file size and snapshot last modification time;
if the first data path is the same and both the snapshot file size and the snapshot last modification time are the same, recording the metadata of the service data file where the service data is located;
if the first data path is the same but either the snapshot file size or the snapshot last modification time is different, backing up the service data file where the service data is located and recording the metadata of the service data file.
5. A data recovery system for Hive for recovering service data and metadata backed up by the data backup system for Hive according to claim 4, comprising:
the acquisition recovery request module is used for acquiring a service data recovery request or a metadata recovery request;
the service data recovery module is used for acquiring a data path of the service data stored in the HDFS based on the service data recovery request, and recovering the service data in the backup system into the HDFS through a data interface provided by the WebHDFS;
and the metadata recovery module, configured to acquire the HQL file in the backup system based on the metadata recovery request, to sequentially execute, through a Hive algorithm, the HQL statements recorded in the HQL file for deleting a table, creating a table, and correcting a partition, to clear the metadata of the Hive table through the delete-table HQL statement, to recover the metadata of the Hive table through the create-table HQL statement, and to associate the table where the metadata is located with the service data in the HDFS through the correct-partition HQL statement, so as to achieve metadata recovery.
6. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the data backup method for Hive according to claim 1 or 2 when executing the computer program and/or the processor implements the data recovery method for Hive according to claim 3 when executing the computer program.
7. A computer-readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the data backup method for Hive according to claim 1 or 2, and/or wherein the computer program when executed by a processor implements the data recovery method for Hive according to claim 3.
CN202311647374.8A 2023-12-04 2023-12-04 Data backup and data recovery method, system, equipment and medium for Hive Active CN117370080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311647374.8A CN117370080B (en) 2023-12-04 2023-12-04 Data backup and data recovery method, system, equipment and medium for Hive

Publications (2)

Publication Number Publication Date
CN117370080A CN117370080A (en) 2024-01-09
CN117370080B true CN117370080B (en) 2024-04-09

Family

ID=89402678


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157487A (en) * 2020-01-23 2021-07-23 华为技术有限公司 Data recovery method and apparatus thereof
CN113485999A (en) * 2021-08-04 2021-10-08 中国工商银行股份有限公司 Data cleaning method and device and server
CN113986616A (en) * 2021-11-02 2022-01-28 浪潮云信息技术股份公司 Method and system suitable for Hive data warehouse to perform data backup and recovery
CN114328020A (en) * 2021-12-28 2022-04-12 苏州浪潮智能科技有限公司 Data backup method and related device for cluster file system
CN116610498A (en) * 2023-07-14 2023-08-18 深圳市木浪云科技有限公司 Data backup and recovery method, system, equipment and medium based on object storage




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant