WO2022151593A1 - Data recovery method, apparatus and device, medium and program product - Google Patents

Data recovery method, apparatus and device, medium and program product

Info

Publication number
WO2022151593A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
node
data
recovery
restored
Prior art date
Application number
PCT/CN2021/084393
Other languages
French (fr)
Chinese (zh)
Inventor
阿瓦鲁卡纳卡库马尔
库马尔潘卡吉
智伟
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Priority to CN202180004239.4A (CN115087966A)
Publication of WO2022151593A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the embodiments of the present application relate to the technical field of databases, and in particular, to a data recovery method, apparatus, device, medium, and program product.
  • A distributed database, such as an HBase database, usually includes a master node and at least one partition server (Region Server, RS) node.
  • The master node allocates to each RS node the partitions (regions) that the RS node is responsible for, and each RS node may be allocated one or more partitions.
  • Each RS node executes the corresponding data write process according to the data write requests it receives.
  • When an RS node writes data, it can first write the data write request into the write-ahead log (WAL) and, after that write succeeds, insert the data to be written into the memory of the RS node. When the amount of data in the memory of the RS node reaches a preset threshold, the RS node can persistently store the data in the memory.
  • When the master node detects that an RS node has failed (for example, the RS node is powered off or restarted for maintenance), the master node restores the unpersisted data of that RS node according to the WAL.
  • However, the RS node generates a large number of files in the process of restoring data. Transferring these files from the RS node to the file system for persistent storage requires a large number of interactions between the RS node and the file system, resulting in low data recovery efficiency and high resource consumption on the RS node.
  • the embodiments of the present application provide a data recovery method, so as to improve the data recovery efficiency of the RS node and reduce resource consumption.
  • the present application also provides corresponding apparatuses, devices, computer-readable storage media, and computer program products.
  • The embodiment of the present application provides a data recovery method. Specifically, after the server crash procedure (SCP) is started, the WAL file created by the first RS node in the distributed database before its failure is obtained; the WAL file includes data processing records of multiple partitions to be restored belonging to the first RS node. The data processing records belonging to each partition to be restored in the obtained WAL file are then written into a recovery file, where different data recording areas in the recovery file record the data processing records of different partitions to be restored, so that the recovery file holding the data processing records can be transferred to the file system for persistent storage.
  • The data recovery method for the multiple partitions to be recovered of the first RS node may be performed on the first RS node after failure recovery; that is, after the first RS node fails and resumes operation, it may perform the data recovery process itself.
  • The data recovery method may also be performed by other devices in the distributed database. The other devices include a second RS node in the distributed database that is not faulty, where the second RS node represents any RS node that is not faulty; alternatively, the other device may be a dedicated device in the distributed database for performing fault data recovery, which may be pre-configured in the distributed database by a technician.
  • the recovery file includes index information, where the index information is used to indicate the position offset in the recovery file of the data recording area corresponding to each partition to be recovered.
  • The index information is specifically used to indicate the position offsets in the recovery file of the sub-files corresponding to the multiple partitions to be recovered. The sub-files store the data processing records of the partitions to be recovered, and different sub-files store different data processing records.
  • The index information can be used to determine the sub-files belonging to each partition to be recovered in the recovery file, so that the data processing records in those sub-files can be used to recover the data of that partition.
  • Although a large number of sub-files are generated during the data recovery process, packaging multiple sub-files into one recovery file reduces the number of files transferred to the file system. This reduces the IO of transferring files from the first RS node to the file system, reduces resource consumption, and increases the scalability of the distributed database.
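  • The role of the index information can be illustrated with a minimal sketch. The byte layout used here (a 4-byte index length, a JSON index, then the data recording areas) and the names `build_recovery_file` and `read_partition` are assumptions made for illustration; the application does not specify a layout. Records are assumed not to contain newline bytes.

```python
import json
import struct

def build_recovery_file(records_by_partition: dict[str, list[bytes]]) -> bytes:
    """Pack per-partition records into one blob: [index length][index JSON][data areas]."""
    index = {}
    body = bytearray()
    for partition, records in records_by_partition.items():
        area = b"\n".join(records)
        # The offset is relative to the start of the data section.
        index[partition] = {"offset": len(body), "length": len(area)}
        body += area
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(index_bytes)) + index_bytes + bytes(body)

def read_partition(recovery_file: bytes, partition: str) -> list[bytes]:
    """Use the index to read only the data recording area of one partition."""
    (index_len,) = struct.unpack(">I", recovery_file[:4])
    index = json.loads(recovery_file[4:4 + index_len])
    entry = index[partition]
    start = 4 + index_len + entry["offset"]
    area = recovery_file[start:start + entry["length"]]
    return area.split(b"\n") if area else []

blob = build_recovery_file({
    "region-1": [b"put k1=v1", b"put k2=v2"],
    "region-2": [b"put k9=v9"],
})
```

  • With such an index, a reader can seek directly to the area of one partition instead of scanning the whole recovery file.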
  • When the recovery file is used to store the data processing records, it is not necessary to generate a sub-file for the data processing records of each partition to be recovered; the data processing records may be stored directly in the recovery file. In this way, the number of files to be created during data recovery based on the WAL file can be reduced, thereby reducing the work of creating, moving, and deleting files and improving data recovery efficiency.
  • Before the recovery file is transferred to the file system, it can be stored in a cache, so that the recovery file stored in the cache can be used to provide services, such as reading and writing data, to the clients accessing the distributed database.
  • In this way, when the distributed database provides services, it can obtain the recovery file without reading the file system remotely, thereby reducing the consumption of remote bandwidth.
  • After the recovery file is transferred to the file system, the recovery file stored in the cache is cleared. In this way, the cache resources are released, and long-term occupation of the cache during the data recovery process is avoided as far as possible.
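  • The caching behaviour described above might be sketched as follows. The class `RecoveryCache` and its method names are hypothetical, and a plain dictionary stands in for the remote file system; this is an illustration under those assumptions, not the implementation of the application.

```python
class RecoveryCache:
    """Holds recovery files in memory until they are persisted."""

    def __init__(self, file_system: dict[str, bytes]):
        self.file_system = file_system  # stand-in for the remote file system
        self.cache: dict[str, bytes] = {}

    def store(self, name: str, content: bytes) -> None:
        self.cache[name] = content

    def read(self, name: str) -> bytes:
        # Serve from the cache when possible to avoid a remote read.
        if name in self.cache:
            return self.cache[name]
        return self.file_system[name]

    def persist(self, name: str) -> None:
        # Transfer to the file system, then clear the cached copy.
        self.file_system[name] = self.cache[name]
        del self.cache[name]
```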
  • the data recovery instruction sent by the master node may be obtained first, and the data recovery process is implemented under the instruction of the data recovery instruction.
  • the data recovery instruction is used to instruct to perform data recovery on a plurality of partitions to be recovered in the first RS node. For example, after detecting that the first RS node is faulty, the master node may issue the data recovery instruction to the first RS node.
  • The format of the recovery file is an archive file format, such as the "*.har" format, the "*.tar" format, and so on.
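  • As an illustration of the archive-format option, the following sketch packs per-partition sub-files into one in-memory tar archive using Python's standard `tarfile` module. The member names (such as "region-1/edits") and the function names are hypothetical; the application does not prescribe how sub-files are named.

```python
import io
import tarfile

def pack_subfiles(subfiles: dict[str, bytes]) -> bytes:
    """Pack per-partition sub-files into a single in-memory .tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in subfiles.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()

def read_subfile(archive_bytes: bytes, name: str) -> bytes:
    """Extract one sub-file from the archive by member name."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        member = archive.extractfile(name)
        return member.read()

packed = pack_subfiles({"region-1/edits": b"a", "region-2/edits": b"bc"})
```

  • Only one file (the archive) then has to be transferred, however many sub-files it contains.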
  • The present application provides a data recovery apparatus. The apparatus includes: an acquisition module, configured to acquire, after the server crash procedure (SCP) is started, the write-ahead log (WAL) file created by the first partition server (RS) node in the distributed database before its failure, where the WAL file includes data processing records belonging to a plurality of partitions to be restored in the first RS node; a writing module, configured to write the data processing records of each partition to be restored in the WAL file into a recovery file, where different data recording areas in the recovery file record data processing records belonging to different partitions to be recovered; and a transmission module, configured to transmit the recovery file to a file system for persistent storage.
  • The apparatus is applied to the first RS node after failure recovery, or to another device in the distributed database, where the other device is a second RS node in the distributed database that is not faulty, or a dedicated device in the distributed database for performing fault data recovery.
  • the recovery file includes index information, where the index information is used to indicate the position offset of the data recording area corresponding to each partition to be recovered in the recovery file.
  • The index information is specifically used to indicate the position offsets in the recovery file of the sub-files corresponding to the multiple partitions to be recovered; the sub-files store the data processing records of the partitions to be recovered, and different sub-files store different data processing records.
  • The apparatus further includes: a storage module, configured to store the recovery file in a cache before the recovery file is transmitted to the file system; and a service module, configured to use the recovery file stored in the cache to provide services to clients of the distributed database.
  • the apparatus further includes: a data clearing module, configured to clear the restored file stored in the cache after transferring the restored file to the file system.
  • The acquiring module is further configured to acquire a data recovery instruction sent by the master node, where the data recovery instruction is used to instruct data recovery on multiple partitions to be recovered in the first RS node.
  • the present application provides a computing device including a processor, a memory and a display.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute the instructions stored in the memory to cause the computing device to perform the data recovery method as in the first aspect or any one of the implementations of the first aspect.
  • The present application provides a computer-readable storage medium storing instructions that, when run on a computing device, cause the computing device to perform the data recovery method described in the first aspect or any implementation of the first aspect.
  • The present application provides a computer program product containing instructions which, when run on a computing device, enable the computing device to execute the data recovery method described in the first aspect or any implementation of the first aspect.
  • The implementations provided in the above aspects may be further combined to provide more implementations.
  • FIG. 1 is a schematic diagram of the architecture of an exemplary distributed database of the present application;
  • FIG. 2 is a schematic diagram of splitting the data processing records belonging to different partitions in a WAL file;
  • FIG. 3 is a schematic diagram of the relationship between the MTTR of the distributed database 100 and the number of RS nodes;
  • FIG. 4 is a schematic diagram of storing different data processing records in the WAL file in different data recording areas in the recovery file;
  • FIG. 5 is a schematic flowchart of a data recovery method according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the correspondence between the partitions to be restored and the data recording areas in the recovery file;
  • FIG. 7 is a schematic flowchart of another data recovery method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data recovery apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a hardware structure of a computing device according to an embodiment of the present application.
  • FIG. 1 is a schematic diagram of the architecture of an exemplary distributed database 100.
  • The distributed database 100 includes a master node 101, an RS node 102 and an RS node 103.
  • the master node 101 is used to divide the data stored and managed by the distributed database 100 to obtain multiple partitions, each partition includes one or more pieces of data, and the data belonging to different partitions are usually different.
  • Part of the content of each piece of data may be used as the primary key of that piece of data, and the primary key uniquely identifies the piece of data in the distributed database 100. The master node 101 can therefore divide the possible value range of the primary key into intervals, with each interval corresponding to one partition.
  • For example, the master node 101 can divide the value range of the primary key into 100 intervals, namely [0, 10000), [10000, 20000), ..., [980000, 990000), [990000, 1000000]. Each partition can be used to store 10,000 pieces of data; correspondingly, based on the 100 partitions, the distributed database 100 can store and manage 1 million pieces of data.
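  • The interval division in this example can be expressed as a short lookup. This is a sketch only: the constant and function names are hypothetical, and integer primary keys in the range 0 to 1,000,000 are assumed, matching the intervals listed above.

```python
INTERVAL = 10_000
NUM_PARTITIONS = 100

def partition_for(primary_key: int) -> int:
    """Return the 0-based index of the interval containing the primary key."""
    if not 0 <= primary_key <= INTERVAL * NUM_PARTITIONS:
        raise ValueError("primary key outside the partitioned range")
    # The last interval is closed on the right, so clamp 1,000,000 into it.
    return min(primary_key // INTERVAL, NUM_PARTITIONS - 1)

print(partition_for(9_999))    # 0
print(partition_for(10_000))   # 1
print(partition_for(990_000))  # 99
```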
  • The master node 101 is also used for allocating partitions to the RS node 102 and the RS node 103, and the partitions allocated to each RS node can be maintained through the management table created by the master node 101.
  • The RS node 102 and the RS node 103 execute the data read and write services of different partitions. As shown in FIG. 1, the RS node 102 executes the data read and write services of partitions 1 to N, while the RS node 103 executes the data read and write services of partitions N+1 to M. It is worth noting that FIG. 1 uses a distributed database 100 with one master node 101 and two RS nodes as an example. In other possible distributed databases 100, the number of master nodes 101 and RS nodes may be any value, which is not limited in this application.
  • The master node 101, the RS node 102 and the RS node 103 can all be implemented by hardware or software.
  • Both the master node 101 and the multiple RS nodes may be physical servers in the distributed database 100. That is, during actual deployment, at least one server in the distributed database 100 may be configured as the master node 101, and other servers in the distributed database 100 may be configured as RS nodes.
  • The master node 101 and each RS node may be processes or virtual machines running on one or more devices (e.g., servers).
  • The distributed database 100 can be used as a local resource to provide local data read and write services to clients accessing the distributed database 100 through the master node 101, the RS node 102 and the RS node 103.
  • The distributed database 100 can also be deployed in the cloud.
  • The master node 101, the RS node 102 and the RS node 103 can provide cloud services for reading and writing data to clients accessing the cloud.
  • The distributed database 100 may be connected to the client 200 and the file system 300 respectively; for example, the connection may use a network communication protocol such as HyperText Transfer Protocol (HTTP).
  • The client 200 can send a data write request to the RS node 102. The data write request carries the data to be written into the distributed database 100 (hereinafter referred to as the data to be written) and the corresponding data processing operation, such as a write operation, a modification operation, etc.
  • The RS node 102 may first generate a corresponding data processing record based on the data to be written and the data processing operation in the data write request, and write the data processing record into a pre-created WAL file. After determining that the WAL file is successfully written, the RS node 102 persistently stores the WAL file in the file system 300; for example, the file system 300 may use a data structure such as log-structured merge trees (LSM trees) to store WAL files. At the same time, the RS node 102 inserts the data to be written in the data processing record into the memory 1021 of the RS node 102.
  • Specifically, the RS node 102 can first determine the primary key corresponding to the data processing record and, according to the partition interval to which the value of the primary key belongs, determine the partition into which the data processing record is to be written, so that the RS node 102 can insert the data to be written in the data processing record into the storage area corresponding to that partition in the memory 1021. The RS node 102 can then report to the client 200 that the data was written successfully.
  • The RS node 102 writes data into the memory 1021 for one or more clients 200, so the amount of data temporarily stored in the memory 1021 increases continuously.
  • When the amount of data in the memory 1021 reaches a preset threshold, the RS node 102 can persistently store the data in the memory 1021 to the file system 300; specifically, the data can be written to the file system 300 in the form of files.
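  • The write path described above (log first, then memory, then flush on a threshold) can be sketched as follows. The class `RegionServer`, its fields, and the use of an entry count as the threshold are assumptions for illustration; in-memory lists stand in for the WAL file and the file system.

```python
class RegionServer:
    def __init__(self, flush_threshold: int):
        self.wal: list[tuple[str, str, str]] = []    # stand-in for the WAL file
        self.memory: dict[str, dict[str, str]] = {}  # partition -> key -> value
        self.persisted: list[dict[str, dict[str, str]]] = []  # stand-in for the file system
        self.flush_threshold = flush_threshold

    def write(self, partition: str, key: str, value: str) -> str:
        # 1. Log the operation first so that it survives a crash.
        self.wal.append((partition, key, value))
        # 2. Only then insert into the in-memory area of the partition.
        self.memory.setdefault(partition, {})[key] = value
        # 3. Flush when the amount of in-memory data reaches the threshold.
        if sum(len(area) for area in self.memory.values()) >= self.flush_threshold:
            self.persisted.append(self.memory)
            self.memory = {}
        return "ok"
```

  • If the node fails between steps 2 and 3, the entries still held in memory are lost, but they remain recoverable from the log written in step 1.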
  • the file system 300 may be, for example, a distributed file system (distributed file system, DFS), a Hadoop distributed file system (hadoop distributed file system, HDFS), etc., which is not limited in this embodiment.
  • The RS node 102 is also configured with a partition storage file (region store file) for each partition. After persistently storing the data in the file system 300, the RS node 102 can add the files that store each partition's data in the file system 300 to the partition storage file corresponding to that partition; specifically, the file name corresponding to each piece of data in the partition is added under the directory of the partition storage file. The RS node 102 may then merge and delete the files included in each partition storage file, so as to eliminate old-version data in the partition storage file.
  • For example, the partition storage file may include file a corresponding to data A and file b corresponding to data B, where data B is a newer version of data A. When the RS node 102 merges the partition storage files, it may merge file a and file b into file b; that is, only file b, corresponding to the new version of the data, is retained.
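  • The merge in this example might look like the sketch below. Modelling each file as a mapping from key to a (version, value) pair is an assumption made for illustration, as is the function name `compact`.

```python
def compact(files: list[dict[str, tuple[int, str]]]) -> dict[str, tuple[int, str]]:
    """Merge store files, keeping only the newest version of each key.

    Each file maps key -> (version, value); a higher version is newer.
    """
    merged: dict[str, tuple[int, str]] = {}
    for store_file in files:
        for key, (version, value) in store_file.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    return merged

file_a = {"row1": (1, "A")}  # old version
file_b = {"row1": (2, "B")}  # new version
print(compact([file_a, file_b]))  # {'row1': (2, 'B')}
```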
  • During operation, the RS nodes in the distributed database 100 will inevitably suffer failures such as power failure or maintenance restart, which cause the data in the memory of the RS node to be lost. Therefore, when the master node 101 detects that the RS node 102 has failed, it usually starts a process called the server crash procedure (SCP) to perform data recovery for the RS node 102, specifically to recover the data lost from the memory 1021 of the RS node 102. The SCP identifies all the WAL files belonging to the failed RS node 102 in the file system 300 and designates the normally running RS node 103 to split the data processing records in each WAL file.
  • As shown in FIG. 2, the RS node 103 splits out the data processing records in the WAL file that belong to different partitions of the RS node 102, and then creates a separate recovery file for each partition to save the data processing records in the WAL file that belong to that partition. The data processing records in the recovery file are then used to recover the data in the partition (for example, by replaying the data processing operations), so that services can continue to be provided to the client 200 based on the recovered data.
  • the RS node 103 persistently stores the recovery files corresponding to each partition to the file system one by one.
  • When transferring each recovery file, the RS node 103 first sends a notification message to the file system 300 to notify it that a file transfer is required; after receiving the feedback from the file system 300, the RS node 103 transfers the recovery file to the file system 300; finally, after determining that the recovery file has been successfully transferred, the RS node 103 sends a close notification to the file system 300 to end the transfer of the recovery file.
  • Normally, the RS node 103 generates N recovery files for each WAL file to record the data belonging to the N partitions in the WAL file. If the number of WAL files belonging to the faulty RS node 102 is M, then when performing data recovery for the RS node 102, the RS node 103 needs to create (M*N) recovery files. The RS node 103 therefore needs to perform the above file transfer process sequentially for (M*N) recovery files.
  • If the distributed database 100 performs data recovery for P RS nodes, the total number of recovery files that need to be generated and transmitted to the file system 300 is (M*N*P).
  • Because the RS node 103 needs to perform the above notification, transmission, and closing steps for each recovery file, the input/output (IO) of the RS node 103 for transferring files to the file system 300 is relatively large, and the system calls and resource consumption of the file system 300 are relatively high.
  • a large number of interaction processes between the RS node 103 and the file system 300 will also lead to low data recovery efficiency of the distributed database 100 .
  • As the number of RS nodes increases, the mean time to recover (MTTR) of the distributed database 100 increases (for example, approximately exponentially), which limits the scalability of the distributed database 100.
  • For example, the data recovery test results for 100 RS nodes show that the relationship between the MTTR of the distributed database 100 and the number of RS nodes it includes is as shown in FIG. 3: the growth is approximately exponential, and the scalability of the distributed database 100 is low.
  • the embodiments of the present application provide a data recovery method, so as to improve data recovery efficiency and reduce resource consumption.
  • Specifically, the RS node 103 first obtains the WAL file created by the RS node 102 before the failure. The WAL file includes data processing records of N partitions to be restored belonging to the RS node 102, where each partition to be restored is a partition whose data in the memory 1021 of the failed RS node 102 was lost.
  • The RS node 103 may then write the data processing records of the respective partitions to be restored in each WAL file into a recovery file. The recovery file includes a plurality of data recording areas, and different data recording areas record the data processing records of different partitions to be restored, as shown in FIG. 4.
  • the RS node 103 persistently stores the recovery file integrated with the data processing records in one or more WAL files to the file system 300 .
  • the data of each partition lost in the memory 1021 of the RS node 102 is the data to be written in the data processing records belonging to the partition in each recovery file.
  • In this way, the number of recovery files to be transmitted can be reduced from (M*N) to M (or to a value less than M). This effectively reduces the number of files that need to be transferred to the file system 300 during data recovery, which reduces the IO of the RS node 103 for transferring files to the file system 300, reduces resource consumption, and increases the scalability of the distributed database 100. At the same time, the data recovery efficiency of the distributed database 100 also improves because the number of interactions between the RS node 103 and the file system 300 is reduced.
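  • A small worked example makes the reduction concrete. The values M = 10 and N = 100 are illustrative and not taken from the application, and the count of three interactions per file simply reflects the notify, transfer, and close steps described earlier; the function name is hypothetical.

```python
STEPS_PER_FILE = 3  # notify, transfer, close (one of each per recovery file)

def recovery_files(wal_files: int, partitions: int, merged: bool) -> int:
    """Number of recovery files sent to the file system for one failed node."""
    # Without merging: one file per partition per WAL file (M * N).
    # With merging: at most one file per WAL file (M).
    return wal_files if merged else wal_files * partitions

M, N = 10, 100  # illustrative values, not from the application
before = recovery_files(M, N, merged=False)
after = recovery_files(M, N, merged=True)
print(before, after)                                    # 1000 10
print(before * STEPS_PER_FILE, after * STEPS_PER_FILE)  # 3000 30
```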
  • FIG. 5 is a schematic flowchart of a data recovery method in an embodiment of the present application.
  • This method can be applied to any RS node that can operate normally in the distributed database 100 shown in FIG. 1, including the RS node 103 and the RS node 102.
  • For example, when the RS node 102 resumes operation after restarting, it can itself recover the data lost from the memory 1021 according to the WAL file.
  • the method may also be applied to a device separately configured in the distributed database 100 and specifically for performing fault data recovery. This embodiment does not limit this.
  • the data recovery method shown in FIG. 5 may specifically include:
  • S501 The master node 101 determines that the RS node 102 is faulty, and instructs the RS node 103 to perform data recovery.
  • Each RS node in the distributed database 100 may periodically send a heartbeat message to the master node 101.
  • When the master node 101 receives the heartbeat message sent by the RS node 102 as expected, it can determine that the RS node 102 is not faulty; when the master node 101 does not receive the heartbeat message from the RS node 102, it can determine that the RS node 102 is faulty.
  • Alternatively, when the RS node 102 in the distributed database 100 fails without losing its ability to communicate with the master node 101 (for example, the RS node restarts abnormally), the RS node 102 can send a failure notification to the master node 101 to notify it of the failure.
  • the specific implementation manner of how the master node 101 implements the detection of the faulty RS node is not limited.
  • When the master node 101 determines that an RS node is faulty, it can start the SCP, identify the failed RS node 102, and instruct another RS node that has not failed, namely the RS node 103 in FIG. 1, to perform the corresponding data recovery procedure.
  • the master node 101 may send a data recovery instruction to the RS node 103, so as to use the data recovery instruction to instruct the RS node 103 to perform data recovery on multiple partitions to be recovered in the RS node 102.
  • the master node 101 may instruct the other RS nodes to perform the data recovery process.
  • the master node 101 may instruct the RS node with the least load to perform the data recovery process according to the load of each non-faulty RS node.
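  • Step S501 can be sketched as two small functions: one that flags nodes whose heartbeat is overdue, and one that picks the least-loaded healthy node. The timeout value, the load metric, and all names here are assumptions for illustration; the application leaves the detection mechanism open.

```python
def find_failed_nodes(last_heartbeat: dict[str, float], now: float,
                      timeout: float) -> list[str]:
    """Nodes whose most recent heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout)

def pick_recovery_node(load: dict[str, float], failed: list[str]) -> str:
    """Choose the least-loaded node among those that have not failed."""
    healthy = {node: value for node, value in load.items() if node not in failed}
    return min(healthy, key=healthy.get)

failed = find_failed_nodes({"rs102": 10.0, "rs103": 58.0, "rs104": 59.0},
                           now=60.0, timeout=30.0)
chosen = pick_recovery_node({"rs102": 0.1, "rs103": 0.7, "rs104": 0.4}, failed)
print(failed, chosen)  # ['rs102'] rs104
```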
  • the RS node 103 obtains a WAL file created by the RS node 102 before the failure occurs, where the WAL file includes data processing records belonging to multiple partitions to be restored in the RS node 102.
  • The master node 101 may send the identifier of the failed RS node 102 to the RS node 103, for example, the identity (ID) or factory serial number of the RS node 102, to indicate to the RS node 103 which RS node data recovery is to be performed for. The RS node 103 can then obtain the WAL files belonging to the faulty RS node 102 by accessing the file system 300.
  • Since the WAL file was created by the RS node 102 before the failure, it records the data written to each partition in the RS node 102, so the RS node 103 can use the obtained WAL file to perform data recovery on each partition of the RS node 102 (for ease of distinction, a partition in the RS node 102 is hereinafter referred to as a partition to be recovered). The number of WAL files belonging to the RS node 102 may be one or more.
  • the file system 300 may create a folder for each RS node, and add the WAL file created by the RS node to the folder corresponding to the RS node.
  • When the RS node 103 obtains the WAL file created by the RS node 102, it can access the folder corresponding to the RS node 102 in the file system 300 to obtain the required WAL file.
  • The RS node 103 may also obtain the WAL file in other ways.
  • For example, the file name of a WAL file may include the identifier of the RS node (such as the name of the RS node), so that the WAL files carrying the identifier of the RS node 102 can be found in the file system 300.
  • the specific implementation manner for the RS node 103 to acquire the WAL file is not limited.
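  • Both lookup strategies (a per-node folder, or the node identifier embedded in the file name) can be sketched with the standard `pathlib` module. The ".wal" extension, the "<node id>.<sequence>.wal" naming pattern, and the function name are hypothetical, and a local directory stands in for the file system 300.

```python
from pathlib import Path

def find_wal_files(wal_root: Path, node_id: str) -> list[Path]:
    """Collect the WAL files of one RS node by folder or by file-name prefix."""
    folder = wal_root / node_id
    if folder.is_dir():
        # Strategy 1: the file system keeps one folder per RS node.
        return sorted(folder.glob("*.wal"))
    # Strategy 2: the node identifier is embedded in the file name.
    return sorted(wal_root.glob(f"{node_id}.*.wal"))
```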
  • the RS node 103 writes the data processing records belonging to each to-be-restored partition in the WAL file into the restoration file, wherein different data recording areas in the restoration file record data processing records belonging to different to-be-restored partitions.
  • The RS node 103 can determine, according to the WAL file, whether data loss occurred in the memory 1021 of the RS node 102 and, when it determines that data was lost, restore the lost data of the memory 1021 according to the WAL file.
  • For example, the RS node 103 can determine whether the acquired WAL file contains data processing records without a persistent mark. If so, the data to be written in those data processing records had not been persisted from the memory 1021 to the file system 300, so the RS node 103 can restore the data in the memory 1021 based on the data processing records without persistent marks.
  • In the WAL files obtained by the RS node 103, some data processing records may be irrelevant to the data held in the memory 1021 before the failure of the RS node 102. For example, the data to be written in some data processing records is an old version, while the data in the memory 1021 of the RS node 102 before the failure was the new version; or the data to be written in some data processing records belongs to a partition that has been deleted from the RS node 102.
  • The RS node 103 can therefore also filter the data processing records in the acquired WAL files to reduce the amount of WAL data involved in the data recovery calculation, thereby reducing the amount of computation required during data recovery and reducing resource consumption.
  • When the RS node 103 filters out old-version data processing records, it can collect the data processing records in the WAL file that have the same primary key and use the timestamp of each record to determine which are old versions, so that the old-version records are filtered out and the new-version records are retained.
  • When the RS node 103 filters out the data of deleted partitions, it can check whether the partition identifier (such as the partition name) in each data processing record in the WAL file matches the partition identifier of any partition currently to be restored. If the partition identifier of a data processing record matches none of them, it is usually the identifier of a partition that has been deleted from the RS node 102, and the RS node 103 filters out that data processing record.
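  • The two filters described above can be sketched as follows. This is a minimal Python illustration, not the patented implementation: the `Record` type, its field names, and the `filter_records` function are hypothetical, and the sketch assumes a larger timestamp always means a newer version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    partition_id: str  # identifier of the partition the record belongs to
    key: str           # primary key of the written row
    timestamp: int     # write time; larger is assumed to mean newer
    value: str         # data to be written


def filter_records(records, partitions_to_restore):
    """Drop records of deleted partitions and keep the newest record per key."""
    newest = {}
    for rec in records:
        # An identifier matching no to-be-restored partition is treated as deleted.
        if rec.partition_id not in partitions_to_restore:
            continue
        slot = (rec.partition_id, rec.key)
        # Among records with the same primary key, keep the latest timestamp.
        if slot not in newest or rec.timestamp > newest[slot].timestamp:
            newest[slot] = rec
    return list(newest.values())
```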
  • The data processing records in the WAL file that were written by the RS node 102 may belong to different to-be-restored partitions of the RS node 102; before the failure occurred, the RS node 102 used the data belonging to these different partitions to provide the corresponding data read (and write) services to the clients of the different partitions.
  • Therefore, when the RS node 103 performs data recovery, it can split the data processing records in the WAL file and determine the data processing records belonging to each to-be-restored partition.
  • For example, the data processing records in the WAL file may exist in the form of key-value (KV) pairs, and different key-value pairs may belong to different to-be-restored partitions.
  • The key in a key-value pair may indicate the to-be-restored partition, for example by carrying the identifier of that partition; the value in the key-value pair is a data processing record belonging to that partition.
  • When the RS node 103 splits the data processing records of each WAL file, it can read each key-value pair in the WAL file and determine, from the key, the to-be-restored partition to which the value belongs.
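  • The splitting step can be sketched in a few lines of Python. The sketch assumes, for illustration only, that a WAL file has already been parsed into `(partition_id, record)` pairs; the function name `split_by_partition` is hypothetical.

```python
from collections import defaultdict


def split_by_partition(wal_entries):
    """Group WAL key-value pairs by the to-be-restored partition named in the key."""
    groups = defaultdict(list)
    for partition_id, record in wal_entries:
        # Records keep their original WAL order within each partition.
        groups[partition_id].append(record)
    return dict(groups)
```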
  • The RS node 103 may store the data processing records belonging to the different to-be-restored partitions in the same file (hereinafter referred to as the recovery file); that is, the records that the RS node 103 obtains by splitting one or more WAL files are written into one recovery file.
  • The recovery file includes multiple non-overlapping data recording areas. Each data recording area records the data processing records belonging to one to-be-restored partition, and different data recording areas record the data processing records of different to-be-restored partitions.
  • the recovery file may include index information, and the index information may be used to indicate the position offset of the data recording area corresponding to each partition to be recovered in the recovery file.
  • the position space between two adjacent position offsets indicated in the index information can be used as a data recording area for recording data processing records belonging to the same partition to be restored.
  • the offset may be, for example, the head address of the data recording area or the like.
  • For example, the correspondence between the to-be-restored partitions and the data recording areas in the recovery file can be as shown in FIG. 6: the data processing records belonging to the to-be-restored partition f1 are stored in the data recording area [logical address A, logical address B); those belonging to the to-be-restored partition f2 are stored in [logical address B, logical address C); those belonging to the to-be-restored partition f3 are stored in [logical address C, logical address D); and those belonging to the to-be-restored partition f4 are stored in [logical address D, logical address E].
  • Accordingly, the index information in the recovery file may include four key-value pairs: the key in key-value pair 1 is the identifier of the to-be-restored partition f1 and its value is the logical address A, and so on for the remaining key-value pairs.
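  • The layout of data recording areas and index information can be illustrated with an in-memory sketch. The Python below is an assumption-laden illustration: the recovery file is modeled as a byte string, the index as a dictionary from partition identifier to start offset, and the names `build_recovery_file` and `read_area` are hypothetical. The area of a partition runs from its own offset to the offset of the next partition in layout order.

```python
def build_recovery_file(groups):
    """Concatenate per-partition record bytes and return (body, index).

    index maps each partition id to the start offset of its data recording area.
    """
    body = bytearray()
    index = {}
    for partition_id, records in groups.items():
        index[partition_id] = len(body)
        for rec in records:
            body += rec
    return bytes(body), index


def read_area(body, index, partition_id):
    """Return the bytes of one partition's area, i.e. [start, next start)."""
    ids = list(index)  # dict insertion order equals layout order
    pos = ids.index(partition_id)
    start = index[partition_id]
    end = index[ids[pos + 1]] if pos + 1 < len(ids) else len(body)
    return body[start:end]
```

  • Looking up the next entry by layout order, rather than by the next larger offset, keeps the lookup correct even when a partition has an empty area.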
  • In this way, the to-be-restored partition to which a piece of data belongs can be determined from the position of the data in the recovery file, so data processing records belonging to different to-be-restored partitions can be distinguished without creating a separate file for each to-be-restored partition.
  • This reduces the number of files that the RS node 103 needs to create when performing data recovery based on the WAL file, which in turn reduces file creation, movement, and deletion and improves data recovery efficiency.
  • In the above embodiment, each data recording area in the recovery file is used directly to store data processing records. In other possible embodiments, the RS node 103 can instead create a separate sub-file for each to-be-restored partition, where the sub-file of a partition records the data processing records in the WAL file that belong to that partition. In this way, for M WAL files and N to-be-restored partitions, the RS node 103 generates (M*N) sub-files.
  • The RS node 103 can then package multiple sub-files into one recovery file, so that when it transmits the sub-files to the file system 300, the file transfer process needs to be performed only once, which reduces the number of interactions between the RS node 103 and the file system 300. Moreover, changes to existing solutions are kept to a minimum, which improves the feasibility of implementing the solution.
  • the RS node 103 may write the created N sub-files (corresponding to the N partitions to be restored) into the corresponding data recording areas in the restored file.
  • In this case, the index information may specifically be the position offset (that is, the data recording area) at which the sub-file corresponding to each to-be-restored partition is stored in the recovery file.
  • the sub-file of each partition to be restored is used to record the data processing records belonging to the partition to be restored, and the data processing records recorded by different sub-files are different.
  • the N sub-files corresponding to the multiple WAL files may all be written into the same recovery file, which is not limited in this embodiment.
  • S504 The RS node 103 transmits the recovery file to the file system 300, so that the file system 300 can persistently store the recovery file.
  • the RS node 103 can perform split processing on a plurality of WAL files based on the above process, so that data processing records belonging to each to-be-restored partition can be recovered, and the recovered files can be persistently stored in the file system. In this way, when the RS node 103 performs data recovery for the N partitions to be recovered based on the M WAL files, the number of files (or the number of times of transferring files) to the file system 300 does not exceed M.
  • The RS node 103 can replay the data processing operations in each data processing record corresponding to a to-be-restored partition, thereby recovering the data that was lost from the memory 1021 of the RS node 102 when the failure occurred.
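  • Replaying records can be sketched as applying each operation, in order, to an initially empty in-memory map. The Python below is illustrative only: the `(op, key, value)` tuple shape and the operation names `put` and `delete` are assumptions, since the text does not define the record format.

```python
def replay(records):
    """Apply (op, key, value) records in order and return the rebuilt in-memory map."""
    memstore = {}
    for op, key, value in records:
        if op == "put":
            memstore[key] = value
        elif op == "delete":
            memstore.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return memstore
```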
  • The RS node 103 can then provide the client 200 with services such as reading, writing, or deleting data, for example queries of previously written data, based on the recovered data to be written that belongs to each to-be-restored partition. While the RS node 103 is performing data recovery, it may still provide the client 200 with services such as writing data and deleting data.
  • After the RS node 103 has persistently stored the recovery file corresponding to the WAL file in the file system 300, it can bring the to-be-restored partitions back online based on these recovery files. That is, the RS node 103 can again provide services such as reads and writes to the clients of the distributed database 100 based on the to-be-restored partitions, and it informs the master node 101 that the to-be-restored partitions are now among the partitions running normally, so that the master node 101 can manage them.
  • For example, when the RS node 103 brings a target to-be-restored partition online, it can read the recovery file corresponding to that partition from the file system 300 and bring the partition back online based on the data processing records in the recovery file.
  • Alternatively, the RS node 103 may bring the to-be-restored partition back online based on the recovery file in its cache.
  • Specifically, the RS node 103 is configured with a cache, and after generating the recovery file based on the WAL file, the RS node 103 may store the recovery file in its cache before transmitting it to the file system.
  • The RS node 103 can then read the recovery file directly from the cache and use the data processing records in it that belong to the target to-be-restored partition to provide data read services (including queries, modifications, and other operations that involve reading data) for the client 200.
  • In this way, when the RS node 103 brings the target to-be-restored partition back online, it does not need remote calls to read the recovery file and related information (such as file size and data length) from the distributed file system 300, which reduces system calls and the corresponding resource consumption.
  • After the recovery file has been transmitted to the file system 300, the RS node 103 can clear the cache, so as to release the cache resources of the RS node 103 and avoid, as far as possible, occupying them for a long time during the data recovery process.
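  • The cache-then-persist behavior described above can be sketched as a small class. This is a hypothetical Python illustration: `RecoveryCache` and its methods are invented names, and the `upload` callable stands in for whatever mechanism actually transmits a file to the file system.

```python
class RecoveryCache:
    """Holds recovery files locally until they have been persisted."""

    def __init__(self, upload):
        self._upload = upload  # callable(name, data) that persists one file
        self._files = {}

    def put(self, name, data):
        """Cache a freshly generated recovery file."""
        self._files[name] = data

    def read(self, name):
        """Serve a read from the cache; raises KeyError once the cache is cleared."""
        return self._files[name]

    def flush(self):
        """Upload every cached file, then clear the cache to release its memory."""
        for name, data in self._files.items():
            self._upload(name, data)
        self._files.clear()
```

  • Reads issued between `put` and `flush` are served locally; after `flush`, a caller would have to fall back to the file system.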
  • The format of the recovery file may be, for example, an archive file format, such as the "*.har" format or the ".tar" format. This embodiment does not limit the specific archive file format.
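  • As one possible illustration of packaging sub-files into a single archive, the following Python sketch uses the standard-library `tarfile` module with an in-memory buffer. The tar format is chosen here only because it is one of the formats named above; the function names and the sub-file names are hypothetical, and the sketch does not reflect how a "*.har" archive would be built.

```python
import io
import tarfile


def pack_subfiles(subfiles):
    """Pack a {name: bytes} mapping of sub-files into one in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in subfiles.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the size must be set before adding the member
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_subfile(archive, name):
    """Extract the bytes of one sub-file from the archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.extractfile(name).read()
```

  • With this approach, N sub-files travel to the file system as a single archive, and a reader can still pull out the sub-file of one partition by name.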
  • After the RS node 103 brings the target to-be-restored partition back online, it can also notify the master node 101 to update, in the management table, the partitions that are allocated to the RS node 103 and able to provide services, so that the master node 101 can continue to manage the partitions allocated to each RS node based on the updated management table.
  • For example, if the master node 101 determines from the updated management table that too many partitions are allocated to some RS nodes, some of the partitions on those RS nodes can be transferred to other RS nodes to balance the allocation of partitions across the RS nodes.
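  • The balancing idea can be sketched as follows. The text does not specify a balancing policy, so this Python example assumes a simple hypothetical rule: any node holding more than `max_per_node` partitions hands partitions to the currently least-loaded node. The management table is modeled as a dictionary from node name to a list of partitions.

```python
def rebalance(assignment, max_per_node):
    """Move partitions off overloaded nodes onto the least-loaded node, in place."""
    for node, parts in assignment.items():
        while len(parts) > max_per_node:
            target = min(assignment, key=lambda n: len(assignment[n]))
            if target == node or len(assignment[target]) >= max_per_node:
                return assignment  # no node has spare capacity; stop here
            assignment[target].append(parts.pop())
    return assignment
```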
  • The RS node 103 can also update the partition storage files corresponding to each to-be-restored partition, specifically by merging and deleting files in the directory of the partition storage files, so as to remove old versions of the data stored in the partition.
  • the data recovery process performed by the RS node 103 that is not faulty is taken as an example for description.
  • data recovery for the RS node 102 may also be performed by a device independently configured in the distributed database 100 .
  • the master node 101 may also instruct the RS node 102 to perform data recovery.
  • Below, another data recovery method provided by the embodiments of the present application is described, taking as an example the case in which the RS node 102 resumes operation after a failure and recovers the data of its own partitions.
  • Refer to FIG. 7, which is a schematic flowchart of a data recovery method.
  • the method is mainly applied to the RS node 102, and the method may specifically include:
  • After determining that the RS node 102 is faulty, the master node 101 further determines whether the RS node 102 resumes operation within a preset time period.
  • That is, the master node 101 can first wait to see whether the RS node 102 resumes normal operation within a preset time period (e.g., 3 minutes or 5 minutes) after the fault occurs. If the RS node 102 resumes operation in time, the master node 101 can arrange for the RS node 102 to perform the recovery itself; if the RS node 102 fails to resume operation in time, the master node 101 can arrange for another RS node (e.g., the RS node 103 in the aforementioned embodiment) to perform the data recovery.
  • the master node 101 may send a data restoration instruction to the RS node 102 to instruct the RS node 102 to restore the data by itself.
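  • The master node's choice of which node performs recovery can be sketched as a small decision function. This Python example is hypothetical: the function name, its parameters, and the rule of picking the first healthy node are assumptions made for illustration.

```python
def choose_recovery_node(failed_node, resumed_after_s, timeout_s, healthy_nodes):
    """Pick the node that should perform data recovery.

    resumed_after_s is how long the failed node took to resume operation,
    or None if it has not resumed.
    """
    if resumed_after_s is not None and resumed_after_s <= timeout_s:
        return failed_node  # the node came back in time and recovers its own data
    if not healthy_nodes:
        raise RuntimeError("no healthy RS node available for recovery")
    return healthy_nodes[0]  # otherwise delegate to another RS node
```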
  • the RS node 102 obtains the WAL file created by the RS node 102 from the file system 300, where the WAL file includes data processing records belonging to multiple to-be-restored partitions in the RS node 102.
  • The RS node 102 accesses the WAL folder corresponding to the RS node 102 in the file system 300 according to the received data recovery instruction and reads the WAL files it created from that folder. Normally, the WAL files created by the RS node 102 while writing data for the client 200 are added to the WAL folder corresponding to the RS node 102. Therefore, the RS node 102 can use these WAL files to perform data recovery for its multiple to-be-restored partitions.
  • Because the RS node 102 can read the WAL file from the local file system 300, it can obtain the WAL file without remote access. In this way, remote access to the distributed file system can be effectively reduced.
  • the RS node 102 can also filter the acquired WAL files to filter out some WAL files unrelated to the data written in the memory 1021 before the failure of the RS node 102, thereby reducing the number of WAL files involved in the data recovery calculation. In turn, the amount of computation that needs to be performed in the data recovery process can be reduced, and resource consumption can be reduced.
  • the RS node 102 writes data processing records belonging to multiple partitions to be restored in the WAL file into the restoration file, wherein different data recording areas in the restoration file record data processing records belonging to different partitions to be restored.
  • the RS node 102 can directly write the data processing records belonging to each partition to be restored in the WAL file into the pre-created restoration file, and the restoration file includes a plurality of different data recording areas, and each The to-be-restored partitions all correspond to at least one data recording area in the restoration file, and data recording areas corresponding to different to-be-restored partitions may not overlap.
  • When the RS node 102 writes a data processing record belonging to a to-be-restored partition into the recovery file, it may specifically write the record into the data recording area corresponding to that partition in the recovery file.
  • In this way, the RS node 102 can subsequently determine the to-be-restored partition to which a data processing record belongs based on the position of the record in the recovery file.
  • the corresponding relationship between the partition to be restored and the data recording area in the restored file can be recorded by corresponding index information, and the index information can be integrated into the restored file.
  • the RS node 102 may create a sub-file for each partition to be restored when splitting the data processing records in each WAL file, and the sub-file corresponding to each partition to be restored The file is used to record data processing records belonging to the to-be-restored partition in the WAL file.
  • In this way, for M WAL files and N to-be-restored partitions, the RS node 102 may create (M*N) sub-files.
  • The RS node 102 may then add multiple sub-files to the same recovery file. For example, the RS node 102 may package N sub-files into one recovery file, so that the (M*N) sub-files are packaged into M recovery files.
  • The sub-file corresponding to a target to-be-restored partition may be added by the RS node 102 to the data recording area corresponding to that partition in the recovery file, and the one-to-one correspondence between to-be-restored partitions and data recording areas in the recovery file can be recorded through the index information in the recovery file.
  • When the RS node 102 transmits the sub-files to the file system 300, because they have been packaged into one recovery file, the RS node 102 needs to perform the file transfer process only once, which reduces the number of interactions between the RS node 102 and the file system 300. Moreover, changes to existing solutions are kept to a minimum, which improves the feasibility of implementing the solution.
  • the RS node 102 transmits the recovery file to the file system 300 for persistent storage, and realizes the re-online of the partition to be recovered.
  • the RS node 102 when the RS node 102 re-launches the target to-be-restored partition, the RS node 102 can read the recovery file corresponding to the target to-be-restored partition from the file system 300, and re-launch the to-be-restored partition based on the data processing records in the recovery file.
  • Alternatively, the generated recovery file may be stored in the cache, so that the RS node 102 can read the recovery file directly from the cache and use the data processing records in it that belong to the target to-be-restored partition to provide the client 200 with a data read service.
  • FIG. 8 is a schematic structural diagram of a data recovery apparatus provided by the present application.
  • The data recovery apparatus 800 can be applied to any node in the above-mentioned distributed database, such as the RS node 102 that resumes operation after a failure or an RS node that has not failed.
  • the data recovery device 800 includes:
  • the obtaining module 801 is configured to obtain, after the service crash processing process is started, the write-ahead log (WAL) file created by the first partition server (RS) node in the distributed database before the failure occurred, where the WAL file includes data processing records of multiple to-be-restored partitions belonging to the first RS node;
  • the writing module 802 is configured to write the data processing records of each to-be-restored partition in the WAL file into the recovery file, where different data recording areas in the recovery file record the data processing records belonging to different to-be-restored partitions;
  • the transmission module 803 is configured to transmit the recovery file to the file system for persistent storage.
  • the data recovery apparatus 800 is applied to the first RS node (such as the RS node 102 in FIG. 1) after failure recovery, or to another device in the distributed database, where the other device includes a second RS node in the distributed database that has not failed (such as the RS node 103 in FIG. 1), or a device in the distributed database that is dedicated to performing fault data recovery.
  • the restoration file includes index information, where the index information is used to indicate the position offset of the data recording area corresponding to each partition to be restored in the restoration file.
  • the index information is specifically used to indicate the position offsets, in the recovery file, of the sub-files respectively corresponding to the multiple to-be-restored partitions, where a sub-file stores the data processing records of a to-be-restored partition and different sub-files store different data processing records.
  • the data recovery apparatus 800 further includes:
  • a storage module 804 configured to store the restored file in a cache before transmitting the restored file to the file system
  • the service module 805 is configured to provide services for the clients of the distributed database by using the restored files stored in the cache.
  • the device further includes:
  • the data clearing module 806 is configured to clear the restored file stored in the cache after the restored file is transmitted to the file system.
  • the acquiring module 801 is further configured to acquire a data recovery instruction sent by the master node, where the data recovery instruction is used to instruct that data recovery be performed for multiple to-be-restored partitions in the first RS node.
  • The data recovery apparatus 800 may correspondingly execute the methods described in the embodiments of the present application, and the above-mentioned and other operations and/or functions of the modules of the data recovery apparatus 800 respectively implement the corresponding processes of the methods executed by the RS node 102 or the RS node 103 in FIG. 5 and FIG. 7, which are not repeated here for brevity.
  • Figure 9 provides a computing device.
  • The computing device 900 may be, for example, the RS node 102 or the non-faulty RS node 103 in the foregoing embodiments, or a device in the distributed database 100 that is dedicated to performing fault data recovery, and the computing device 900 can be used to implement the functions of the data recovery apparatus 800 in the embodiment shown in FIG. 8.
  • Computing device 900 includes bus 901 , processor 902 and memory 903 .
  • a bus 901 communicates between the processor 902 and the memory 903.
  • the bus 901 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • the processor 902 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and the like.
  • the memory 903 may include volatile memory, such as random access memory (RAM).
  • the memory 903 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 903 stores executable program codes, and the processor 902 executes the executable program codes to execute the data recovery method performed by the RS node 102 or the RS node 103 to which the data recovery apparatus 800 is applied.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store, or a data storage device such as a data center that contains one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state drives), and the like.
  • the computer-readable storage medium includes instructions that instruct a computing device to perform the data recovery method described above.
  • the embodiments of the present application also provide a computer program product.
  • the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave).
  • the computer program product can be a software installation package, which can be downloaded and executed on a computing device when any of the aforementioned data recovery methods needs to be used.

Abstract

A data recovery method, apparatus and device, a medium and a program product. After a service crash processing flow is started, a WAL created by a first RS node in a distributed database before a fault occurs is acquired, the WAL comprising data processing records of a plurality of regions to be subjected to recovery, which regions belong to the first RS node; and then the data processing records, which belong to each region to be subjected to recovery, in an acquired WAL file are written in a recovery file, wherein different data recording areas in the recovery file record the data processing records belonging to different regions to be subjected to recovery, so that the recovery file having the data processing records can be further transmitted to a file system (300) for persistent storage. Therefore, the number of files needing to be transmitted to the file system (300) during a data recovery process can be effectively reduced, such that IO for transmitting files to the file system (300) can be reduced, resource consumption can be decreased, and the data recovery efficiency of a distributed database can be improved.

Description

A data recovery method, apparatus, device, medium, and program product
This application claims priority to Indian Patent Application No. 202131001638, entitled "DISTRIBUTED DATABASE SYSTEM RECOVERY MECHANISM OPTIMIZATION METHOD AND SYSTEM" and filed with the Indian Patent Office on January 13, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of databases, and in particular, to a data recovery method, apparatus, device, medium, and program product.
Background
A distributed database, such as an HBase database, usually includes a master node and at least one region server (RS) node. The master node allocates to each RS node the partitions (regions) that the RS node is responsible for, and each RS node may be allocated one or more partitions. The RS node executes the corresponding data write process according to data write requests that belong to the partitions it is responsible for. When writing data, the RS node can first write the data write request into a write-ahead log (WAL) and, after that write succeeds, insert the data to be written into the memory of the RS node. When the amount of data in the memory of the RS node reaches a preset threshold, the RS node can persistently store the data in the memory.
When the master node detects that an RS node has failed, for example because the RS node has lost power or has been restarted for maintenance, the master node recovers the unpersisted data of the RS node according to the WAL. However, a large number of files are generated while the data of the RS node is being recovered. As a result, when the generated files are transmitted from the RS node to the file system for persistent storage, there are a large number of file transfer interactions between the RS node and the file system, which leads to low data recovery efficiency and high resource consumption for the RS node.
Summary
In view of this, the embodiments of the present application provide a data recovery method to improve the efficiency with which an RS node recovers data and to reduce resource consumption. The present application also provides a corresponding apparatus, device, computer-readable storage medium, and computer program product.
In a first aspect, an embodiment of the present application provides a data recovery method. Specifically, after a service crash processing process is started, the WAL created by a first RS node in a distributed database before a failure occurred is obtained, where the WAL includes data processing records of multiple to-be-restored partitions belonging to the first RS node. The data processing records in the obtained WAL file that belong to each to-be-restored partition are then written into a recovery file, where different data recording areas in the recovery file record the data processing records belonging to different to-be-restored partitions, so that the recovery file containing the data processing records can be further transmitted to a file system for persistent storage.
Because the data processing records of every to-be-restored partition are written into the corresponding data recording areas of the same recovery file, only one recovery file needs to be transmitted to the file system when data recovery is performed based on one (or more) WAL files, and multiple transfers for multiple separate files are not needed. This effectively reduces the number of files that need to be transmitted to the file system 300 during data recovery, which reduces the IO for transmitting files to the file system, lowers resource consumption, and increases the scalability of the distributed database. At the same time, the data recovery efficiency of the distributed database improves because the number of interactions with the file system is reduced.
在一种可能的实施方式中,针对第一RS节点的多个待恢复分区的数据恢复方法,可以故障恢复后的第一RS节点执行,即第一RS节点在发生故障并且恢复运行后,可以自行执行数据恢复过程。或者,数据恢复方法也可以是由分布式数据库中的其它设备执行,该其它设备包括分布式数据库中未发生故障的第二RS节点,该第二RS节点表征未发生故障的任意RS节点;或者,该其它设备可以是指分布式数据库中的特定用于执行故障数据恢复的设备,可以由技术人员预先配置于分布式数据库中。In a possible implementation manner, the data recovery method for the multiple partitions to be recovered of the first RS node may be performed on the first RS node after failure recovery, that is, after the first RS node fails and resumes operation, it may Perform the data recovery process yourself. Alternatively, the data recovery method may also be performed by other devices in the distributed database, where the other devices include a second RS node in the distributed database that is not faulty, and the second RS node represents any RS node that is not faulty; or , the other device may refer to a specific device in the distributed database for performing fault data recovery, and may be pre-configured in the distributed database by a technician.
In a possible implementation, the recovery file includes index information that indicates the position offset, within the recovery file, of the data record area corresponding to each partition to be recovered. When the data in a partition to be recovered is restored from the data processing records stored in the recovery file, the index information can be used to locate the data processing records belonging to each partition to be recovered, and thereby to identify the data processing records in the recovery file that belong to that partition.
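As a rough illustration of the index information described above, the following Python sketch packs data processing records into a single recovery file and reads one partition's records back through an offset index. The byte layout (a 4-byte length prefix, a JSON index, then the data record areas), the function names, and the string partition identifiers are all assumptions made for illustration; the application does not specify an on-disk format.

```python
import json
import struct
from collections import defaultdict


def build_recovery_file(wal_records):
    """Pack (partition_id, record) pairs from a WAL into one bytes object.

    Hypothetical layout: [4-byte index length][JSON index][area 0][area 1]...
    The index maps partition_id -> [offset, length] within the data section.
    """
    by_partition = defaultdict(list)
    for partition_id, record in wal_records:
        by_partition[partition_id].append(record)

    index, areas, offset = {}, [], 0
    for partition_id in sorted(by_partition):
        # One data record area per partition; records are newline-separated.
        area = "\n".join(by_partition[partition_id]).encode("utf-8")
        index[partition_id] = [offset, len(area)]
        areas.append(area)
        offset += len(area)

    header = json.dumps(index).encode("utf-8")
    return struct.pack(">I", len(header)) + header + b"".join(areas)


def read_partition_records(recovery_file, partition_id):
    """Use the index to read only the area that belongs to one partition."""
    (header_len,) = struct.unpack(">I", recovery_file[:4])
    index = json.loads(recovery_file[4:4 + header_len])
    offset, length = index[partition_id]
    start = 4 + header_len + offset
    return recovery_file[start:start + length].decode("utf-8").split("\n")
```

Reading a partition touches only the index and that partition's data record area. The sketch assumes that records contain no newline characters.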
In a possible implementation, the index information specifically indicates the position offsets, within the recovery file, of the sub-files corresponding to the multiple partitions to be recovered. A sub-file stores the data processing records of a partition to be recovered, and different sub-files store different data processing records. The index information can thus be used to identify the sub-file in the recovery file that belongs to each partition to be recovered, so that the data processing records in that sub-file can be used to recover the data of the partition. In addition, although a relatively large number of sub-files is generated during data recovery, packing the sub-files into one recovery file reduces the number of files transferred to the file system, which reduces the IO of the first RS node for transferring files to the file system, lowers resource consumption, and improves the scalability of the distributed database.
In a possible implementation, when the recovery file is used to store data processing records, no sub-files need to be generated to hold the data processing records belonging to the partitions to be recovered; instead, the data processing records may be stored directly in the recovery file. This reduces the number of files that must be created during data recovery based on WAL files, which cuts down on file creation, move, and delete operations and improves data recovery efficiency.
In a possible implementation, before the recovery file is transferred to the file system, the recovery file may be stored in a cache, so that the cached recovery file can be used to serve the clients of the distributed database, for example by providing read and write services to clients that access the distributed database. When providing services, the distributed database then does not need to obtain the recovery file by reading the file system remotely, which reduces the consumption of remote bandwidth.
In a possible implementation, after the recovery file is transferred to the file system, the recovery file stored in the cache is cleared. This releases cache resources and avoids, as far as possible, long-term occupation of cache resources during data recovery.
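The cache behavior described in the two implementations above (serve from the cached recovery file before transfer, clear the cache after transfer) might look like the following minimal sketch. The class name, its methods, and the use of a plain dict to stand in for the file system are hypothetical and chosen only for illustration.

```python
class RecoveryFileCache:
    """Holds recovery files locally until they are persisted."""

    def __init__(self, file_system):
        # Assumption: the file system is modeled as a dict of name -> bytes.
        self._file_system = file_system
        self._cache = {}

    def stage(self, name, data):
        """Keep the recovery file in the cache before it is transferred."""
        self._cache[name] = data

    def read(self, name):
        """Serve from the cache when possible; otherwise read remotely."""
        if name in self._cache:
            return self._cache[name]
        return self._file_system[name]

    def flush(self, name):
        """Transfer the file to the file system, then release the cache entry."""
        self._file_system[name] = self._cache[name]
        del self._cache[name]

    def cached_names(self):
        return sorted(self._cache)
```

The cache entry is deleted only after the file has been written to the file system, so a read issued at any point still finds the data in one of the two places.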
In a possible implementation, during data recovery, a data recovery instruction sent by the master node may first be obtained, and the data recovery process is then carried out as directed by that instruction. The data recovery instruction instructs that data recovery be performed on the multiple partitions to be recovered in the first RS node. For example, after detecting that the first RS node has failed, the master node may issue the data recovery instruction for the first RS node.
In a possible implementation, the format of the recovery file is an archive file format, such as the "*.har" format or the "*.tar" format.
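For the tar case, a minimal sketch using Python's standard `tarfile` module shows how per-partition sub-files could be packed into one archive and read back individually. The sub-file names and helper functions are hypothetical, and the HAR format is not covered here.

```python
import io
import tarfile


def pack_subfiles(subfiles):
    """Pack {name: bytes} sub-files into a single in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in subfiles.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding the member
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_subfile(archive_bytes, name):
    """Read one sub-file out of the archive by its member name."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return archive.extractfile(name).read()
```

Only the single archive needs to be transferred; each partition's sub-file remains addressable by name inside it.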
In a second aspect, this application provides a data recovery apparatus. The data recovery apparatus includes: an acquisition module, configured to acquire, after a server crash procedure (SCP) is started, a write-ahead log (WAL) file created by a first partition server (RS) node in a distributed database before the first RS node failed, where the WAL file includes data processing records belonging to multiple partitions to be recovered in the first RS node; a writing module, configured to write the data processing records of each partition to be recovered in the WAL file into a recovery file, where different data record areas in the recovery file record data processing records belonging to different partitions to be recovered; and a transmission module, configured to transfer the recovery file to a file system for persistent storage.
In a possible implementation, the apparatus is applied to the first RS node after it recovers from the failure, or to another device in the distributed database, where the other device includes a second RS node in the distributed database that has not failed, or a device in the distributed database dedicated to performing failure data recovery.
In a possible implementation, the recovery file includes index information that indicates the position offset, within the recovery file, of the data record area corresponding to each partition to be recovered.
In a possible implementation, the index information specifically indicates the position offsets, within the recovery file, of the sub-files corresponding to the multiple partitions to be recovered, where a sub-file stores the data processing records of a partition to be recovered and different sub-files store different data processing records.
In a possible implementation, the apparatus further includes: a storage module, configured to store the recovery file in a cache before the recovery file is transferred to the file system; and a service module, configured to serve the clients of the distributed database using the recovery file stored in the cache.
In a possible implementation, the apparatus further includes: a data clearing module, configured to clear the recovery file stored in the cache after the recovery file is transferred to the file system.
In a possible implementation, the acquisition module is further configured to acquire a data recovery instruction sent by the master node, where the data recovery instruction instructs that data recovery be performed on the multiple partitions to be recovered in the first RS node.
In a third aspect, this application provides a computing device that includes a processor, a memory, and a display. The processor and the memory communicate with each other. The processor is configured to execute instructions stored in the memory, so that the computing device performs the data recovery method of the first aspect or any implementation of the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium that stores instructions which, when run on a computing device, cause the computing device to perform the data recovery method of the first aspect or any implementation of the first aspect.
In a fifth aspect, this application provides a computer program product containing instructions which, when run on a computing device, cause the computing device to perform the data recovery method of the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the foregoing aspects, this application may further combine them to provide more implementations.
Description of Drawings
To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below are merely some of the embodiments recorded in this application, and a person of ordinary skill in the art may derive other drawings from them.
FIG. 1 is a schematic architecture diagram of an exemplary distributed database of this application;
FIG. 2 is a schematic diagram of splitting the data processing records in a WAL file that belong to different partitions;
FIG. 3 is a schematic diagram of the relationship between the MTTR of the distributed database 100 and the number of RS nodes;
FIG. 4 is a schematic diagram of storing different data processing records of a WAL file in different data record areas of a recovery file;
FIG. 5 is a schematic flowchart of a data recovery method according to an embodiment of this application;
FIG. 6 is a schematic diagram of the correspondence between partitions to be recovered and data record areas in a recovery file;
FIG. 7 is a schematic flowchart of another data recovery method according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a data recovery apparatus according to an embodiment of this application;
FIG. 9 is a schematic diagram of the hardware structure of a computing device according to an embodiment of this application.
Detailed Description
The solutions in the embodiments provided in this application are described below with reference to the accompanying drawings of this application.
The terms "first", "second", and the like in the specification, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely a way of distinguishing objects with the same attributes when describing the embodiments of this application.
FIG. 1 is a schematic architecture diagram of an exemplary distributed database 100. The distributed database 100 includes a master node 101, an RS node 102, and an RS node 103. The master node 101 divides the data stored and managed by the distributed database 100 into multiple partitions; each partition contains one or more pieces of data, and data belonging to different partitions generally differs. As one example of how partitions may be divided, when the distributed database 100 stores and manages a piece of data, part of the content of that piece of data may be used as its primary key, which uniquely identifies the piece of data in the distributed database 100. The master node 101 can then divide the range of possible primary key values into intervals, each of which corresponds to one partition. For example, if the primary key values in the distributed database 100 range over [0, 1000000], the master node 101 may divide this range into 100 intervals: [0, 10000), [10000, 20000), ..., [980000, 990000), [990000, 1000000]. Each partition can store 10,000 pieces of data, so with these 100 partitions the distributed database 100 can store and manage 1 million pieces of data. The master node 101 also allocates partitions to the RS node 102 and the RS node 103, and the partitions allocated to each RS node can be maintained in a management table created by the master node 101. The RS node 102 and the RS node 103 handle data read and write services for different partitions; in FIG. 1, the RS node 102 handles data read and write services for partition 1 through partition N, while the RS node 103 handles those for partition N+1 through partition M. Note that FIG. 1 uses a distributed database 100 with one master node 101 and two RS nodes as an example; in other possible distributed databases 100, the numbers of master nodes 101 and RS nodes may be any values, which this application does not limit.
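The interval-based partitioning in the example above can be sketched in Python as follows. The function names are hypothetical, and equal-width intervals with a closed final interval are assumed to match the [0, 1000000] example; a real system may choose boundaries differently.

```python
import bisect


def make_boundaries(key_min, key_max, partition_count):
    """Return the interior split points for equal-width key intervals."""
    width = (key_max - key_min) // partition_count
    return [key_min + i * width for i in range(1, partition_count)]


def partition_for_key(boundaries, key):
    """Map a primary key to a zero-based partition index.

    A key equal to a boundary belongs to the interval that starts there,
    which matches half-open intervals such as [10000, 20000).
    """
    return bisect.bisect_right(boundaries, key)


boundaries = make_boundaries(0, 1_000_000, 100)
print(partition_for_key(boundaries, 10_000))     # second interval, index 1
print(partition_for_key(boundaries, 1_000_000))  # last interval, index 99
```

Because the lookup is a binary search over the split points, the cost of routing a key grows only logarithmically with the number of partitions.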
The master node 101, the RS node 102, and the RS node 103 can each be implemented in hardware or in software. As some examples, when the master node 101 and the RS nodes are implemented in hardware, the master node 101 and the multiple RS nodes may all be physical servers in the distributed database 100. That is, in an actual deployment, at least one server in the distributed database 100 may be configured as the master node 101, and other servers in the distributed database 100 may be configured as RS nodes. Alternatively, when the master node 101 and the RS nodes are implemented in software, the master node 101 and the multiple RS nodes may each be a process or a virtual machine running on one or more devices (such as servers).
In practice, the distributed database 100 may serve as a local resource and provide local data read and write services, through the master node 101, the RS node 102, and the RS node 103, to clients that access the distributed database 100. Alternatively, the distributed database 100 may be deployed in the cloud, in which case the master node 101, the RS node 102, and the RS node 103 can provide a cloud service for reading and writing data to clients that access the cloud.
The distributed database 100 may be connected to a client 200 and to a file system 300, for example through a communication protocol such as the HyperText Transfer Protocol (HTTP). When the client 200 needs to modify data on, or write new data to, the RS node 102, the client 200 may send a data write request to the RS node 102. The data write request carries the data to be written into the distributed database 100 (hereinafter, the data to be written) and the corresponding data processing operation (such as a write operation or a modify operation). After receiving the data write request, the RS node 102 may first generate a data processing record from the data to be written and the data processing operation in the request, and write that record into a pre-created WAL file. After confirming that the write to the WAL file succeeded, the RS node 102 persistently stores the WAL file in the file system 300; for example, the file system 300 may store WAL files using a data structure such as log-structured merge trees (LSM trees). The RS node 102 also inserts the data to be written from the data processing record into the memory 1021 of the RS node 102. For example, the RS node 102 may first determine the primary key corresponding to the data processing record and, based on the partition interval that the key's value falls in, determine which partition the record belongs to; the RS node 102 can then insert the data to be written into the storage area of the memory 1021 that corresponds to that partition. The RS node 102 may then report to the client 200 that the data was written successfully.
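The write path described above (log the record first, then update memory, then acknowledge) can be sketched as follows. The class, its fields, and the record shape are hypothetical; an in-memory list stands in for the WAL file and a dict stands in for the memory 1021, so persistence to the file system is not modeled.

```python
import bisect


class RegionServerSketch:
    """Toy model of an RS node's write path."""

    def __init__(self, boundaries):
        self.boundaries = boundaries  # split points between partitions
        self.wal = []                 # stands in for the WAL file
        self.memory = {}              # partition index -> {key: value}

    def write(self, key, value, operation="put"):
        # 1. Record the operation in the write-ahead log first.
        record = {"op": operation, "key": key, "value": value}
        self.wal.append(record)

        # 2. Find the partition whose key interval contains the key,
        #    then insert the data into that partition's memory area.
        partition = bisect.bisect_right(self.boundaries, key)
        self.memory.setdefault(partition, {})[key] = value

        # 3. Acknowledge the write to the client.
        return "ok"
```

If the node loses its memory after step 2, every acknowledged write is still present in the log, which is what makes recovery by replaying the WAL possible.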
Typically, the RS node 102 writes data into the memory 1021 for one or more clients 200, so the amount of data held temporarily in the memory 1021 keeps growing. When the amount of data in the memory 1021 reaches a preset threshold, the RS node 102 may persistently store the data in the memory 1021 to the file system 300, specifically by writing it to the file system 300 in the form of files. As some examples, the file system 300 may be a distributed file system (DFS), a Hadoop distributed file system (HDFS), or the like, which this embodiment does not limit.
Further, the RS node 102 is configured with a partition storage file (region store file) for each partition. After data is persistently stored in the file system 300, the RS node 102 may add the files in which each partition's data is stored in the file system 300 to the partition storage file of that partition, specifically by adding the file name corresponding to each piece of data in the partition under the directory of the partition storage file. The RS node 102 may then merge and delete the files included in each partition storage file to remove old versions of data. For example, if the client 200 requests the RS node 102 to write data A at time T1 and to replace data A with data B at time T2 (T1 < T2), the partition storage file may include file a corresponding to data A and file b corresponding to data B. When the RS node 102 merges the partition storage file, it may merge file a and file b into file b, that is, keep only file b, which corresponds to the newer data B.
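The merge step in this example amounts to keeping, for each key, only the entry with the latest timestamp. A minimal sketch, assuming each stored entry is a hypothetical (timestamp, key, value) tuple:

```python
def compact(entries):
    """Keep only the newest value for each key.

    entries: iterable of (timestamp, key, value) tuples, in any order.
    """
    latest = {}
    for timestamp, key, value in entries:
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, value)
    return {key: value for key, (_, value) in latest.items()}
```

With data A written at T1 and data B written at T2 for the same key, only data B survives the merge.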
In practice, RS nodes in the distributed database 100 inevitably experience failures such as power loss or maintenance restarts, which cause the data in the memory of the RS node to be lost. Therefore, when the master node 101 detects that the RS node 102 has failed, it usually starts a process called the server crash procedure (SCP) to perform data recovery for the RS node 102, specifically to recover the data lost from the memory 1021 of the RS node 102. The SCP process identifies, in the file system 300, all WAL files belonging to the failed RS node 102 and designates the normally running RS node 103 to split the data processing records in each WAL file. As shown in FIG. 2, the RS node 103 separates out the data processing records in a WAL file that belong to different partitions of the RS node 102 and then creates a separate recovery file for each partition to hold the records in the WAL file that belong to that partition. The data in the partition can then be recovered from the data processing records in the recovery file (for example, by replaying the data processing operations), and service to the client 200 can continue based on the recovered data. The RS node 103 then persistently stores the recovery file of each partition to the file system one by one. Specifically, to transfer each recovery file, the RS node 103 first sends a notification message to the file system 300 to notify it that a file transfer is needed; after receiving feedback from the file system 300, the RS node 103 transfers the recovery file to the file system 300; finally, after confirming that the recovery file was transferred successfully, the RS node 103 sends a close notification to the file system 300 to end the transfer of that recovery file.
Typically, for each WAL file the RS node 103 generates N recovery files to record the data in that WAL file belonging to N partitions. If M WAL files belong to the failed RS node 102, the RS node 103 must create (M*N) recovery files while recovering data for the RS node 102, and must perform the above file transfer process for each of the (M*N) recovery files in turn. In practice, if P RS nodes need data recovery, the total number of recovery files that must be generated and transferred to the file system 300 while the distributed database 100 recovers data for the P RS nodes is (M*N*P).
Because a large number of recovery files must be transferred to the file system during data recovery, and the RS node 103 must perform the above notify, transfer, and close steps for each recovery file, the input/output (IO) of the RS node 103 for transferring files to the file system 300 is heavy, and the system calls and resource consumption on the file system 300 are high. The large number of interactions between the RS node 103 and the file system 300 also makes data recovery in the distributed database 100 inefficient. Moreover, as the number of RS nodes in the distributed database 100 grows, the mean time to recover (MTTR) of the distributed database 100 increases by progressively larger amounts (for example, approximately exponentially), which limits the scalability of the distributed database 100. For example, in actual tests, data recovery results for 100 RS nodes show that the relationship between the MTTR of the distributed database 100 and the number of RS nodes it includes is as shown in FIG. 3: growth is approximately exponential, and the scalability of the distributed database 100 is low.
To this end, an embodiment of this application provides a data recovery method to improve data recovery efficiency and reduce resource consumption. Specifically, after the SCP process starts (that is, when data recovery begins), the RS node 103 first obtains the WAL files created by the RS node 102 before it failed. The WAL files include data processing records belonging to N partitions to be recovered in the RS node 102, where each partition to be recovered is a partition whose data was lost from the memory 1021 of the failed RS node 102. The RS node 103 may then write the data processing records of each partition to be recovered in each WAL file into a recovery file. The recovery file includes multiple data record areas, and different data record areas record data processing records belonging to different partitions to be recovered, as shown in FIG. 4. Finally, the RS node 103 persistently stores the recovery file, which consolidates the data processing records of one or more WAL files, to the file system 300. Accordingly, the data of each partition lost from the memory 1021 of the RS node 102 is the data to be written in the data processing records belonging to that partition in the recovery files.
Because the data processing records of every partition to be recovered are written into the corresponding data record areas of the same recovery file, data recovery based on one (or more) WAL files requires transferring only one recovery file to the file system, rather than performing a separate transfer for each of multiple files. Thus, when M WAL files belong to the failed RS node 102, the number of recovery files that must be transferred while recovering data for the RS node 102 drops from (M*N) to M (or to a value less than M). This effectively reduces the number of files that must be transferred to the file system 300 during data recovery, which in turn reduces the IO of the RS node 103 for transferring files to the file system 300, lowers resource consumption, and improves the scalability of the distributed database 100. The data recovery efficiency of the distributed database 100 also improves because there are fewer interactions between the RS node 103 and the file system 300.
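The file counts discussed above can be checked with a small calculation. The function name and the sample values (M = 10 WAL files, N = 50 partitions) are illustrative only and do not come from the application; extending the merged count to P failed nodes as M*P is likewise an assumption that each WAL file yields exactly one recovery file.

```python
def recovery_file_counts(wal_files, partitions, failed_nodes=1):
    """Return (files with one recovery file per partition, files when merged)."""
    per_partition = wal_files * partitions * failed_nodes  # M * N * P
    merged_upper_bound = wal_files * failed_nodes          # at most M per node
    return per_partition, merged_upper_bound


before, after = recovery_file_counts(wal_files=10, partitions=50)
print(before, after)  # 500 10
```

Each file saved also saves one round of the notify, transfer, and close exchange with the file system, so the number of interactions shrinks by the same factor N.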
Next, various non-limiting specific implementations of data recovery are described in detail.
FIG. 5 is a schematic flowchart of a data recovery method in an embodiment of this application. The method may be applied to any RS node in the distributed database 100 shown in FIG. 1 that can operate normally, including the RS node 103 and the RS node 102; for example, when the RS node 102 resumes operation after a restart, it may itself recover the data lost from the memory 1021 based on the WAL files. Alternatively, the method may be applied to a device that is separately configured in the distributed database 100 and dedicated to failure data recovery. This embodiment does not limit this. For ease of description, the following uses the RS node 103 performing the data recovery method as an example. The data recovery method shown in FIG. 5 may specifically include:
S501: The master node 101 determines that the RS node 102 has failed and instructs the RS node 103 to perform data recovery.
As an implementation example, each RS node in the distributed database 100 may periodically send a heartbeat message to the master node 101. When the master node 101 receives heartbeat messages from the RS node 102 normally, it can determine that the RS node 102 has not failed; when the master node 101 does not receive a heartbeat message from the RS node 102, it can determine that the RS node 102 has failed.
In other possible fault detection approaches, when the RS node 102 in the data processing system 100 fails but does not lose the ability to communicate with the master node 101 (for example, when the RS node restarts abnormally), the RS node 102 may send a failure notification to the master node 101 to inform it of the failure. This embodiment does not limit how the master node 101 detects a faulty RS node.
In practical applications, when the master node 101 determines that an RS node has failed, the master node 101 may start an SCP process, identify the failed RS node 102, and instruct another RS node that has not failed, namely the RS node 103 in FIG. 1, to perform the corresponding data recovery process. For example, the master node 101 may send a data recovery instruction to the RS node 103, instructing the RS node 103 to perform data recovery on the multiple to-be-restored partitions in the RS node 102.
In practical applications, when the distributed database 100 contains normally operating RS nodes other than the RS node 103, the master node 101 may instead instruct one of those RS nodes to perform the data recovery process. For example, when the distributed database 100 contains many RS nodes that have not failed, the master node 101 may, according to the load of each such RS node, instruct the RS node with the lowest load to perform the data recovery process.
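As a rough illustration of the load-based selection mentioned above, the sketch below picks the healthy RS node with the lowest load. The function name `pick_recovery_node` and the representation of load as a single number per node are assumptions made for illustration; the application does not specify how load is measured.

```python
def pick_recovery_node(loads, failed_node):
    """Return the least-loaded RS node, excluding the failed one.

    `loads` maps a node identifier to a load figure (hypothetical metric,
    lower means less busy).
    """
    candidates = {node: load for node, load in loads.items() if node != failed_node}
    if not candidates:
        raise ValueError("no healthy RS node available")
    # min() over the keys, ordered by each node's load value.
    return min(candidates, key=candidates.get)
```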
S502: The RS node 103 obtains the WAL file created by the RS node 102 before the failure, where the WAL file includes data processing records belonging to multiple to-be-restored partitions in the RS node 102.
For example, when instructing the RS node 103 to perform data recovery, the master node 101 may send the identifier of the failed RS node 102 to the RS node 103, such as the identity (ID) or factory serial number of the RS node 102, to indicate for which RS node the RS node 103 is to perform data recovery. The RS node 103 may then obtain the WAL files belonging to the faulty RS node 102 by accessing the file system 300. Because the WAL file was created by the RS node 102 before the failure, it records the data written to each partition of the RS node 102, so the RS node 103 can perform data recovery for the multiple partitions of the RS node 102 according to the obtained WAL file (for ease of distinction, the partitions in the RS node 102 are hereinafter referred to as to-be-restored partitions). The number of WAL files belonging to the RS node 102 may be one or more.
As an example of obtaining the WAL file, when storing data for the RS nodes, the file system 300 may create a folder for each RS node and add the WAL files created by an RS node to that node's folder. In this way, to obtain the WAL files created by the RS node 102, the RS node 103 may access the folder corresponding to the RS node 102 in the file system 300. Of course, the RS node 103 may also obtain the WAL file in other ways. For example, when the file system 300 stores the WAL files created by the RS nodes, the file name of each WAL file may include the identifier of the RS node (such as the name of the RS node), so that the WAL files carrying the identifier of the RS node 102 can be looked up in the file system 300. This embodiment does not limit how the RS node 103 obtains the WAL file.
S503: The RS node 103 writes the data processing records in the WAL file that belong to each to-be-restored partition into a recovery file, where different data record areas in the recovery file record data processing records belonging to different to-be-restored partitions.
As an example, after obtaining the WAL file, the RS node 103 may determine from the WAL file whether data was lost from the memory 1021 of the RS node 102 and, if so, recover the lost data according to the WAL file. For example, the RS node 103 may check whether the acquired WAL file contains data processing records without a persistence mark. If such records exist, the to-be-written data in them has not yet been persisted from the memory 1021 to the file system 300, so the RS node 103 may recover the data in the memory 1021 based on these unmarked records. Data processing records that carry a persistence mark indicate that their to-be-written data has already been persisted from the memory 1021 to the file system 300, so the RS node 103 does not need to restore that data to the memory 1021.
Further, among the WAL files acquired by the RS node 103, the data processing records in some WAL files may be unrelated to the data written into the memory 1021 before the RS node 102 failed. For example, the to-be-written data in the data processing records of some WAL files may be an old version, while the data in the memory 1021 before the RS node 102 failed is a new version; or the to-be-written data in the data processing records of some WAL files may belong to partitions that have already been deleted from the RS node 102. Because these data processing records contribute little to data recovery, the RS node 103 may also filter the acquired WAL files to reduce the number of WAL files involved in the data recovery computation, thereby reducing the amount of computation required and lowering resource consumption. For example, to filter out old-version data processing records, the RS node 103 may collect the data processing records in the WAL file that have the same primary key, identify the old-version records according to the timestamp of each record, and filter them out while retaining the new-version records. To filter out the data of deleted partitions, the RS node 103 may check whether the partition identifier (such as the partition name) in each data processing record in the WAL file matches the partition identifier of any partition currently to be restored. If the partition identifier of a data processing record matches none of them, it is usually the identifier of a partition that has been deleted from the RS node 102, and the RS node 103 filters out that data processing record.
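The two filtering rules above (dropping records of deleted partitions and keeping only the newest record per primary key) can be sketched as follows. The record layout `WalRecord` and the function name `filter_records` are hypothetical, and the sketch assumes that a larger timestamp always means a newer version; the application does not fix a record format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    """Hypothetical shape of one data processing record in a WAL file."""
    partition_id: str
    primary_key: str
    timestamp: int
    payload: bytes


def filter_records(records, partitions_to_restore):
    """Keep the newest record per (partition, primary key) for live partitions."""
    newest = {}
    for record in records:
        # Rule 1: skip records whose partition is not being restored
        # (typically a partition that has already been deleted).
        if record.partition_id not in partitions_to_restore:
            continue
        # Rule 2: among records with the same primary key, keep the latest.
        key = (record.partition_id, record.primary_key)
        if key not in newest or record.timestamp > newest[key].timestamp:
            newest[key] = record
    # Return survivors in timestamp order so they can be replayed in sequence.
    return sorted(newest.values(), key=lambda r: r.timestamp)
```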
Normally, the data processing records written to the RS node 102 that are recorded in the WAL file may belong to different to-be-restored partitions of the RS node 102, and before the failure the RS node 102 could provide data read (and write) services to the clients of different partitions based on the data belonging to those partitions. Accordingly, when performing data recovery, the RS node 103 may split the data processing records in the WAL file and determine the data processing records belonging to each to-be-restored partition.
In a possible implementation, the data processing records in the WAL file may exist in the form of key-value (KV) pairs, and different key-value pairs may belong to different to-be-restored partitions of the RS node 102. For example, the key of a key-value pair may indicate a to-be-restored partition in the WAL file, such as the identifier of that partition, and the value of the key-value pair is a data processing record belonging to that partition. When splitting the data processing records of each WAL file, the RS node 103 may then read each key-value pair in the WAL file and determine, from the key, the to-be-restored partition to which the value belongs.
The RS node 103 may store the data processing records in the WAL file that belong to different to-be-restored partitions in the same file (hereinafter referred to as the recovery file). That is, when splitting the data of one or more WAL files, the RS node 103 creates only one recovery file to record the data processing records of the different to-be-restored partitions. The recovery file includes multiple non-overlapping data record areas; each data record area records the data processing records belonging to one to-be-restored partition, and different data record areas record the data processing records of different to-be-restored partitions. The recovery file may include index information, which may indicate the position offset, within the recovery file, of the data record area corresponding to each to-be-restored partition.
In a possible implementation, the space between two adjacent position offsets indicated in the index information may serve as the data record area for the data processing records belonging to one to-be-restored partition. A position offset may be, for example, the start address of a data record area. For example, assume that the RS node 102 includes 4 to-be-restored partitions and that the space in the recovery file used to record data processing records spans logical address A to logical address E. The correspondence between the to-be-restored partitions and the data record areas in the recovery file may then be as shown in FIG. 6: the data processing records belonging to the to-be-restored partition f1 are stored in the data record area [logical address A, logical address B); those belonging to the to-be-restored partition f2 are stored in [logical address B, logical address C); those belonging to the to-be-restored partition f3 are stored in [logical address C, logical address D); and those belonging to the to-be-restored partition f4 are stored in [logical address D, logical address E]. The index information in the recovery file may include 4 key-value pairs: the key of key-value pair 1 is the identifier of the to-be-restored partition f1 and its value is logical address A, and the remaining key-value pairs follow the same pattern. In this way, once the data processing records obtained by splitting the WAL file are written into the data record area corresponding to their to-be-restored partition, the partition to which a record belongs can later be determined from the data record area in which it is located, without creating a separate file for each to-be-restored partition to tell the records apart. This reduces the number of files that the RS node 103 needs to create when performing data recovery based on the WAL file, which reduces file creation, movement, and deletion and improves data recovery efficiency.
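To make the layout above concrete, the following sketch builds a single in-memory recovery file whose index maps each partition identifier to the start offset of its data record area, and reads one partition's records back using adjacent offsets. It is a sketch under stated assumptions: the 4-byte length prefix before each record, the in-memory byte string, and the function names are illustrative choices not taken from the application.

```python
def build_recovery_file(records_by_partition):
    """Return (data, index); index maps partition id -> start offset in data.

    Each record is stored with an assumed 4-byte big-endian length prefix.
    """
    data = bytearray()
    index = {}
    for partition_id in sorted(records_by_partition):
        index[partition_id] = len(data)  # start of this partition's area
        for record in records_by_partition[partition_id]:
            data += len(record).to_bytes(4, "big") + record
    return bytes(data), index


def read_partition(data, index, partition_id):
    """Read the records of one partition from its data record area."""
    ids = list(index)  # index preserves the order in which areas were written
    position = ids.index(partition_id)
    start = index[partition_id]
    # The area ends where the next partition's area starts, or at end of file.
    end = index[ids[position + 1]] if position + 1 < len(ids) else len(data)
    records, cursor = [], start
    while cursor < end:
        size = int.from_bytes(data[cursor:cursor + 4], "big")
        cursor += 4
        records.append(data[cursor:cursor + size])
        cursor += size
    return records
```

Because each area is delimited by two adjacent offsets, a reader can fetch the records of one partition without scanning the areas of the others.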
In the above implementation, each data record area in the recovery file is used directly to store data processing records. In other possible implementations, the RS node 103 may instead create a separate sub-file for each to-be-restored partition, where the sub-file of a partition records the data processing records in the WAL file that belong to that partition. For M WAL files and N to-be-restored partitions, the RS node 103 then generates (M*N) sub-files. To limit the large number of interactions that transferring so many sub-files to the file system 300 would cause, the RS node 103 may package multiple sub-files into one recovery file, so that transferring those sub-files to the file system 300 requires only one file transfer. This reduces the number of interactions between the RS node 103 and the file system 300, and it keeps changes to existing solutions small, which improves the feasibility of implementation. As an implementation example, for each WAL file, the RS node 103 may write the N created sub-files (corresponding to the N to-be-restored partitions) into the corresponding data record areas of a recovery file. In this case, the index information in the recovery file may specifically be the position offset (that is, the data record area) at which the sub-file of each to-be-restored partition is stored in the recovery file. The sub-file of each to-be-restored partition records the data processing records belonging to that partition, and different sub-files record different data processing records. Of course, in other examples, the N sub-files corresponding to each of multiple WAL files may all be written into the same recovery file, which is not limited in this embodiment.
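One way to picture the packaging of sub-files is with an archive container. The sketch below uses Python's standard `tarfile` module purely as a stand-in for whatever archive format an implementation might choose; the function names and the use of partition identifiers as member names are assumptions. The data offset that `tarfile` records for each member plays the role of the index information.

```python
import io
import tarfile


def pack_subfiles(subfiles):
    """Pack per-partition sub-files (partition id -> bytes) into one archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for partition_id in sorted(subfiles):
            payload = subfiles[partition_id]
            info = tarfile.TarInfo(name=partition_id)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def subfile_offsets(blob):
    """Return partition id -> offset of that sub-file's data in the archive."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return {member.name: member.offset_data for member in archive.getmembers()}


def unpack_subfile(blob, partition_id):
    """Read back the sub-file of a single partition from the archive."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return archive.extractfile(partition_id).read()
```

With this arrangement, N sub-files derived from one WAL file travel to the file system as a single object, which matches the reduction from (M*N) transfers to M described above.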
S504: The RS node 103 transfers the recovery file to the file system 300 so that the file system 300 persistently stores the recovery file.
The RS node 103 may split multiple WAL files according to the above process, thereby recovering the data processing records belonging to each to-be-restored partition, and persistently store the recovery files in the file system. In this way, when the RS node 103 performs data recovery for N to-be-restored partitions based on M WAL files, the number of files transferred to the file system 300 (or the number of file transfers) does not exceed M.
For the data lost from the memory 1021 that belongs to each to-be-restored partition, the RS node 103 may replay the data processing operations in the data processing records of that partition to recover the data that was lost from the memory 1021 when the RS node 102 failed. The RS node 103 can then, based on the recovered to-be-written data of each to-be-restored partition, provide the client 200 with services for reading, writing, or deleting data, such as supporting queries by the client 200 on the to-be-written data in a to-be-restored partition. While the RS node 103 is performing data recovery, it may still provide the client 200 with services such as writing and deleting data.
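The replay step can be illustrated with a minimal sketch. It assumes, for illustration only, that each data processing record is an `(operation, key, value)` tuple with the operations `"put"` and `"delete"`, and that the lost memory content can be modeled as a dictionary; none of these names come from the application.

```python
def replay(records):
    """Rebuild an in-memory key-value view by replaying records in order."""
    memory = {}
    for operation, key, value in records:
        if operation == "put":
            memory[key] = value
        elif operation == "delete":
            # Deleting a key that is not present is treated as a no-op.
            memory.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return memory
```

Replaying the records in their original order matters: a later put or delete for the same key overrides the effect of an earlier one.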
Further, after persistently storing the recovery files corresponding to the WAL files in the file system 300, the RS node 103 may bring the to-be-restored partitions back online based on these recovery files. That is, the RS node 103 can again provide read, write, and other services to the clients of the distributed database 100 based on those partitions, and it notifies the master node 101 that the currently running partitions include them, so that the master node 101 can manage them.
In one example, when bringing a target to-be-restored partition online, the RS node 103 may read the recovery file corresponding to that partition from the file system 300 and bring the partition back online based on the data processing records in the recovery file.
In another example, the RS node 103 may bring the to-be-restored partition back online based on a recovery file held in a cache. Specifically, the RS node 103 is configured with a cache, and after generating the recovery file from the WAL file, the RS node 103 may store the recovery file in its cache before transferring it to the file system. When bringing the target to-be-restored partition back online, the RS node 103 can then read the recovery file directly from the cache and use the data processing records in it that belong to the target partition to provide data read services to the client 200 (including queries, modifications, and other operations that involve reading data). In this way, the RS node 103 does not need to read the recovery file and its related information (such as file size and data length) from the distributed file system 300 through remote calls when bringing the target partition back online, which reduces system calls and the corresponding resource consumption.
Further, once the recovery file stored in the cache of the RS node 103 has been persistently stored in the file system 300, or once the RS node 103 has finished bringing the to-be-restored partition back online using the recovery file in the cache, the RS node 103 may clear the cache to release its cache resources and avoid, as far as possible, occupying them for a long time during data recovery. As some examples, the recovery file may use an archive file format, such as the "*.har" format or the ".tar" format. This embodiment does not limit the specific archive file format.
In practical applications, after bringing the target to-be-restored partition back online, the RS node 103 may also notify the master node 101 to update, in the management table, the partitions that are allocated to the RS node 103 and for which it can provide services. The master node 101 can then further manage the partitions allocated to each RS node based on the updated management table. For example, if the master node 101 determines from the updated management table that some RS nodes have been allocated too many partitions, it may move some of the partitions on those RS nodes to other RS nodes to balance the allocation of partitions across the RS nodes.
Further, after persistently storing the recovery file in the distributed file system 300, the RS node 103 may also update the partition storage file corresponding to each to-be-restored partition, specifically by merging and deleting files under the directory of the partition storage file, so as to remove old-version data from the partition storage file.
It should be noted that the above embodiment is described using the example of the RS node 103, which has not failed, performing the data recovery process. In other possible embodiments, data recovery for the RS node 102 may also be performed by a device separately configured in the distributed database 100. Alternatively, when the RS node 102 resumes operation after a failure, the master node 101 may instruct the RS node 102 itself to perform data recovery. Another data recovery method provided by the embodiments of the present application is described below, taking as an example the RS node 102 recovering the data in its own partitions after resuming operation.
FIG. 7 is a schematic flowchart of a data recovery method. The method is mainly applied to the RS node 102 and may specifically include:
S701: After determining that the RS node 102 has failed, the master node 101 further determines whether the RS node 102 resumes operation within a preset time period.
S702: When it is determined that the RS node 102 has resumed operation within the preset time period, the master node 101 sends a data recovery instruction to the RS node 102 to instruct the RS node 102 to perform data recovery on the multiple to-be-restored partitions.
For the specific process by which the master node 101 detects a faulty RS node, refer to the description of step S501 in the foregoing embodiment; details are not repeated here.
In this embodiment, after determining that the RS node 102 has failed, the master node 101 may first wait to see whether the RS node 102 can resume normal operation within a preset time period after the failure (for example, 3 minutes or 5 minutes). If the RS node 102 resumes operation after the failure, the master node 101 may have the RS node 102 perform fault recovery by itself. If the RS node 102 fails to resume operation in time, the master node 101 may have another RS node (such as the RS node 103 in the foregoing embodiment) perform fault recovery for the to-be-restored partitions in the RS node 102.
When instructing the RS node 102 to recover the data by itself, the master node 101 may send a data recovery instruction to the RS node 102.
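The decision in S701 and S702 can be summarized in a small sketch. The function name `choose_recovery_node`, the representation of the restart as a delay in seconds (or `None` if the node never came back), and the choice of the first healthy node as fallback are all assumptions for illustration.

```python
def choose_recovery_node(failed_node, restart_delay, grace_period, healthy_nodes):
    """Decide which node should perform recovery for `failed_node`.

    restart_delay: seconds until the failed node resumed operation,
                   or None if it has not resumed (hypothetical input).
    grace_period:  the preset waiting period in seconds.
    """
    # The failed node recovers its own partitions if it came back in time.
    if restart_delay is not None and restart_delay <= grace_period:
        return failed_node
    if not healthy_nodes:
        raise RuntimeError("no healthy RS node available for recovery")
    # Otherwise fall back to another RS node (here simply the first one).
    return healthy_nodes[0]
```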
S703: The RS node 102 obtains, from the file system 300, the WAL file created by the RS node 102, where the WAL file includes data processing records belonging to multiple to-be-restored partitions in the RS node 102.
In a specific implementation, the RS node 102, according to the received data recovery instruction, accesses the WAL folder corresponding to the RS node 102 in the file system 300 and reads the WAL files that have been created from that folder. Normally, the WAL files created by the RS node 102 while writing data for the client 200 are all added to the WAL folder corresponding to the RS node 102, so the RS node 102 can perform data recovery for its multiple to-be-restored partitions according to the WAL files in that folder.
Notably, after resuming operation, the RS node 102 can read the WAL file from the local file system 300 instead of obtaining it from the file system 300 through remote access. This effectively reduces the remote network bandwidth that the distributed database occupies during data recovery.
Further, the RS node 102 may also filter the acquired WAL files to remove those that are unrelated to the data written into the memory 1021 before the RS node 102 failed. This reduces the number of WAL files involved in the data recovery computation, which in turn reduces the amount of computation required during data recovery and lowers resource consumption.
S704: The RS node 102 writes the data processing records in the WAL file that belong to the multiple to-be-restored partitions into a recovery file, where different data record areas in the recovery file record data processing records belonging to different to-be-restored partitions.
In this embodiment, the RS node 102 may write the data processing records in the WAL file that belong to each to-be-restored partition directly into a pre-created recovery file. The recovery file includes multiple different data record areas; each to-be-restored partition corresponds to at least one data record area in the recovery file, and the data record areas corresponding to different to-be-restored partitions may be non-overlapping. Accordingly, when writing a data processing record belonging to a to-be-restored partition into the recovery file, the RS node 102 may specifically write it into the data record area corresponding to that partition. The RS node 102 can later determine, from the position of the data in the recovery file, the to-be-restored partition to which a data processing record belongs. In practical applications, the correspondence between the to-be-restored partitions and the data record areas in the recovery file may be recorded by index information, which may be integrated into the recovery file.
Alternatively, in other possible implementations, when splitting the data processing records in each WAL file, the RS node 102 may create one sub-file for each partition to be recovered; the sub-file for a partition records the data processing records in that WAL file that belong to the partition. For M WAL files and N partitions to be recovered, the RS node 102 may therefore create (M*N) sub-files. In this embodiment, to handle this large number of sub-files, the RS node 102 may add multiple sub-files to the same recovery file. For example, the RS node 102 may package N sub-files into one recovery file, so that the (M*N) sub-files are packaged into M recovery files. Each recovery file includes multiple data recording areas and the corresponding index information. The sub-file corresponding to a target partition to be recovered may be added by the RS node 102 to the data recording area in the recovery file that corresponds to that target partition, and the one-to-one correspondence between partitions to be recovered and data recording areas may be recorded in the index information of the recovery file. In this way, because the multiple sub-files are packaged into one recovery file, the RS node 102 can transfer them to the file system 300 in a single file transfer, which reduces the number of interactions between the RS node 102 and the file system 300. It also keeps changes to the existing solution to a minimum, which improves the feasibility of implementing the solution.
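The arithmetic of this variant (M*N sub-files, M recovery files, M transfers) can be sketched as below. The helper names `split_wal`, `pack_subfiles`, and `package_recovery_files`, the newline-delimited record format, and the in-memory representation are hypothetical; records for partitions outside the recovery set are simply skipped in this sketch.

```python
def split_wal(wal_records, partitions):
    """Return one sub-file (bytes) per partition for a single WAL file."""
    subfiles = {partition: bytearray() for partition in partitions}
    for partition, payload in wal_records:
        if partition in subfiles:  # ignore partitions that are not being recovered
            subfiles[partition] += payload + b"\n"
    return {partition: bytes(buf) for partition, buf in subfiles.items()}


def pack_subfiles(subfiles):
    """Concatenate sub-files into one recovery file and index their positions."""
    blob = bytearray()
    index = {}
    for partition, content in subfiles.items():
        index[partition] = (len(blob), len(content))  # (offset, length)
        blob += content
    return bytes(blob), index


def package_recovery_files(wals, partitions):
    """M WAL files and N partitions yield M*N sub-files but only M recovery files."""
    return [pack_subfiles(split_wal(wal, partitions)) for wal in wals]
```

With three WAL files and two partitions, six sub-files are produced, yet only three recovery files need to be transferred to the file system.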
S705: The RS node 102 transmits the recovery file to the file system 300 for persistent storage and brings the partitions to be recovered back online.
When the RS node 102 brings a target partition to be recovered back online, it may read the recovery file corresponding to that partition from the file system 300 and bring the partition back online based on the data processing records in the recovery file. Alternatively, during data recovery the RS node 102 may store the generated recovery file in a cache, so that it can read the recovery file directly from the cache and use the data processing records in it that belong to the target partition to provide the client 200 with a data read service.
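As a rough illustration of bringing a partition back online from its records, the sketch below replays one partition's data recording area into an in-memory dictionary. The JSON-lines record format with `op`, `key`, and `value` fields, the `(offset, length)` index, and the function name `replay_partition` are assumptions for the example; the application does not specify a record encoding.

```python
import json


def replay_partition(blob, index, partition):
    """Rebuild a partition's in-memory state from its area in the recovery file.

    blob: recovery file contents; index: {partition: (offset, length)}.
    Each record is assumed to be one JSON object per line (hypothetical format).
    """
    offset, length = index[partition]
    state = {}
    for line in blob[offset:offset + length].splitlines():
        record = json.loads(line)
        if record["op"] == "put":
            state[record["key"]] = record["value"]
        elif record["op"] == "delete":
            state.pop(record["key"], None)
    return state
```

Only the bytes inside the target partition's area are parsed, so records belonging to other partitions in the same recovery file are never touched.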
The data recovery method provided by the embodiments of this application has been described above with reference to FIG. 1 to FIG. 7. The data recovery apparatus and the computing device that perform this data recovery are described next with reference to the accompanying drawings.
FIG. 8 is a schematic structural diagram of a data recovery apparatus provided by this application. The data recovery apparatus 800 may be applied to any node in the distributed database described above, such as the RS node 102 that resumes operation after a failure or the RS node 103 that has not failed, or to a device in the distributed database 100 that is dedicated to performing failure data recovery. The data recovery apparatus 800 includes:
an obtaining module 801, configured to obtain, after a service crash processing procedure is started, the write-ahead log (WAL) file created before the failure by a first partition server (RS) node in the distributed database, where the WAL file includes data processing records belonging to multiple partitions to be recovered in the first RS node;
a writing module 802, configured to write the data processing records of each partition to be recovered in the WAL file into a recovery file, where different data recording areas in the recovery file record the data processing records of different partitions to be recovered;
a transmission module 803, configured to transmit the recovery file to a file system for persistent storage.
In a possible implementation, the data recovery apparatus 800 is applied to the first RS node after it recovers from the failure (such as the RS node 102 in FIG. 1), or to another device in the distributed database, where the other device includes a second RS node in the distributed database that has not failed (such as the RS node 103 in FIG. 1) or a device in the distributed database dedicated to performing failure data recovery.
In a possible implementation, the recovery file includes index information, and the index information indicates the position offset, within the recovery file, of the data recording area corresponding to each partition to be recovered.
In a possible implementation, the index information specifically indicates the position offsets, within the recovery file, of the sub-files respectively corresponding to the multiple partitions to be recovered. Each sub-file stores the data processing records of a partition to be recovered, and different sub-files store different data processing records.
In a possible implementation, the data recovery apparatus 800 further includes:
a storage module 804, configured to store the recovery file in a cache before the recovery file is transmitted to the file system;
a service module 805, configured to serve clients of the distributed database by using the recovery file stored in the cache.
In a possible implementation, the apparatus further includes:
a data clearing module 806, configured to clear the recovery file stored in the cache after the recovery file is transmitted to the file system.
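The cache lifecycle handled by the storage, service, and data clearing modules can be sketched as a small class. The class name `RecoveryCache`, its method names, and the use of plain dictionaries as stand-ins for the cache and the file system are assumptions for illustration only.

```python
class RecoveryCache:
    """Keep a recovery file in cache until it has been persisted."""

    def __init__(self, file_system):
        self.file_system = file_system  # stand-in for the file system (a dict here)
        self.cache = {}

    def store(self, name, recovery_file):
        # Storage step: cache the file before it is transmitted.
        self.cache[name] = recovery_file

    def read(self, name):
        # Service step: serve from cache while the transfer is pending,
        # otherwise fall back to the persisted copy.
        if name in self.cache:
            return self.cache[name]
        return self.file_system[name]

    def flush(self, name):
        # Transmit first, then clear the cached copy.
        self.file_system[name] = self.cache[name]
        del self.cache[name]
```

Clearing the cache only after the write to the file system has completed means a reader always finds the recovery file in one place or the other.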
In a possible implementation, the obtaining module 801 is further configured to obtain a data recovery instruction sent by the master node, where the data recovery instruction instructs that data recovery be performed on the multiple partitions to be recovered in the first RS node.
The data recovery apparatus 800 according to the embodiments of this application may correspondingly perform the methods described in the embodiments of this application, and the above and other operations and/or functions of the modules of the data recovery apparatus 800 respectively implement the corresponding procedures of the methods performed by the RS node 102 or the RS node 103 in FIG. 5 and FIG. 7. For brevity, details are not repeated here.
FIG. 9 provides a computing device. As shown in FIG. 9, the computing device 900 may be, for example, the RS node 102 that resumes operation after a failure or the RS node 103 that has not failed in the foregoing embodiments, or a device in the distributed database 100 dedicated to performing failure data recovery. The computing device 900 may be used to implement the functions of the data recovery apparatus 800 in the embodiment shown in FIG. 8.
The computing device 900 includes a bus 901, a processor 902, and a memory 903. The processor 902 and the memory 903 communicate with each other through the bus 901.
The bus 901 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be classified into address buses, data buses, control buses, and so on. For ease of representation, only one thick line is used in FIG. 9, but this does not mean that there is only one bus or only one type of bus.
The processor 902 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), or other processors.
The memory 903 may include volatile memory, such as random access memory (RAM). The memory 903 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
The memory 903 stores executable program code, and the processor 902 executes this code to perform the data recovery method performed by the RS node 102 or the RS node 103 to which the data recovery apparatus 800 is applied.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any usable medium that a computing device can store data on, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state drive), or the like. The computer-readable storage medium includes instructions that instruct a computing device to perform the data recovery method described above.
An embodiment of this application further provides a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the procedures or functions described in the embodiments of this application are produced in whole or in part.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave).
The computer program product may be a software installation package. When any of the foregoing data recovery methods needs to be used, the computer program product can be downloaded and executed on a computing device.
The descriptions of the procedures or structures corresponding to the above drawings each have their own emphasis. For parts not detailed in one procedure or structure, refer to the related descriptions of the other procedures or structures.

Claims (17)

  1. A data recovery method, wherein the method comprises:
    after a service crash processing procedure (SCP) is started, obtaining a write-ahead log (WAL) file created before a failure by a first partition server (RS) node in a distributed database, wherein the WAL file includes data processing records belonging to multiple partitions to be recovered in the first RS node;
    writing the data processing records of each partition to be recovered in the WAL file into a recovery file, wherein different data recording areas in the recovery file record data processing records belonging to different partitions to be recovered;
    transmitting the recovery file to a file system for persistent storage.
  2. The method according to claim 1, wherein the method is performed by the first RS node after it recovers from the failure, or by another device in the distributed database, wherein the other device includes a second RS node in the distributed database that has not failed, or a device in the distributed database dedicated to performing failure data recovery.
  3. The method according to claim 1 or 2, wherein the recovery file includes index information, and the index information indicates the position offset, within the recovery file, of the data recording area corresponding to each partition to be recovered.
  4. The method according to claim 3, wherein the index information specifically indicates the position offsets, within the recovery file, of the sub-files respectively corresponding to the multiple partitions to be recovered, each sub-file stores the data processing records of a partition to be recovered, and different sub-files store different data processing records.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    before transmitting the recovery file to the file system, storing the recovery file in a cache;
    serving clients of the distributed database by using the recovery file stored in the cache.
  6. The method according to claim 5, wherein the method further comprises:
    after transmitting the recovery file to the file system, clearing the recovery file stored in the cache.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises:
    obtaining a data recovery instruction sent by a master node, wherein the data recovery instruction instructs that data recovery be performed on the multiple partitions to be recovered in the first RS node.
  8. A data recovery apparatus, wherein the apparatus comprises:
    an obtaining module, configured to obtain, after a service crash processing procedure (SCP) is started, a write-ahead log (WAL) file created before a failure by a first partition server (RS) node in a distributed database, wherein the WAL file includes data processing records belonging to multiple partitions to be recovered in the first RS node;
    a writing module, configured to write the data processing records of each partition to be recovered in the WAL file into a recovery file, wherein different data recording areas in the recovery file record data processing records belonging to different partitions to be recovered;
    a transmission module, configured to transmit the recovery file to a file system for persistent storage.
  9. The apparatus according to claim 8, wherein the apparatus is applied to the first RS node after it recovers from the failure, or to another device in the distributed database, wherein the other device includes a second RS node in the distributed database that has not failed, or a device in the distributed database dedicated to performing failure data recovery.
  10. The apparatus according to claim 8 or 9, wherein the recovery file includes index information, and the index information indicates the position offset, within the recovery file, of the data recording area corresponding to each partition to be recovered.
  11. The apparatus according to claim 10, wherein the index information specifically indicates the position offsets, within the recovery file, of the sub-files respectively corresponding to the multiple partitions to be recovered, each sub-file stores the data processing records of a partition to be recovered, and different sub-files store different data processing records.
  12. The apparatus according to any one of claims 8 to 11, wherein the apparatus further comprises:
    a storage module, configured to store the recovery file in a cache before the recovery file is transmitted to the file system;
    a service module, configured to serve clients of the distributed database by using the recovery file stored in the cache.
  13. The apparatus according to claim 12, wherein the apparatus further comprises:
    a data clearing module, configured to clear the recovery file stored in the cache after the recovery file is transmitted to the file system.
  14. The apparatus according to any one of claims 8 to 13, wherein the obtaining module is further configured to obtain a data recovery instruction sent by a master node, and the data recovery instruction instructs that data recovery be performed on the multiple partitions to be recovered in the first RS node.
  15. A computing device, comprising a processor and a memory;
    wherein the processor is configured to execute instructions stored in the memory, so that the computing device performs the method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, comprising instructions that, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
  17. A computer program product comprising instructions that, when run on a computing device, cause the computing device to perform the method according to any one of claims 1 to 7.
PCT/CN2021/084393 2021-01-13 2021-03-31 Data recovery method, apapratus and device, medium and program product WO2022151593A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180004239.4A CN115087966A (en) 2021-01-13 2021-03-31 Data recovery method, device, equipment, medium and program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202131001638 2021-01-13
IN202131001638 2021-01-13

Publications (1)

Publication Number Publication Date
WO2022151593A1 true WO2022151593A1 (en) 2022-07-21

Family

ID=82447780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084393 WO2022151593A1 (en) 2021-01-13 2021-03-31 Data recovery method, apapratus and device, medium and program product

Country Status (2)

Country Link
CN (1) CN115087966A (en)
WO (1) WO2022151593A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858252B (en) * 2023-02-21 2023-06-02 浙江智臾科技有限公司 Data recovery method, device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092903A (en) * 2011-11-07 2013-05-08 Sap股份公司 Database Log Parallelization
CN103729442A (en) * 2013-12-30 2014-04-16 华为技术有限公司 Method for recording event logs and database engine
CN104516985A (en) * 2015-01-15 2015-04-15 浪潮(北京)电子信息产业有限公司 Rapid mass data importing method based on HBase database
US9804936B1 (en) * 2016-06-16 2017-10-31 International Business Machines Corporation Relational database recovery
CN110019063A (en) * 2017-08-15 2019-07-16 厦门雅迅网络股份有限公司 Method, terminal device and the storage medium of calculate node data disaster tolerance playback
CN111221678A (en) * 2018-11-27 2020-06-02 阿里巴巴集团控股有限公司 Hbase data backup/recovery system, method and device and electronic equipment
CN110489274A (en) * 2019-07-11 2019-11-22 新华三大数据技术有限公司 Data back up method, device and interactive system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628274A (en) * 2023-07-25 2023-08-22 浙江锦智人工智能科技有限公司 Data writing method, device and medium for graph database
CN116628274B (en) * 2023-07-25 2023-09-22 浙江锦智人工智能科技有限公司 Data writing method, device and medium for graph database

Also Published As

Publication number Publication date
CN115087966A (en) 2022-09-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21918792

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21918792

Country of ref document: EP

Kind code of ref document: A1