CN117692469A - Method and device for archiving data of big data cluster and fault disaster recovery - Google Patents

Method and device for archiving data of big data cluster and fault disaster recovery

Info

Publication number
CN117692469A
Authority
CN
China
Prior art keywords
cluster
hdfs
juicefs
data
hdfs cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311543011.XA
Other languages
Chinese (zh)
Inventor
周朝卫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unihub China Information Technology Co Ltd
Original Assignee
Unihub China Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unihub China Information Technology Co Ltd filed Critical Unihub China Information Technology Co Ltd
Priority to CN202311543011.XA priority Critical patent/CN117692469A/en
Publication of CN117692469A publication Critical patent/CN117692469A/en
Pending legal-status Critical Current

Links

Abstract

The invention discloses a method and a device for data archiving and fault disaster recovery in a big data cluster, wherein the method comprises the following steps: constructing a JuiceFS cluster on the same hosts as the existing Hadoop cluster, with data storage between the JuiceFS cluster and the Hadoop cluster kept mutually independent; a file system hook of the HDFS cluster senses file system operations on the HDFS cluster and triggers a custom event handler to write the file system operation information into Kafka; the file system operation information is consumed from Kafka and synchronized to the JuiceFS cluster based on the Kafka consumer-group mode; files on the HDFS cluster that have already been archived to the JuiceFS cluster are periodically deleted. Because the JuiceFS cluster used for data archiving shares the Hadoop cluster hosts, host resources are saved; incremental changes of HDFS files are automatically perceived and synchronized to JuiceFS in real time; archiving of HDFS files supports parallel synchronization; and disaster recovery switching and fault recovery can be performed automatically.

Description

Method and device for archiving data of big data cluster and fault disaster recovery
Technical Field
The invention relates to the field of data processing of large data clusters, in particular to a method and a device for data archiving and fault disaster recovery of a large data cluster.
Background
1. Historical data in a big data cluster must be archived and cleaned up in a timely manner to avoid the negative impact of excessive data volume on NameNode performance. Over time the data volume grows gradually, and without timely archiving and cleanup the cost of storing and processing the data rises continuously. Periodically archiving and cleaning up historical data is therefore an important task for keeping a big data cluster running efficiently.
2. The big data platform is constructed based on Hadoop, and due to the huge data volume, timeliness and accuracy of backup may not be guaranteed. Backup of data in large data platforms is a complex task, especially if the amount of data is increasing. Timeliness and accuracy of the backup are important for the integrity and restorability of the data. Therefore, the large data platform needs to consider how to effectively perform data backup to ensure the security and reliability of the data.
3. The common practice for archiving the data of a big data cluster is to add a new Hadoop cluster dedicated to backup, and the number of nodes used for data archiving is very large. However, such huge clusters are generally used only for storing data, which inevitably wastes computing resources such as CPU and memory. How to reduce the cost of the backup cluster while guaranteeing the reliability of data backup is therefore a problem to be solved.
4. During a cluster upgrade or failure, a large data cluster will not be available, which can have some impact on business applications. The upgrading and fault handling of large data clusters is a complex process, and it is necessary to minimize interruption to the service while ensuring continuity and consistency of data. Thus, it is an important challenge for large data clusters to ensure that business applications do not break during upgrades and failures.
5. A NameNode fault has a large impact area: the NameNode host of the big data cluster stores the metadata, and once the NameNode fails, the data of the whole Hadoop cluster becomes unavailable. A mechanism is therefore needed to ensure that, in the event of a NameNode failure, other nodes can still provide computation and storage normally.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for data archiving and fault disaster recovery in a big data cluster.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in an embodiment of the present invention, a method for data archiving and fault disaster recovery of a big data cluster is provided, the method comprising:
constructing a juiceFS cluster on the same host of the existing Hadoop cluster, wherein data storage between the juiceFS cluster and the Hadoop cluster is mutually independent;
the file system hook of the HDFS cluster senses the occurrence of file system operation on the HDFS cluster, and triggers the self-defined event processor to write the file system operation information on the HDFS cluster into Kafka;
consuming file system operation information on the HDFS cluster from the Kafka, and synchronizing to the JuiceFS cluster based on the Kafka consumer group mode;
files on the HDFS cluster that have been archived to the JuiceFS cluster are periodically deleted.
Further, a proxy layer is added above the HDFS cluster and the juiceFS cluster, and the proxy layer routes requests to different storage systems according to the states of the HDFS cluster, so that disaster recovery switching and fault recovery are realized.
Further, depending on the state of the HDFS cluster, the proxy layer routes requests to different storage systems, including:
if the HDFS cluster is normal, the proxy layer sends the request to the corresponding node of the HDFS cluster;
if the HDFS cluster is not available, the proxy layer sends the request to the corresponding node of the JuiceFS cluster.
Further, the implementation logic of disaster recovery switching is as follows:
when the HDFS cluster is monitored to be unavailable, the agent layer automatically switches to send a request to the JuiceFS cluster;
during disaster recovery handover, the proxy layer records operations on the JuiceFS cluster.
Further, the implementation logic of fault recovery is as follows:
when the HDFS cluster is monitored to be recovered to be normal, the proxy layer is automatically switched to send a request to the HDFS cluster;
reversely synchronizing the file on the juiceFS cluster to the HDFS cluster according to the operation record of the agent layer on the juiceFS cluster;
during the process in which files on the JuiceFS cluster are reversely synchronized to the HDFS cluster, the proxy layer enters a safe mode: all file queries and computations based on the HDFS cluster query both the HDFS cluster and the JuiceFS cluster simultaneously, and the results are de-duplicated by file name before being returned to the querying client;
after the files on the JuiceFS cluster have been reversely synchronized to the HDFS cluster, the proxy layer exits the safe mode, and all file queries and computations based on the HDFS cluster need only query the HDFS cluster.
In an embodiment of the present invention, a device for archiving data of a big data cluster and fault tolerance is further provided, where the device includes:
the method comprises the steps that a juiceFS cluster construction module is used for constructing a juiceFS cluster on the same host of an existing Hadoop cluster, and data storage between the juiceFS cluster and the Hadoop cluster is mutually independent;
the data synchronization module is used for enabling a file system hook of the HDFS cluster to sense the occurrence of file system operation on the HDFS cluster, and triggering a self-defined event processor to write file system operation information on the HDFS cluster into Kafka; consuming file system operation information on the HDFS cluster from the Kafka, and synchronizing to the JuiceFS cluster based on the Kafka consumer group mode; files on the HDFS cluster that have been archived to the JuiceFS cluster are periodically deleted.
Further, the apparatus further comprises:
and the fault disaster recovery module is used for adding a proxy layer on the HDFS cluster and the juiceFS cluster, and routing the request to different storage systems by the proxy layer according to the state of the HDFS cluster to realize disaster recovery switching and fault recovery.
Further, depending on the state of the HDFS cluster, the proxy layer routes requests to different storage systems, including:
if the HDFS cluster is normal, the proxy layer sends the request to the corresponding node of the HDFS cluster;
if the HDFS cluster is not available, the proxy layer sends the request to the corresponding node of the JuiceFS cluster.
Further, the implementation logic of disaster recovery switching is as follows:
when the HDFS cluster is monitored to be unavailable, the agent layer automatically switches to send a request to the JuiceFS cluster;
during disaster recovery handover, the proxy layer records operations on the JuiceFS cluster.
Further, the implementation logic of fault recovery is as follows:
when the HDFS cluster is monitored to be recovered to be normal, the proxy layer is automatically switched to send a request to the HDFS cluster;
reversely synchronizing the file on the juiceFS cluster to the HDFS cluster according to the operation record of the agent layer on the juiceFS cluster;
during the process in which files on the JuiceFS cluster are reversely synchronized to the HDFS cluster, the proxy layer enters a safe mode: all file queries and computations based on the HDFS cluster query both the HDFS cluster and the JuiceFS cluster simultaneously, and the results are de-duplicated by file name before being returned to the querying client;
after the files on the JuiceFS cluster have been reversely synchronized to the HDFS cluster, the proxy layer exits the safe mode, and all file queries and computations based on the HDFS cluster need only query the HDFS cluster.
In an embodiment of the present invention, a computer device is further provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the foregoing method for data archiving and fault disaster recovery of a big data cluster.
In an embodiment of the present invention, a computer-readable storage medium is also provided, which stores a computer program for performing the method for data archiving and fault disaster recovery of a big data cluster.
The beneficial effects are that:
1. the method constructs a JuiceFS cluster sharing the Hadoop cluster hosts, used for data archiving and disaster recovery, saving host resources.
2. The HDFS cluster provides a mechanism of file system hooks, which can automatically sense the occurrence of file system operation on the HDFS cluster and synchronize the file system operation information on the HDFS cluster to the JuiceFS cluster in real time.
3. The method and the device periodically delete files on the HDFS cluster that have already been archived to the JuiceFS cluster, thereby improving performance.
4. According to the invention, the file system operation information on the HDFS cluster is synchronized to the JuiceFS cluster, and based on the Kafka consumer group mode, the parallel synchronization of large data volume is realized.
5. The invention introduces the proxy layer, can automatically perform disaster recovery switching and fault recovery, and can achieve non-perception switching.
Drawings
FIG. 1 is a flow chart of a method for data archiving and fault tolerance of a big data cluster of the present invention;
FIG. 2 is a diagram of an architecture of the present invention incorporating a proxy layer;
FIG. 3 is a schematic diagram of a device for data archiving and fault tolerance of a big data cluster according to the present invention;
fig. 4 is a schematic diagram of the computer device structure of the present invention.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments, with the understanding that these embodiments are merely provided to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that embodiments of the invention may be implemented as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a method and a device for archiving data and fault disaster recovery of a big data cluster are provided, a juiceFS cluster sharing a Hadoop cluster host is constructed, and the method and the device are used for archiving data and disaster recovery and saving host resources; the HDFS cluster provides a mechanism of a file system hook, which can automatically sense the occurrence of file system operation on the HDFS cluster and synchronize the file system operation information on the HDFS cluster to the JuiceFS cluster in real time; files on the HDFS cluster which are archived to the JuiceFS cluster are deleted regularly, so that the performance is improved; the file system operation information on the HDFS cluster is synchronized to the juiceFS cluster, and based on the Kafka consumer group mode, the parallel synchronization of large data volume is realized; the agent layer is introduced, so that disaster recovery switching and fault recovery can be automatically performed, and no-perception switching can be realized; when the HDFS cluster is recovered to be normal, the data can be recovered from the JuiceFS cluster, and the data can be ensured not to be lost.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments thereof.
FIG. 1 is a flow chart of a method for archiving data and fault tolerance of a large data cluster according to the present invention. As shown in fig. 1, the method includes:
s1, constructing a JuiceFS cluster on the same hosts as the existing Hadoop cluster, wherein data storage between the JuiceFS cluster and the Hadoop cluster is mutually independent;
s2, a file system hook of the HDFS cluster senses the occurrence of file system operation on the HDFS cluster, and triggers a self-defined event processor to write file system operation information on the HDFS cluster into Kafka;
s3, consuming file system operation information on the HDFS cluster from the Kafka, and synchronizing to the JuiceFS cluster based on the Kafka consumer group mode;
s4, periodically deleting the files on the HDFS cluster which is archived to the JuiceFS cluster.
And S5, adding a proxy layer on the HDFS cluster and the juiceFS cluster, and routing the request to different storage systems by the proxy layer according to the state of the HDFS cluster to realize disaster recovery switching and fault recovery.
It should be noted that although the operations of the method of the present invention are described in a particular order in the above embodiments and the accompanying drawings, this does not require or imply that the operations must be performed in the particular order or that all of the illustrated operations be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
In order to more clearly explain the above-mentioned method for archiving data of big data clusters and fault tolerance, a specific embodiment is described below, however, it should be noted that this embodiment is only for better illustrating the present invention and is not meant to limit the present invention unduly.
Examples:
1. juiceFS cluster for constructing shared Hadoop cluster host
An independent JuiceFS cluster is built on the same host as the existing Hadoop cluster. Through configuration and adjustment, the data storage between the juiceFS cluster and the Hadoop cluster is ensured to be independent of each other and not to interfere with each other. The JuiceFS cluster is used for archiving and disaster recovery of data.
(1) Dividing hardware resources: the disks on each host are planned, one part being used for data storage of the HDFS cluster (or Hadoop cluster) and the other part for data storage of the JuiceFS cluster. This ensures that the data stores of the two clusters are isolated from each other. The HDFS cluster is service-oriented, so higher-performance disks, for example SSD disks, are used for its data storage. The JuiceFS cluster is used for data archiving, where disk-performance requirements are modest, so ordinary SATA disks are used for its data storage. Meanwhile, since the JuiceFS cluster is used for data archiving and disaster recovery, larger-capacity disk space can be allocated to it. For example, on each node a 1 TB disk is allocated for the data storage of the HDFS cluster and a 10 TB disk for the data storage of the JuiceFS cluster.
(2) Software configuration: the software that configures the Hadoop cluster and the JuiceFS cluster ensures that they run on the same host and use different directories or storage paths to store data. For example, a default storage path of an HDFS cluster (e.g.,/Hadoop/data) may be used in a Hadoop cluster, while a default storage path of a JuiceFS cluster (e.g.,/JuiceFS/data) may be used in a JuiceFS cluster.
(3) Rights and access control: proper authority and access control are set, so that data between the Hadoop cluster and the JuiceFS cluster are isolated from each other, and unexpected data coverage or access conflict can be prevented.
(4) Resource consumption control of the JuiceFS cluster: the JuiceFS cluster adopts a caching mechanism to improve performance. Because the juiceFS cluster is mainly used for data archiving, the probability of reading is small, and therefore, the cache of the juiceFS cluster is closed, and the occupation of the node memory is reduced.
Through the optimization measures, the Hadoop clusters are used, meanwhile, the juiceFS clusters are built on the same host, and separation and mutual noninterference of data storage are ensured. Thus, the existing hardware resources can be utilized to the maximum extent, and the two clusters can be managed and maintained conveniently.
2. HDFS cluster synchronizes data to JuiceFS cluster in real time
The HDFS cluster synchronizes data to the JuiceFS cluster in real time; the following three problems need to be solved:
how the HDFS cluster perceives incremental changes of files, i.e., file system operations, and synchronizes them to the JuiceFS cluster in real time;
when archived files on the HDFS cluster should be deleted;
how to achieve synchronization of large data volumes.
(1) How HDFS clusters perceive file delta changes
HDFS clusters provide a mechanism of file system hooks (File System Hooks). Hooks can be configured to trigger a custom event handler that implements specific logic to be executed when file system operations occur; here, file system operations such as file creation, deletion, and append are written to the message queue Kafka.
The following is a simple example to illustrate how to configure file system hooks on an HDFS cluster:
step 1: creating a custom event handler class:
first, a custom event handler class is created that will define the logic to be executed when a file system operation occurs. This class requires implementation of the org.apoche.hadoop.hdfs.server.naminode.inoditributeProvider interface (an interface provided by the Hadoop official for defining the logic to be executed when file system operations occur) provided by the HDFS cluster. In the custom event handler, the operation information of the file system is obtained by implementing the getAttributes method (a method for obtaining operation information of the file system, such as operation information of adding, deleting, updating, etc. of a file), and the operation information of the file system is written into the message queue Kafka.
Step 2: compiling and packaging custom event handlers
Custom event handler classes are compiled and packaged to generate a JAR file for subsequent configuration on the HDFS cluster.
Step 3: file system hooks for configuring HDFS clusters
In the configuration file (hdfs-site.xml) of the HDFS cluster, add the following configuration item to enable the file system hook and specify the class name of the custom event handler:
<property>
<name>dfs.namenode.inode.attributes.provider.class</name>
<value>com.example.MyEventHandler</value>
</property>
note that com.example.MyEventHandler should be replaced with the fully qualified class name of the custom event handler class.
Step 4: adding a JAR file of a custom event handler to a class path of the HDFS cluster:
Copy the packaged JAR file of the custom event handler into the classpath of the HDFS cluster, ensuring that the HDFS cluster can load the JAR file.
Step 5: restarting HDFS cluster services
After the above configuration is completed, the HDFS cluster service is restarted, validating the configuration.
After the configuration is completed, whenever a file system operation (such as creating, modifying, or deleting a file) occurs on the HDFS cluster, the file system hook of the HDFS cluster triggers the logic defined in the custom event handler, writing the file system operation information into the message queue Kafka.
(2) Data synchronization to a JuiceFS cluster
The data synchronization program consumes the data of Kafka, acquires file system operation information (such as creating, modifying, deleting files and the like) on the HDFS cluster, acquires incremental changes of the files according to the operation information of the file system, and synchronizes the files with the incremental changes to the JuiceFS cluster.
Due to the immutable nature of files on the HDFS cluster, for existing files the default behavior is to overwrite (re-synchronize) only when the file sizes differ. Optionally, the source-file modification time can also be taken into account, or byte streams can be compared, to achieve synchronization with extremely strict accuracy requirements.
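The synchronization decision described above can be sketched as a small predicate. This is an illustrative model under assumed data structures (file metadata as dicts with `size` and `mtime`), not the patent's actual code:

```python
def needs_sync(src, dst, compare_mtime=False):
    """Decide whether an HDFS file must be (re)synchronized to JuiceFS.

    src and dst are dicts with 'size' and 'mtime'; dst is None when
    the file does not yet exist on the JuiceFS side.
    """
    if dst is None:
        return True               # new file: always copy
    if src["size"] != dst["size"]:
        return True               # default rule: sizes differ -> overwrite
    if compare_mtime and src["mtime"] != dst["mtime"]:
        return True               # stricter optional check
    return False

print(needs_sync({"size": 10, "mtime": 1}, None))                      # True
print(needs_sync({"size": 10, "mtime": 1}, {"size": 10, "mtime": 1}))  # False
print(needs_sync({"size": 10, "mtime": 2},
                 {"size": 10, "mtime": 1}, compare_mtime=True))        # True
```

A byte-stream comparison would be a third, most expensive tier of the same decision, applied only when exact accuracy is required.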
(3) Periodically deleting archived files on an HDFS cluster
The premise for deletion is that the file has been archived to the JuiceFS cluster.
The specific flow is as follows:
step 1: a delete time policy for files on the HDFS cluster is specified. For example, files that are more than 3 months old are deleted periodically.
Step 2: and searching the file from the juiceFS cluster by using the file on the HDFS cluster, comparing the information such as the size, the modification time and the like of the file, and judging whether the file is archived or not.
Step 3: for files that exceed the time threshold and have been archived, a file delete operation is performed.
(4) How to achieve synchronization of large data volumes
In a large data volume scene, in order to improve the real-time synchronization performance of data, a multi-machine concurrency mechanism is adopted, and the specific implementation process is as follows:
and the data synchronization program starts a plurality of threads or processes, and consumes file system operation information from the Kafka by using the same consumption group. With the consumer group model of Kafka, concurrent synchronization can be achieved by partitioning the Topic of each Kafka into partitions corresponding to one consuming thread or process.
3. Fault disaster recovery
The method for realizing fault disaster recovery between the HDFS cluster and the juiceFS cluster is to introduce a proxy layer and route the request to a corresponding storage system according to the state of the HDFS cluster.
Since the JuiceFS cluster supports Hadoop SDKs. Therefore, the storage of the HDFS cluster, the query program, the calculation engines such as Spark, flank and the like can realize the switching of file paths only by modifying the domain name of the file system without modifying codes.
For example, file path of HDFS cluster: hdfs:// enclosed: 8082/data/a.log, switch to the file system of the JuiceFS cluster, only need to modify domain name: the value of the juicefs is/os is 9090/data/a.log.
The following is a detailed description based on this idea:
(1) Architecture overview:
as shown in fig. 2, a proxy layer is added above the HDFS cluster and the JuiceFS cluster, the proxy layer provides services to the outside in a unified way, and operations such as adding, deleting, querying, modifying and the like of data directly provide services through the proxy layer, but do not directly interface with the HDFS cluster or the JuiceFS cluster. The agent layer is used for receiving requests of data inquiry, addition, deletion and the like and routing the requests to different storage systems according to the states of the HDFS clusters. The agent layer can be an independent service or component, and can be implemented by selecting a proper technical stack according to actual situations.
(2) Monitoring the state of the HDFS cluster:
the proxy layer needs to be able to monitor the state of the HDFS cluster in order to determine whether it is operating properly, and may use the monitoring index, heartbeat mechanism, or other monitoring mechanism provided by the HDFS cluster to detect the availability of the HDFS cluster.
(3) Data routing logic:
depending on the state of the HDFS cluster, the proxy layer may take the following logic to route the data:
if the HDFS cluster is normal: the request is sent to the HDFS cluster. The proxy layer may send a request to the corresponding node of the HDFS to ensure that data is written into the HDFS cluster.
If HDFS clusters are not available: the request is sent to the JuiceFS cluster. The proxy layer may send a request to the corresponding node of the JuiceFS cluster to ensure that data is written into the JuiceFS cluster.
(4) Disaster recovery switching:
in the proxy layer, logic is needed to implement disaster recovery switching and failure recovery to ensure that the system can process data correctly after failure or recovery:
when it is detected that the HDFS cluster is not available, the proxy layer should immediately switch to send the request to the JuiceFS cluster. This may be achieved by dynamically adjusting the routing logic or configuration. Meanwhile, the operations of the agent layer on the juiceFS cluster, such as the operations of adding, deleting, adding and the like of the file, are recorded in the disaster recovery switching period.
(5) Fault recovery:
once the HDFS cluster is restored to normal, the proxy layer should automatically switch back to sending the request to the HDFS. The agent layer may use the monitoring metrics provided by the HDFS, a heartbeat mechanism, or other monitoring mechanism to detect an availability restoration of the HDFS cluster and adjust the data routing logic accordingly.
However, during disaster recovery, all data writes are directly written to the JuiceFS cluster, if the route switches to the HDFS cluster, since the HDFS cluster has no data writes during disaster recovery. Thus, after failure recovery, file queries and computations based on HDFS clusters will be unavailable due to data loss.
To solve this problem, the following operations are performed:
(a) During disaster recovery switching, operations of the agent layer on the juiceFS cluster, such as operations of adding, deleting and adding files, are recorded. And reversely synchronizing the files on the juiceFS cluster to the HDFS cluster according to the operation record.
(b) During the reverse synchronization of files on the JuiceFS cluster to the HDFS cluster, the proxy layer enters a secure mode, i.e., during this period, all the files based on the HDFS cluster need to be queried and calculated simultaneously, then the files are de-duplicated according to the file name and returned to the querying client, so that it can be ensured that the data is not lost and not repeated after the failure is recovered.
(c) After the file on the JuiceFS cluster is reversely synchronized to the HDFS cluster, the proxy layer exits the secure mode, and all file querying and computing based on the HDFS cluster are switched to querying only the HDFS cluster.
(6) Monitoring and alarming:
in order to discover faults and anomalies in time, the agent layer should be provided with monitoring and alarm mechanisms. By monitoring the running state of the agent layer, the data transmission condition and the availability of the HDFS cluster, the problems can be found and processed in time.
Based on the same inventive concept, the invention also provides a device for archiving the data of the big data cluster and fault disaster recovery. The implementation of the device can be referred to as implementation of the above method, and the repetition is not repeated. The term "module" as used below may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 3 is a schematic diagram of the device for data archiving and fault disaster recovery of a big data cluster according to the present invention. As shown in FIG. 3, the device includes:
the JuiceFS cluster construction module 101, configured to construct a JuiceFS cluster on the same hosts as an existing Hadoop cluster, wherein the data storage of the JuiceFS cluster and that of the Hadoop cluster are mutually independent;
the data synchronization module 102, configured to use a file system hook of the HDFS cluster to sense the occurrence of file system operations on the HDFS cluster and trigger a custom event processor to write file system operation information on the HDFS cluster into Kafka; to consume the file system operation information of the HDFS cluster from Kafka and synchronize it to the JuiceFS cluster based on the Kafka consumer group mode; and to periodically delete files on the HDFS cluster that have already been archived to the JuiceFS cluster;
the fault disaster recovery module 103, configured to add a proxy layer above the HDFS cluster and the JuiceFS cluster, wherein the proxy layer routes requests to different storage systems according to the state of the HDFS cluster, thereby implementing disaster recovery switching and fault recovery.
Depending on the state of the HDFS cluster, the proxy layer routes requests to different storage systems as follows:
if the HDFS cluster is normal, the proxy layer sends the request to the corresponding node of the HDFS cluster;
if the HDFS cluster is unavailable, the proxy layer sends the request to the corresponding node of the JuiceFS cluster.
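The routing rule above can be sketched as a small dispatcher. This is an illustrative assumption of one possible structure; the client objects and the `read` method are hypothetical placeholders for whatever file-system interface the proxy actually exposes.

```python
class StorageRouter:
    """Sketch of the proxy layer's routing rule: requests go to the HDFS
    cluster while it is healthy, and to the JuiceFS cluster when HDFS is
    unavailable. Names are illustrative, not taken from the patent."""

    def __init__(self, hdfs_client, juicefs_client):
        self.hdfs = hdfs_client
        self.juicefs = juicefs_client
        self.hdfs_healthy = True  # flipped by the monitoring component

    def backend(self):
        # Select the active storage system according to HDFS state.
        return self.hdfs if self.hdfs_healthy else self.juicefs

    def read(self, path):
        # Every request is forwarded to whichever cluster is active.
        return self.backend().read(path)
```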
The implementation logic of disaster recovery switching is as follows:
when the HDFS cluster is detected to be unavailable, the proxy layer automatically switches to sending requests to the JuiceFS cluster;
during the disaster recovery switch, the proxy layer records the operations performed on the JuiceFS cluster.
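The operation record kept during the switch can be sketched as a small journal that later drives reverse synchronization. The operation names and the replay rule below are illustrative assumptions; the patent does not specify the record's format.

```python
class FailoverJournal:
    """Sketch of the operation record the proxy layer keeps while requests
    are served by JuiceFS, so affected files can later be reversely
    synchronized to HDFS. Field names are hypothetical."""

    MUTATIONS = ("create", "append", "delete")

    def __init__(self):
        self.entries = []

    def record(self, op, path):
        # Only mutating operations matter for reverse synchronization;
        # reads are ignored.
        if op in self.MUTATIONS:
            self.entries.append((op, path))

    def paths_to_sync(self):
        # Replay the journal in order: a file created and then deleted
        # during the outage needs no copy back to HDFS.
        live = set()
        for op, path in self.entries:
            if op == "delete":
                live.discard(path)
            else:
                live.add(path)
        return sorted(live)
```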
The implementation logic of fault recovery is as follows:
when the HDFS cluster is detected to have recovered, the proxy layer automatically switches back to sending requests to the HDFS cluster;
the files on the JuiceFS cluster are reversely synchronized to the HDFS cluster according to the proxy layer's record of operations on the JuiceFS cluster;
while files on the JuiceFS cluster are being reversely synchronized to the HDFS cluster, the proxy layer enters a safe mode: for all file queries and computations based on the HDFS cluster, both the HDFS cluster and the JuiceFS cluster are queried, and the results are de-duplicated by file name before being returned to the querying client;
after the files on the JuiceFS cluster have been reversely synchronized to the HDFS cluster, the proxy layer exits the safe mode, and all file queries and computations based on the HDFS cluster query only the HDFS cluster.
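The safe-mode merge of the two listings can be sketched as follows. This is a minimal illustration of de-duplication by file name, assuming HDFS results take precedence; the function name and preference order are assumptions, not specified by the patent.

```python
def safe_mode_listing(hdfs_files, juicefs_files):
    """Sketch of the proxy layer's safe-mode query: while reverse
    synchronization is running, both clusters are queried and the combined
    result is de-duplicated by file name. The HDFS copy is listed first,
    so it wins when a file exists on both clusters."""
    seen, merged = set(), []
    for name in list(hdfs_files) + list(juicefs_files):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```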
It should be noted that while several modules of the device for data archiving and fault disaster recovery of a big data cluster are mentioned in the above detailed description, this partitioning is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more modules described above may be embodied in one module; conversely, the features and functions of one module described above may be further divided among a plurality of modules.
Based on the foregoing inventive concept, as shown in FIG. 4, the present invention further proposes a computer device 200, including a memory 210, a processor 220, and a computer program 230 stored in the memory 210 and executable on the processor 220, wherein the processor 220 implements the foregoing method for data archiving and fault disaster recovery of a big data cluster when executing the computer program 230.
Based on the foregoing inventive concept, the present invention also proposes a computer-readable storage medium storing a computer program for executing the foregoing method for data archiving and fault disaster recovery of a big data cluster.
The method and device for data archiving and fault disaster recovery of a big data cluster provided by the invention have the following advantages:
1. A JuiceFS cluster is constructed on hosts shared with the Hadoop cluster for data archiving and disaster recovery, saving host resources.
2. The file system hook mechanism provided by the HDFS cluster automatically senses the occurrence of file system operations on the HDFS cluster, and the file system operation information is synchronized to the JuiceFS cluster in real time.
3. Files on the HDFS cluster that have already been archived to the JuiceFS cluster are deleted periodically, improving performance.
4. The file system operation information on the HDFS cluster is synchronized to the JuiceFS cluster based on the Kafka consumer group mode, enabling parallel synchronization of large data volumes.
5. The introduced proxy layer performs disaster recovery switching and fault recovery automatically, achieving transparent switching.
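The parallel synchronization in point 4 relies on Kafka's partitioning: events for the same HDFS file must land in the same partition so that one consumer-group member applies them to JuiceFS in order. The sketch below uses a stable MD5-based hash as a stand-in; Kafka's actual default partitioner uses a different hash, and the function name is an illustrative assumption.

```python
import hashlib

def assign_partition(file_path, num_partitions):
    """Sketch of keying HDFS file-system events by file path so that all
    events for one file map to the same Kafka partition, preserving
    per-file ordering while a consumer group processes partitions in
    parallel. MD5 here is a deterministic stand-in, not Kafka's
    partitioner."""
    digest = hashlib.md5(file_path.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

With, say, 8 partitions and 8 consumer-group members, events for different files are applied to JuiceFS concurrently, while successive operations on one file never race each other.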
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments; the division into aspects is made merely for convenience of description and does not imply that features of these aspects cannot be used to advantage in combination. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
It should be apparent that various modifications or variations can be made to the present invention by those skilled in the art, based on its technical solutions, without any inventive effort.

Claims (12)

1. A method for data archiving and fault disaster recovery of a big data cluster, the method comprising:
constructing a JuiceFS cluster on the same hosts as an existing Hadoop cluster, wherein the data storage of the JuiceFS cluster and that of the Hadoop cluster are mutually independent;
sensing, by a file system hook of the HDFS cluster, the occurrence of file system operations on the HDFS cluster, and triggering a custom event processor to write file system operation information on the HDFS cluster into Kafka;
consuming the file system operation information of the HDFS cluster from Kafka and synchronizing it to the JuiceFS cluster based on the Kafka consumer group mode; and
periodically deleting files on the HDFS cluster that have already been archived to the JuiceFS cluster.
2. The method for data archiving and fault disaster recovery of a big data cluster according to claim 1, wherein a proxy layer is added above the HDFS cluster and the JuiceFS cluster, and the proxy layer routes requests to different storage systems according to the state of the HDFS cluster, thereby implementing disaster recovery switching and fault recovery.
3. The method for data archiving and fault disaster recovery of a big data cluster according to claim 2, wherein the proxy layer routing requests to different storage systems according to the state of the HDFS cluster comprises:
if the HDFS cluster is normal, the proxy layer sends the request to the corresponding node of the HDFS cluster;
if the HDFS cluster is unavailable, the proxy layer sends the request to the corresponding node of the JuiceFS cluster.
4. The method for data archiving and fault disaster recovery of a big data cluster according to claim 2, wherein the implementation logic of the disaster recovery switching is as follows:
when the HDFS cluster is detected to be unavailable, the proxy layer automatically switches to sending requests to the JuiceFS cluster;
during the disaster recovery switch, the proxy layer records the operations performed on the JuiceFS cluster.
5. The method for data archiving and fault disaster recovery of a big data cluster according to claim 2, wherein the implementation logic of the fault recovery is as follows:
when the HDFS cluster is detected to have recovered, the proxy layer automatically switches back to sending requests to the HDFS cluster;
the files on the JuiceFS cluster are reversely synchronized to the HDFS cluster according to the proxy layer's record of operations on the JuiceFS cluster;
while files on the JuiceFS cluster are being reversely synchronized to the HDFS cluster, the proxy layer enters a safe mode: for all file queries and computations based on the HDFS cluster, both the HDFS cluster and the JuiceFS cluster are queried, and the results are de-duplicated by file name before being returned to the querying client;
after the files on the JuiceFS cluster have been reversely synchronized to the HDFS cluster, the proxy layer exits the safe mode, and all file queries and computations based on the HDFS cluster query only the HDFS cluster.
6. An apparatus for data archiving and fault disaster recovery of a big data cluster, the apparatus comprising:
a JuiceFS cluster construction module, configured to construct a JuiceFS cluster on the same hosts as an existing Hadoop cluster, wherein the data storage of the JuiceFS cluster and that of the Hadoop cluster are mutually independent; and
a data synchronization module, configured to use a file system hook of the HDFS cluster to sense the occurrence of file system operations on the HDFS cluster and trigger a custom event processor to write file system operation information on the HDFS cluster into Kafka; to consume the file system operation information of the HDFS cluster from Kafka and synchronize it to the JuiceFS cluster based on the Kafka consumer group mode; and to periodically delete files on the HDFS cluster that have already been archived to the JuiceFS cluster.
7. The apparatus for data archiving and fault disaster recovery of a big data cluster according to claim 6, further comprising:
a fault disaster recovery module, configured to add a proxy layer above the HDFS cluster and the JuiceFS cluster, wherein the proxy layer routes requests to different storage systems according to the state of the HDFS cluster, thereby implementing disaster recovery switching and fault recovery.
8. The apparatus for data archiving and fault disaster recovery of a big data cluster according to claim 7, wherein the proxy layer routing requests to different storage systems according to the state of the HDFS cluster comprises:
if the HDFS cluster is normal, the proxy layer sends the request to the corresponding node of the HDFS cluster;
if the HDFS cluster is unavailable, the proxy layer sends the request to the corresponding node of the JuiceFS cluster.
9. The apparatus for data archiving and fault disaster recovery of a big data cluster according to claim 7, wherein the implementation logic of the disaster recovery switching is as follows:
when the HDFS cluster is detected to be unavailable, the proxy layer automatically switches to sending requests to the JuiceFS cluster;
during the disaster recovery switch, the proxy layer records the operations performed on the JuiceFS cluster.
10. The apparatus for data archiving and fault disaster recovery of a big data cluster according to claim 7, wherein the implementation logic of the fault recovery is as follows:
when the HDFS cluster is detected to have recovered, the proxy layer automatically switches back to sending requests to the HDFS cluster;
the files on the JuiceFS cluster are reversely synchronized to the HDFS cluster according to the proxy layer's record of operations on the JuiceFS cluster;
while files on the JuiceFS cluster are being reversely synchronized to the HDFS cluster, the proxy layer enters a safe mode: for all file queries and computations based on the HDFS cluster, both the HDFS cluster and the JuiceFS cluster are queried, and the results are de-duplicated by file name before being returned to the querying client;
after the files on the JuiceFS cluster have been reversely synchronized to the HDFS cluster, the proxy layer exits the safe mode, and all file queries and computations based on the HDFS cluster query only the HDFS cluster.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1-5 when executing the computer program.
12. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for performing the method of any one of claims 1-5.
CN202311543011.XA 2023-11-20 2023-11-20 Method and device for archiving data of big data cluster and fault disaster recovery Pending CN117692469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311543011.XA CN117692469A (en) 2023-11-20 2023-11-20 Method and device for archiving data of big data cluster and fault disaster recovery


Publications (1)

Publication Number Publication Date
CN117692469A true CN117692469A (en) 2024-03-12

Family

ID=90136186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311543011.XA Pending CN117692469A (en) 2023-11-20 2023-11-20 Method and device for archiving data of big data cluster and fault disaster recovery

Country Status (1)

Country Link
CN (1) CN117692469A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination