CN110515557B - Cluster management method, device and equipment and readable storage medium - Google Patents

Cluster management method, device and equipment and readable storage medium

Info

Publication number
CN110515557B
Authority
CN
China
Prior art keywords
node
metadata
root
mode
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910785358.2A
Other languages
Chinese (zh)
Other versions
CN110515557A (en)
Inventor
王新忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Inspur Data Technology Co Ltd
Original Assignee
Beijing Inspur Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Inspur Data Technology Co Ltd filed Critical Beijing Inspur Data Technology Co Ltd
Priority to CN201910785358.2A priority Critical patent/CN110515557B/en
Publication of CN110515557A publication Critical patent/CN110515557A/en
Application granted granted Critical
Publication of CN110515557B publication Critical patent/CN110515557B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/1438Restarting or rejuvenating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Abstract

The application discloses a cluster management method, device, equipment and computer-readable storage medium, wherein the method includes the following steps: adding a newly received I/O request to a pending linked list; controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node; reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node; controlling the write cache module to open the disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions; and issuing the I/O requests in the pending linked list. According to the technical solution disclosed by the application, all services of the failed node are taken over by means of the surviving node and the transaction module and write cache module in the metadata of the surviving node, so that high availability of the full flash storage system is achieved and the reliability of the full flash storage system is improved.

Description

Cluster management method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of full flash storage technologies, and in particular, to a cluster management method, apparatus, device, and computer-readable storage medium.
Background
With the development of information technology, full flash storage systems, which feature strong processing capability and good scalability and maintainability, have been widely adopted. As the underlying foundation of related computer services, a storage system has high reliability requirements.
For a full flash storage system, metadata is the most important part of the system, and there is currently no effective way to achieve high availability of a full flash storage system from a metadata perspective.
In summary, how to achieve high availability of a full flash storage system from the perspective of metadata, so as to improve its reliability, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, an object of the present application is to provide a cluster management method, apparatus, device and computer readable storage medium to achieve high availability of a full flash storage system from the perspective of metadata, thereby improving reliability thereof.
In order to achieve the above purpose, the present application provides the following technical solutions:
a cluster management method is applied to a full flash storage system based on a cluster, and comprises the following steps:
when a failed node exists in the cluster, adding an I/O request newly received by a surviving node to a pending linked list;
controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node;
reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node;
controlling the write cache module to open a disk-flushing switch, and controlling the transaction module to roll back and redo incomplete transactions;
and issuing the I/O requests in the pending linked list.
Preferably, before reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node, the method further includes:
dividing a preset region from the logical address space of the disk as the root zone of the metadata, and storing the root node of the metadata in the root zone.
Preferably, storing the root node of the metadata in the root zone includes:
storing the root node in the root zone in the form of a double copy.
Preferably, the root node includes a LunID, a CRC check value and a MagicNumber.
Preferably, the method further includes:
when the failed node recovers to normal, adding a newly received I/O request to the pending linked list;
waiting for the write cache module in the metadata of the surviving node to complete the flushing tasks in progress, and removing, from the metadata read cache module, the read cache data whose master node is the failed node;
switching the write mode of the metadata to a mirror mode, and switching the master node mode to a mode that includes the recovered node;
and synchronizing the write mode and the master node mode to the recovered node, and the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory.
Preferably, the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory includes:
the failed node traversing the logical addresses of the disk and reading the root zone of the metadata whose master node is the failed node into its own memory, wherein the root zone contains the root node in the form of a double copy;
the failed node performing a CRC check and a MagicNumber check on both copies of the root node;
if both copies of the root node pass the checks, the failed node keeping the copy with the later timestamp in its memory; if only one copy of the root node passes the checks, the failed node keeping the copy that passes the checks in its memory.
A cluster management device is applied to a cluster-based full flash storage system, and comprises:
a first adding module, used for adding an I/O request newly received by a surviving node to a pending linked list when a failed node exists in the cluster;
a first control module, used for controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node;
a reading module, used for reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node;
a second control module, used for controlling the write cache module to open a disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions;
and an issuing module, used for issuing the I/O requests in the pending linked list.
Preferably, the device further comprises:
a dividing and storing module, used for dividing a preset region from the logical address space of the disk as the root zone of the metadata before the root zone of the metadata whose master node is the failed node is read into the memory of the surviving node, and for storing the root node of the metadata in the root zone.
A cluster management device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the cluster management method according to any one of the above when executing the computer program.
A computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the cluster management method according to any of the preceding claims.
The present application provides a cluster management method, device, equipment and computer-readable storage medium. The method is applied to a cluster-based full flash storage system and includes: when a failed node exists in the cluster, adding an I/O request newly received by a surviving node to a pending linked list; controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node; reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node; controlling the write cache module to open the disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions; and issuing the I/O requests in the pending linked list. According to the technical solution disclosed in the present application, when a failed node exists in the cluster, all services of the failed node are taken over by means of the surviving node and the transaction module and write cache module in the metadata of the surviving node, so that high availability of the full flash storage system is achieved and its reliability is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a cluster management method according to an embodiment of the present application;
fig. 2 is a flowchart for taking over a service of a failed node according to an embodiment of the present application;
fig. 3 is a flowchart of a failed node service switch back provided in an embodiment of the present application;
fig. 4 is a schematic diagram illustrating metadata root zone recovery performed by a failed node after restoration according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a cluster management device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a cluster management device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, which shows a flowchart of a cluster management method provided in an embodiment of the present application, applied to a full flash storage system based on a cluster, and including:
s11: and when the fault node exists in the cluster, adding the I/O request newly received by the surviving node into the to-be-processed linked list.
For a cluster-based full flash storage system, a plurality of nodes (i.e., a plurality of controllers) are included, and the nodes are divided into master nodes and slave nodes.
When one node in the cluster fails (the failed node is the failed node), in order to achieve high availability of the full flash storage system, the surviving node in the cluster (i.e., the surviving node) needs to take over all the services of the failed node, and in the process of taking over, the node is allowed to receive the I/O request sent by the upper layer service.
Considering that, in the system running process, for each process stage, operations may be performed, some configuration information needs to be read, and in order to be able to safely modify the configuration information, it is necessary to ensure that modifications are not performed in the running process, therefore, when a failed node exists in the cluster, the surviving node enters a silent state to notify the corresponding module to perform corresponding processing (specifically, to complete processing on a task being processed, and not to start processing on the task), so as to achieve a task that is not being processed within the whole process, so as to safely modify the task.
In the process that the surviving node performs the state in silence, the upper layer service can still issue a new I/O request (only the new I/O request issued by the upper layer service is not responded temporarily), at this time, the surviving node adds the newly received I/O request into the to-be-processed linked list, so as to issue and process the I/O request after all services of the failed node are taken over.
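As a rough illustration of this queuing behavior, the following sketch (in C) shows how a surviving node might hold newly received requests on a pending linked list while it is quiescing; the type and function names (io_request_t, pending_list_t, enqueue_if_quiescing) are assumptions made for this example and are not taken from the patent.

```c
#include <pthread.h>

typedef struct io_request {
    struct io_request *next;
    /* payload (logical address, length, buffer, ...) omitted */
} io_request_t;

typedef struct {
    io_request_t   *head;
    io_request_t   *tail;
    pthread_mutex_t lock;
    int             quiescing;  /* set when a failed node has been detected */
} pending_list_t;

/* While quiescing, a newly received I/O request is queued instead of issued. */
static int enqueue_if_quiescing(pending_list_t *pl, io_request_t *req)
{
    int queued = 0;
    pthread_mutex_lock(&pl->lock);
    if (pl->quiescing) {
        req->next = NULL;
        if (pl->tail != NULL)
            pl->tail->next = req;
        else
            pl->head = req;
        pl->tail = req;
        queued = 1;              /* not answered yet; issued after takeover */
    }
    pthread_mutex_unlock(&pl->lock);
    return queued;               /* 0 means the caller issues it normally */
}
```

Once the takeover finishes, the node walks this list and issues each queued request through the normal submission path (step S15).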
S12: control the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switch the write mode of the metadata to the log mode, and switch the master node mode to a mode that excludes the failed node.
Under normal conditions there is message exchange and acknowledgment between the master node and the slave node, so a node sometimes has to wait for a peer message. For example, node A sends a message to node B, and after node B completes the operation it replies with a corresponding message, so that node A knows that node B has received the message and finished the processing. When a node fails, however, the failed node can no longer send messages to the surviving node, so the surviving node may actively cancel the wait: specifically, the surviving node controls the transaction module and the write cache module in its metadata to stop waiting for messages sent by the failed node, and may mark those messages as failed.
As for the metadata: for ordinary I/O services, the metadata needs to manage the mapping from logical addresses to physical addresses (LP); for the garbage collection function, the metadata needs to manage the mapping from physical addresses to logical addresses (PL); and for the supported deduplication function, the metadata needs to manage the mapping from the fingerprint values of the I/O to physical addresses (HP). A single I/O therefore requires the LP, PL and HP mappings to be modified several times, so transactions are needed to guarantee atomicity; a sketch of these mappings follows.
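The sketch below gives a minimal picture of the three mappings and of grouping their updates into one transaction; the field widths, the 20-byte fingerprint and all names are illustrative assumptions rather than the patent's actual layout.

```c
#include <stdint.h>

typedef uint64_t lba_t;                       /* logical block address   */
typedef uint64_t pba_t;                       /* physical block address  */
typedef struct { uint8_t bytes[20]; } fp_t;   /* I/O fingerprint value   */

/* LP: logical address -> physical address, used by ordinary I/O         */
typedef struct { lba_t lba; pba_t pba; } lp_entry_t;
/* PL: physical address -> logical address, used by garbage collection   */
typedef struct { pba_t pba; lba_t lba; } pl_entry_t;
/* HP: fingerprint value -> physical address, used by deduplication      */
typedef struct { fp_t fp; pba_t pba; } hp_entry_t;

/* One I/O touches LP, PL and HP together, so the three updates are grouped
 * into one transaction that either completes entirely or is rolled back. */
typedef struct {
    lp_entry_t lp_update;
    pl_entry_t pl_update;
    hp_entry_t hp_update;
    int        committed;
} metadata_txn_t;
```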
The metadata can be divided internally into the following modules. The metadata object module is responsible for managing metadata objects, including LUN (Logical Unit Number, which may be briefly referred to as a logical volume) information and the root node of the metadata tree structure (the metadata in this application takes a B+ tree as an example), as well as operations such as initializing, updating and recovering the root-zone data structure. The transaction module, as noted above, provides the transaction mechanism that guarantees atomicity, because one request can be divided into multiple sub-requests: if all of them complete, the request completes; if one sub-request fails, the request fails, a rollback and redo is needed, and the sub-requests that have already completed must be cancelled. The write cache module is responsible for caching the processing of service I/O in memory and, according to service requirements, works in either WRITE_BACK (write-back) mode or WRITE_THROUGH (write-through) mode: in WRITE_BACK mode, the write cache sets aside a certain amount of memory, caches the operations sent by the transaction module, and flushes them to disk once certain conditions are met, whereas in WRITE_THROUGH mode the requests sent by the transaction module are written to disk directly. The B+ tree module is responsible for implementing the metadata B+ tree operation algorithms; the read cache module is responsible for reading and caching metadata; and the query module is responsible for metadata query operations.
After the transaction module and the write cache module in the metadata of the surviving node have been controlled to stop waiting for messages sent by the failed node, the surviving node enters the quiescent state; it is then guaranteed that no I/O is being processed, and the configuration information can be modified safely. Specifically, the surviving node switches the write mode of the metadata to the log mode (i.e., the LOGGING mode, which means that the operation log of the transaction module is protected by writing it to disk; in the LOGGING mode the write cache module correspondingly works in write-through mode). Concretely, the dual-controller mirror mode of the transaction module is switched to the single-controller log mode, and the dual-controller write-back mode of the write cache module is switched to the single-controller write-through mode. Before the failed node failed, the write mode of the metadata was the mirror mode (i.e., the CACHE mode, which means that the operation log of the transaction module is synchronized only by writing it to the peer controller (i.e., the peer node); in the CACHE mode the write cache module correspondingly works in write-back mode).
In addition, the surviving node switches the master node mode to a mode that excludes the failed node, i.e., the failed node is excluded from subsequent master node assignment; for a cluster containing two nodes, the master node mode is switched to a mode containing only the surviving node (i.e., switched to single-node mode).
It should be noted that after the write mode and the master node mode have been switched, subsequent I/O requests are processed in the new mode, which achieves high availability of the system and improves its reliability. A sketch of this combined switch follows.
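The following sketch summarizes the switch from the dual-controller mirror/write-back configuration to the single-controller log/write-through configuration, together with the exclusion of the failed node from master assignment; the enum and field names are assumptions made only for illustration.

```c
/* Illustrative mode flags; the names are assumptions for this sketch. */
typedef enum { TXN_MODE_MIRROR, TXN_MODE_LOGGING } txn_mode_t;
typedef enum { CACHE_WRITE_BACK, CACHE_WRITE_THROUGH } cache_mode_t;

typedef struct {
    txn_mode_t   txn_mode;     /* transaction log: mirrored to peer vs. on disk */
    cache_mode_t cache_mode;   /* write cache: write-back vs. write-through     */
    unsigned     master_mask;  /* bitmap of nodes eligible to act as master     */
} metadata_config_t;

/* On failure: protect the transaction log on disk instead of mirroring it,
 * make the write cache write through, and exclude the failed node from
 * master assignment (single-node mode in a two-node cluster).             */
static void switch_to_failed_peer_mode(metadata_config_t *cfg,
                                       unsigned failed_node_id)
{
    cfg->txn_mode    = TXN_MODE_LOGGING;
    cfg->cache_mode  = CACHE_WRITE_THROUGH;
    cfg->master_mask &= ~(1u << failed_node_id);
}
```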
S13: read the root zone of the metadata whose master node is the failed node into the memory of the surviving node.
After the write mode and the master node mode have been switched, the surviving node needs to restore the root zone of the metadata. The root zone of the metadata is also divided between master and slave, and each node in the cluster keeps only the root zone for which it is the master node. Before the failure, the surviving node therefore held the root zone of the metadata for which the surviving node is the master, while the failed node held the root zone of the metadata for which the failed node is the master. When the failed node fails and the surviving node needs to restore the root zone of the metadata in order to take over the services of the failed node, the surviving node must read the root zone of the metadata whose master node is the failed node into its own memory, after which the metadata can be operated on.
S14: control the write cache module to open the disk-flushing switch, and control the transaction module to roll back and redo the incomplete transactions.
After the metadata root zone has been restored, the surviving node controls the write cache module to open the disk-flushing switch and controls the transaction module to roll back and redo the incomplete transactions.
It should be noted that before the transaction module starts to work, the write cache module must be allowed to flush in the single-controller write-through mode, and the transaction module must perform the redo according to the new write mode (i.e., the single-controller log mode), as sketched below.
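A simplified sketch of the rollback-and-redo pass over incomplete transactions is shown below; the transaction record layout and the two helper functions are placeholders standing in for the real undo and replay logic, which the patent does not spell out.

```c
typedef struct txn_record {
    struct txn_record *next;
    int sub_done;    /* sub-requests (LP/PL/HP updates) already applied */
    int sub_total;   /* sub-requests the transaction consists of        */
} txn_record_t;

/* Placeholder helpers: a real implementation would undo the applied
 * sub-requests and then replay the transaction under the log mode.     */
static void undo_applied_subrequests(txn_record_t *t) { (void)t; }
static void redo_transaction(txn_record_t *t)         { (void)t; }

/* Walk the outstanding transactions once the flush switch is open. */
static void rollback_redo_incomplete(txn_record_t *list)
{
    for (txn_record_t *t = list; t != NULL; t = t->next) {
        if (t->sub_done == t->sub_total)
            continue;                    /* already complete, nothing to do */
        undo_applied_subrequests(t);     /* cancel the partial effects      */
        redo_transaction(t);             /* redo under the new write mode   */
    }
}
```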
S15: issue the I/O requests in the pending linked list.
After the transaction module has rolled back and redone the incomplete transactions, the surviving node enters the running state. At this point the surviving node has finished taking over the tasks of the failed node: it can re-issue the I/O requests in the pending linked list and is again allowed to accept new I/O requests, so the full flash storage system keeps working normally, high availability of the system is achieved, and its reliability is improved.
In addition, while issuing the I/O requests in the pending linked list, or after issuing them, the surviving node can send the upper-layer service a notification that the takeover of the failed node's tasks is complete, so that the upper-layer service learns in time that the services of the failed node have been taken over successfully. Taking a cluster that includes a node A and a node B as an example, where node B is the failed node and node A is the surviving node, refer to fig. 2, which shows a flowchart for taking over the services of the failed node provided in an embodiment of the present application.
According to the technical solution disclosed in the present application, when a failed node exists in the cluster, all services of the failed node are taken over by means of the surviving node and the transaction module and write cache module in the metadata of the surviving node, so that high availability of the full flash storage system is achieved and its reliability is improved.
Before reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node, the cluster management method provided in the embodiment of the present application may further include:
dividing a preset region from the logical address space of the disk as the root zone of the metadata, and storing the root node of the metadata in the root zone.
The general principle for metadata is to ensure that its data structure is always ready for read and write operations. Since the most important part of tree-structured metadata is the root node, and the tree can only be operated on once the root node is available, a preset region can be divided from the logical address space of the disk to serve as the root zone of the metadata (for example, a portion of the space starting from address zero is set aside as the root zone), and the root node of the metadata is stored in the root zone, so that when an exception occurs in the system the root node can be read directly into memory through its logical address and recovered; the sketch below illustrates the idea.
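The sketch below illustrates reserving a fixed root zone at the front of the disk's logical address space, with room for two copies of the root node; the concrete offset and sizes are assumptions chosen for the example, not values from the patent.

```c
#include <stdint.h>

#define ROOT_ZONE_OFFSET  0ULL             /* root zone starts at logical address 0 */
#define ROOT_ZONE_SIZE    (1ULL << 20)     /* e.g. reserve the first 1 MiB          */
#define ROOT_COPY_SIZE    (ROOT_ZONE_SIZE / 2)  /* room for two copies of the root  */

/* Because the offset is fixed, the root node can be read back with a plain
 * logical-address read even when the rest of the metadata tree is unusable. */
static inline uint64_t root_copy_address(int copy /* 0 or 1 */)
{
    return ROOT_ZONE_OFFSET + (uint64_t)copy * ROOT_COPY_SIZE;
}
```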
In addition, the root zone may be initialized once when the log volume is created, that is, the initialized root node is written to disk at that time.
In the cluster management method provided in the embodiment of the present application, storing the root node of the metadata in the root zone may include:
storing the root node in the root zone in the form of a double copy.
In order to improve the reliability of the root zone and the root node, the root node may be stored in the root zone in the form of a double copy, so that a redundancy design is implemented through the double copy and the reliability of the root zone and the root node is improved.
In the cluster management method provided in the embodiment of the present application, the root node may include a LunID, a CRC check value, and a MagicNumber.
The metadata root node includes, but is not limited to, a LunID, a Cyclic Redundancy Check (CRC) check value, and a MagicNumber corresponding to the current tree.
Every time the rootAddress is modified, the CRC check value is recalculated and updated in the data structure of the root node.
The CRC check value and the MagicNumber stored in the root node allow the root node to be verified in more than one way, which improves the reliability of retrieving the root node; a sketch of such a root-node record follows.
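A possible on-disk layout of such a root-node record, with the CRC recomputed whenever the rootAddress changes, is sketched below; the exact field order, the magic constant and the CRC routine are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ROOT_MAGIC 0x524F4F54u   /* "ROOT": an illustrative MagicNumber */

typedef struct {
    uint32_t magic;         /* MagicNumber, a fixed constant for sanity checking */
    uint32_t lun_id;        /* LunID of the tree this root node belongs to       */
    uint64_t root_address;  /* rootAddress: disk address of the B+ tree root     */
    uint64_t timestamp;     /* used to pick the newer of the two copies          */
    uint32_t crc;           /* CRC check value over the fields above             */
} root_node_t;

/* Recompute the CRC check value whenever rootAddress is modified. */
static void root_node_set_address(root_node_t *rn, uint64_t new_root_address,
                                  uint64_t now,
                                  uint32_t (*crc32)(const void *, size_t))
{
    rn->root_address = new_root_address;
    rn->timestamp    = now;
    rn->crc          = crc32(rn, offsetof(root_node_t, crc));
}
```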
The cluster management method provided in the embodiment of the present application may further include:
when the failed node recovers to normal, adding a newly received I/O request to the pending linked list;
waiting for the write cache module in the metadata of the surviving node to complete the flushing tasks in progress, and removing, from the metadata read cache module, the read cache data whose master node is the failed node;
switching the write mode of the metadata to the mirror mode, and switching the master node mode to a mode that includes the recovered node;
and synchronizing the write mode and the master node mode to the recovered node, and the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory.
When the failed node recovers to normal, in order to improve the reliability and performance of the full flash storage system, the recovered node needs to take back the tasks that originally belonged to it, and a full flash storage system containing two nodes needs to re-form the dual-controller mode. This involves a service switch-back operation, during which service I/O is not interrupted.
When the failed node comes online again, all of its modules perform an initialization operation and recover their configuration information. Since all configuration information has been retained on the surviving node, the node that comes back online only needs to ensure that it is synchronized.
After the failed node recovers to normal, the surviving node enters the quiescent state in order to modify the configuration information safely; at this time the surviving node adds newly received I/O requests to the pending linked list and re-issues them in the running stage.
After adding the newly received I/O requests to the pending linked list, the surviving node waits for the write cache module in its metadata to complete the flushing tasks in progress, and can then remove, from the metadata read cache module, the read cache data whose master node is the failed node. Because the surviving node took over the services of the failed node before the failed node recovered, read cache entries that originally belonged to the failed node are cached on the surviving node; after the failed node recovers, the surviving node removes the read cache data that originally belonged to the failed node in order to reduce its own load and resource occupation, for example as in the sketch below.
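A minimal sketch of evicting the read-cache entries whose master is the recovered node is given below; the cache is modeled as a simple singly linked list, which is an assumption made only for illustration.

```c
typedef struct rc_entry {
    struct rc_entry *next;
    unsigned         master_node;   /* node that is master for this cached data */
    /* cached metadata payload omitted */
} rc_entry_t;

typedef struct {
    rc_entry_t *head;               /* singly linked list of cached entries */
} read_cache_t;

static void free_entry(rc_entry_t *e) { (void)e; /* release buffers, etc. */ }

/* Remove every read-cache entry whose master node is the recovered peer. */
static void evict_peer_entries(read_cache_t *rc, unsigned recovered_node_id)
{
    rc_entry_t **pp = &rc->head;
    while (*pp != NULL) {
        if ((*pp)->master_node == recovered_node_id) {
            rc_entry_t *victim = *pp;
            *pp = victim->next;     /* unlink the entry */
            free_entry(victim);
        } else {
            pp = &(*pp)->next;
        }
    }
}
```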
Then, while in the quiescent state, the surviving node switches the write mode back to the mirror mode and switches the master node mode to a mode that includes the recovered node (i.e., switches the master node mode back to the normal mode). For a description of the mirror mode, refer to the corresponding part of the description of the surviving node taking over the services of the failed node, which is not repeated here.
After the surviving node has switched the write mode and the master node mode, the recovered node also enters the quiescent state, and the surviving node can synchronize the switched write mode and master node mode to the recovered node.
Once the recovered node has been synchronized with the surviving node, it can restore the root zone of the metadata whose master node is the failed node from the disk into its own memory, thereby completing the switch-back of the services.
In addition, after the recovered node completes the switch-back of the services, the surviving node and the recovered node can notify the upper-layer service that the switch-back has been completed, so that the full flash storage system can work in the normal mode, which improves the reliability and performance of the full flash storage system.
Taking a cluster that includes a node A and a node B as an example, where node B is the failed node and node A is the surviving node, refer to fig. 3, which shows a flowchart of the service switch-back of the failed node provided in an embodiment of the present application.
Referring to fig. 4, which shows a schematic diagram of the metadata root-zone recovery performed by the failed node after it recovers to normal according to an embodiment of the present application. In the cluster management method provided in the embodiment of the present application, the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory may include:
S41: the failed node traverses the logical addresses of the disk and reads the root zone of the metadata whose master node is the failed node into its own memory.
The root zone contains the root node in the form of a double copy.
As described above, a preset region is divided from the logical address space of the disk as the root zone of the metadata, the root node is stored in the root zone in the form of a double copy, and the root node contains the CRC check value and the MagicNumber; therefore, when the metadata root zone is restored, the failed node can traverse the logical addresses of the disk and read the root zone whose master node is the failed node into its own memory.
Then, the failed node selects, according to the timestamp, the CRC check value and the MagicNumber, which of the two copies of the root node read into memory to keep (see steps S42 to S47 and the sketch after them), so as to ensure that the recovered metadata root zone and root node are correct.
S42: the failed node performs the CRC check and the MagicNumber check on both copies of the root node.
S43: the failed node judges whether both copies of the root node pass the checks; if so, step S44 is executed, and if not, step S45 is executed.
S44: the failed node keeps the copy of the root node with the later timestamp in its memory.
S45: the failed node judges whether one of the two copies of the root node passes the checks; if so, step S46 is executed, and if not, step S47 is executed.
S46: the failed node keeps the copy of the root node that passes the checks in its memory.
S47: the failed node marks the root zone as damaged.
An embodiment of the present application further provides a cluster management device; see fig. 5, which shows a schematic structural diagram of the cluster management device provided in the embodiment of the present application. The device is applied to a cluster-based full flash storage system and may include:
a first adding module 51, configured to add, when a failed node exists in the cluster, an I/O request newly received by a surviving node to the pending linked list;
a first control module 52, configured to control the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switch the write mode of the metadata to the log mode, and switch the master node mode to a mode that excludes the failed node;
a reading module 53, configured to read the root zone of the metadata whose master node is the failed node into the memory of the surviving node;
a second control module 54, configured to control the write cache module to open the disk-flushing switch and control the transaction module to roll back and redo incomplete transactions;
and an issuing module 55, configured to issue the I/O requests in the pending linked list.
The cluster management device provided in this embodiment may further include:
a dividing and storing module, configured to divide a preset region from the logical address space of the disk as the root zone of the metadata, and to store the root node of the metadata in the root zone, before the root zone of the metadata whose master node is the failed node is read into the memory of the surviving node.
In the cluster management device provided in the embodiment of the present application, the dividing and storing module may include a storing unit, configured to store the root node in the root zone in the form of a double copy.
In the cluster management device provided in the embodiment of the present application, the root node may include a LunID, a CRC check value, and a MagicNumber.
The cluster management device provided in this embodiment may further include:
a second adding module, configured to add a newly received I/O request to the pending linked list after the failed node recovers to normal;
a waiting module, configured to wait for the write cache module in the metadata of the surviving node to complete the flushing tasks in progress, and to remove, from the metadata read cache module, the read cache data whose master node is the failed node;
a switching module, configured to switch the write mode of the metadata to the mirror mode and switch the master node mode to a mode that includes the recovered node;
and a synchronization module, configured to synchronize the write mode and the master node mode to the recovered node, wherein the recovered node restores the root zone of the metadata whose master node is the failed node from the disk into its own memory.
An embodiment of the present application further provides a cluster management device; see fig. 6, which shows a schematic structural diagram of the cluster management device provided in the embodiment of the present application, and the device may include:
a memory 61 for storing a computer program;
a processor 62, configured to execute the computer program stored in the memory 61 to implement the following steps:
when a failed node exists in the cluster, adding an I/O request newly received by a surviving node to a pending linked list; controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node; reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node; controlling the write cache module to open the disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions; and issuing the I/O requests in the pending linked list.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the following steps:
when a failed node exists in the cluster, adding an I/O request newly received by a surviving node to a pending linked list; controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node; reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node; controlling the write cache module to open the disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions; and issuing the I/O requests in the pending linked list.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For a description of a relevant part in a cluster management apparatus, a device, and a computer-readable storage medium provided in the embodiments of the present application, please refer to a detailed description of a corresponding part in a cluster management method provided in the embodiments of the present application, which is not described herein again.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. In addition, the parts of the technical solutions provided in the embodiments of the present application that are consistent with the implementation principles of the corresponding technical solutions in the prior art are not described in detail, so as to avoid redundant description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A cluster management method, applied to a cluster-based full flash storage system, comprising the following steps:
when a failed node exists in the cluster, adding an I/O request newly received by a surviving node to a pending linked list;
controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node;
reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node;
controlling the write cache module to open a disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions;
issuing the I/O requests in the pending linked list;
wherein, before reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node, the method further comprises:
dividing a preset region from the logical address space of the disk as the root zone of the metadata, and storing the root node of the metadata in the root zone.
2. The cluster management method according to claim 1, wherein storing the root node of the metadata in the root zone comprises:
storing the root node in the root zone in the form of a double copy.
3. The cluster management method according to claim 2, wherein the root node comprises a LunID, a CRC check value and a MagicNumber.
4. The cluster management method according to claim 3, further comprising:
when the failed node recovers to normal, adding a newly received I/O request to the pending linked list;
waiting for the write cache module in the metadata of the surviving node to complete the flushing tasks in progress, and removing, from the metadata read cache module, the read cache data whose master node is the failed node;
switching the write mode of the metadata to a mirror mode, and switching the master node mode to a mode that includes the recovered node;
and synchronizing the write mode and the master node mode to the recovered node, and the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory.
5. The cluster management method according to claim 4, wherein the recovered node restoring the root zone of the metadata whose master node is the failed node from the disk into its own memory comprises:
the failed node traversing the logical addresses of the disk and reading the root zone of the metadata whose master node is the failed node into its own memory, wherein the root zone contains the root node in the form of a double copy;
the failed node performing a CRC check and a MagicNumber check on both copies of the root node;
if both copies of the root node pass the checks, the failed node keeping the copy with the later timestamp in its memory; if only one copy of the root node passes the checks, the failed node keeping the copy that passes the checks in its memory.
6. A cluster management device, applied to a cluster-based full flash storage system, comprising:
a first adding module, used for adding an I/O request newly received by a surviving node to a pending linked list when a failed node exists in the cluster;
a first control module, used for controlling the transaction module and the write cache module in the metadata of the surviving node to stop waiting for messages sent by the failed node, switching the write mode of the metadata to a log mode, and switching the master node mode to a mode that excludes the failed node;
a reading module, used for reading the root zone of the metadata whose master node is the failed node into the memory of the surviving node;
a second control module, used for controlling the write cache module to open a disk-flushing switch and controlling the transaction module to roll back and redo incomplete transactions;
an issuing module, used for issuing the I/O requests in the pending linked list;
and further comprising:
a dividing and storing module, used for dividing a preset region from the logical address space of the disk as the root zone of the metadata before the root zone of the metadata whose master node is the failed node is read into the memory of the surviving node, and for storing the root node of the metadata in the root zone.
7. A cluster management device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the cluster management method according to any of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the cluster management method according to any of the claims 1 to 5.
CN201910785358.2A 2019-08-23 2019-08-23 Cluster management method, device and equipment and readable storage medium Active CN110515557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910785358.2A CN110515557B (en) 2019-08-23 2019-08-23 Cluster management method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910785358.2A CN110515557B (en) 2019-08-23 2019-08-23 Cluster management method, device and equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN110515557A CN110515557A (en) 2019-11-29
CN110515557B true CN110515557B (en) 2022-06-17

Family

ID=68626592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910785358.2A Active CN110515557B (en) 2019-08-23 2019-08-23 Cluster management method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110515557B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109358812A (en) * 2018-10-09 2019-02-19 郑州云海信息技术有限公司 Processing method, device and the relevant device of I/O Request in a kind of group system
CN111124307B (en) * 2019-12-20 2022-06-07 北京浪潮数据技术有限公司 Data downloading and brushing method, device, equipment and readable storage medium
CN113448513B (en) * 2021-05-28 2022-08-09 山东英信计算机技术有限公司 Data reading and writing method and device of redundant storage system
CN113342512B (en) * 2021-08-09 2021-11-19 苏州浪潮智能科技有限公司 IO task silencing and driving method and device and related equipment
CN115905114B (en) * 2023-03-09 2023-05-30 浪潮电子信息产业股份有限公司 Batch updating method and system of metadata, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7805632B1 (en) * 2007-09-24 2010-09-28 Net App, Inc. Storage system and method for rapidly recovering from a system failure
CN104933132A (en) * 2015-06-12 2015-09-23 广州巨杉软件开发有限公司 Distributed database weighted voting method based on operating sequence number
CN105159818A (en) * 2015-08-28 2015-12-16 东北大学 Log recovery method in memory data management and log recovery simulation system in memory data management
CN109582502A (en) * 2018-12-03 2019-04-05 郑州云海信息技术有限公司 Storage system fault handling method, device, equipment and readable storage medium storing program for executing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060117505A (en) * 2005-05-11 2006-11-17 인하대학교 산학협력단 A recovery method using extendible hashing based cluster log in a shared-nothing spatial database cluster

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7805632B1 (en) * 2007-09-24 2010-09-28 Net App, Inc. Storage system and method for rapidly recovering from a system failure
CN104933132A (en) * 2015-06-12 2015-09-23 广州巨杉软件开发有限公司 Distributed database weighted voting method based on operating sequence number
CN105159818A (en) * 2015-08-28 2015-12-16 东北大学 Log recovery method in memory data management and log recovery simulation system in memory data management
CN109582502A (en) * 2018-12-03 2019-04-05 郑州云海信息技术有限公司 Storage system fault handling method, device, equipment and readable storage medium storing program for executing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
B. Y. Kushal; M. Chitra. "Cluster based routing protocol to prolong network lifetime through mobile sink in WSN". 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). 2017. *
Wang Jiahao; Cai Peng; Qian Weining; Zhou Aoying. "Log Replication and Failure Recovery in Cluster Database Systems". Journal of Software. 2016. *

Also Published As

Publication number Publication date
CN110515557A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110515557B (en) Cluster management method, device and equipment and readable storage medium
US20230117542A1 (en) Remote Data Replication Method and System
US10860547B2 (en) Data mobility, accessibility, and consistency in a data storage system
CN103077222B (en) Cluster file system distributed meta data consistance ensuring method and system
US7779295B1 (en) Method and apparatus for creating and using persistent images of distributed shared memory segments and in-memory checkpoints
CN102891849B (en) Service data synchronization method, data recovery method, data recovery device and network device
JP5559821B2 (en) Method for storing data, method for mirroring data, machine-readable medium carrying an instruction sequence, and program for causing a computer to execute the method
US7836162B2 (en) Transaction processing system and transaction processing method
JP2006023889A (en) Remote copy system and storage system
CN110673978B (en) Data recovery method and related device after power failure of double-control cluster
JP5201133B2 (en) Redundant system, system control method and system control program
CN113220729A (en) Data storage method and device, electronic equipment and computer readable storage medium
WO2018076633A1 (en) Remote data replication method, storage device and storage system
CN113326006A (en) Distributed block storage system based on erasure codes
US10983709B2 (en) Methods for improving journal performance in storage networks and devices thereof
US10235256B2 (en) Systems and methods for highly-available file storage with fast online recovery
CN104991739A (en) Method and system for refining primary execution semantics during metadata server failure substitution
CN115955488B (en) Distributed storage copy cross-machine room placement method and device based on copy redundancy
JP2009265973A (en) Data synchronization system, failure recovery method, and program
US10846012B2 (en) Storage system for minimizing required storage capacity during remote volume replication pair duplication
US10656867B2 (en) Computer system, data management method, and data management program
JP5488681B2 (en) Redundant system, control method and control program
Al Hubail Data replication and fault tolerance in AsterixDB
CN114860773A (en) Method and device for solving double-write consistency of cache database
CN113835930A (en) Cache service recovery method, system and device based on cloud platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant