CN118093250A - Fault processing method and device, electronic equipment and storage medium


Info

Publication number
CN118093250A
Authority
CN
China
Prior art keywords
cluster
recovery
event
node
state
Prior art date
Legal status
Pending
Application number
CN202410504909.4A
Other languages
Chinese (zh)
Inventor
Zhao Peng (赵鹏)
Current Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410504909.4A
Publication of CN118093250A

Abstract

The application discloses a fault handling method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology. The method comprises: when an error occurs in executing a target event, terminating execution of the target event and modifying the cluster state to a recovery state; when a first cluster recovery event submitted by the master node is received, initiating read-write verification on the cluster recovery reserved area of each user volume whose home node is the present node, and submitting a second cluster recovery event whose content indicates that recovery of the present node succeeded if the verification succeeds, or failed if the verification fails; and when a recovery completion event submitted by the master node is received, modifying the cluster state from the recovery state to a normal state. The master node counts, among the received second cluster recovery events, the proportion of nodes that recovered successfully and submits the recovery completion event if the proportion is greater than a preset value. The application improves fault handling and cluster recovery efficiency.

Description

Fault processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a fault handling method, a fault handling device, an electronic device, and a storage medium.
Background
Distributed storage clusters typically rely on a distributed consistency protocol to build a consistency framework and rely on the consistent view of cluster state that this framework provides to coordinate the behavior of the nodes in the cluster, thereby achieving high scalability and high availability. The control state machine on each node reads and writes the cluster state consistently under the coordination of the consistency framework and drives the application layer on each node to take the same action in the same state, so that the nodes in the cluster act in concert.
Under normal operation, the states and behaviors of all nodes in the cluster are consistent. However, if the cluster state changes abnormally, the abnormal value is read by the state machine on every node, and because all state machines behave identically, every node terminates its service process after reading the same abnormal value, bringing down both the service and the cluster. In the related art, recovering from such a failure depends heavily on manual intervention by field engineers, so fault handling and cluster recovery are inefficient.
Therefore, how to improve fault handling and cluster recovery efficiency is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a fault handling method and apparatus, an electronic device, and a storage medium that improve fault handling and cluster recovery efficiency.
To achieve the above object, the present application provides a fault handling method applied to a node in a distributed storage cluster, the method including:
when an error occurs in executing a target event, terminating execution of the target event and modifying the cluster state to a recovery state, wherein the master node submits a first cluster recovery event after the error occurs in executing the target event;
when the first cluster recovery event submitted by the master node is received, initiating read-write verification on the cluster recovery reserved area of each user volume whose home node is the present node; if the verification succeeds, submitting a second cluster recovery event whose content indicates that recovery of the present node succeeded, and if the verification fails, submitting a second cluster recovery event whose content indicates that recovery of the present node failed;
when a recovery completion event submitted by the master node is received, modifying the cluster state from the recovery state to a normal state, wherein the master node counts, among the received second cluster recovery events, the proportion of nodes that recovered successfully, and submits the recovery completion event if the proportion is greater than a preset value.
Wherein terminating execution of the target event when an error occurs in executing the target event comprises:
terminating execution of the target event when a code assertion fails while the target event is executed on the first cluster copy.
Before the cluster state is modified to the recovery state, the method further comprises:
overwriting the first cluster copy with the second cluster copy to obtain a new first cluster copy.
Wherein modifying the cluster state to the recovery state comprises:
modifying the cluster state of the new first cluster copy to the recovery state, then switching to the second cluster copy and modifying its cluster state to the recovery state as well.
Before the cluster state is modified to the recovery state, the method further comprises:
recording error information of the target event in a diagnostic log and storing the error information in local storage.
The error information comprises any one or any combination of: the memory address of the code segment where the assertion is located, the level of the target event, and the content of the target event.
Wherein, after the error occurs in executing the target event, the master node submits a first cluster recovery event whose content is the level of the target event;
correspondingly, before initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node, the method further comprises:
judging whether the level contained in the first cluster recovery event is consistent with the level recorded by the present node;
if consistent, executing the step of initiating read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node;
if not consistent, submitting a second cluster recovery event whose content indicates that recovery of the present node failed.
Wherein, after modifying the cluster state to the recovery state, the method further comprises:
determining the target node that submitted the target event.
Wherein, after determining the target node that submitted the target event, the method further comprises:
generating alarm information based on the node information of the target node and the error information of the target event.
Wherein after generating the alarm information based on the node information of the target node and the error information of the target event, the method further comprises:
calling a preset callback and setting the error code of the preset callback to an event commit error, wherein the target node, after receiving the event commit error code, is prohibited from resubmitting the target event.
Wherein the method further comprises:
when an event to be executed is received, determining the cluster state;
if the cluster state is the recovery state, judging whether the event to be executed is the first cluster recovery event;
if yes, executing the step of initiating read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node.
Wherein after determining the cluster state, the method further comprises:
if the cluster state is the normal state, directly executing the event to be executed.
Wherein after the determining whether the event to be executed is the first cluster recovery event, the method further includes:
if the event to be executed is not the first cluster recovery event, skipping execution of the event to be executed.
Wherein initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node comprises:
initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events.
Wherein initiating the read-write verification through the interface by which the host issues read-write events comprises:
sending a write request to the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events, so as to write target data into the cluster recovery reserved area;
sending a read request to the cluster recovery reserved area of the user volume whose home node is the present node through the same interface, so as to read data from the cluster recovery reserved area;
judging whether the read data is consistent with the target data; if yes, the verification succeeds.
Wherein, after sending the write request to the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events, the method further comprises:
if the write waiting time exceeds a preset time, submitting a second cluster recovery event whose content indicates that recovery of the present node failed.
The second cluster recovery event further includes the index of the node, and the master node controls the corresponding nodes to leave the distributed storage cluster according to the indexes contained in the second cluster recovery events that indicate recovery failure.
To achieve the above object, the present application provides a fault handling apparatus applied to a node in a distributed storage cluster, the apparatus comprising:
a termination module, configured to terminate execution of a target event when an error occurs in executing the target event, and to modify the cluster state to a recovery state, wherein the master node submits a first cluster recovery event after the error occurs in executing the target event;
a verification module, configured to initiate, when the first cluster recovery event submitted by the master node is received, read-write verification on the cluster recovery reserved area of each user volume whose home node is the present node, to submit a second cluster recovery event whose content indicates that recovery of the present node succeeded if the verification succeeds, and to submit a second cluster recovery event whose content indicates that recovery of the present node failed if the verification fails;
a modification module, configured to modify the cluster state from the recovery state to a normal state when a recovery completion event submitted by the master node is received, wherein the master node counts, among the received second cluster recovery events, the proportion of nodes that recovered successfully, and submits the recovery completion event if the proportion is greater than a preset value.
To achieve the above object, the present application provides an electronic device including:
a memory for storing a computer program;
and a processor for implementing the steps of the fault handling method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the fault handling method as described above.
According to the above scheme, the fault handling method provided by the application comprises: when an error occurs in executing a target event, terminating execution of the target event and modifying the cluster state to a recovery state, wherein the master node submits a first cluster recovery event after the error occurs in executing the target event; when the first cluster recovery event submitted by the master node is received, initiating read-write verification on the cluster recovery reserved area of each user volume whose home node is the present node, and submitting a second cluster recovery event whose content indicates that recovery of the present node succeeded if the verification succeeds, or failed if the verification fails; and when a recovery completion event submitted by the master node is received, modifying the cluster state from the recovery state to a normal state, wherein the master node counts, among the received second cluster recovery events, the proportion of nodes that recovered successfully and submits the recovery completion event if the proportion is greater than a preset value.
In the application, a cluster recovery reserved area is set aside on each user volume. When an error occurs in executing a target event, instead of having the node exit the cluster as in the related art, new logic is executed: execution of the target event is terminated, the cluster state is modified to the recovery state, each node initiates read-write verification on the cluster recovery reserved area of the user volumes whose home node is that node, and cluster recovery completes when the master node counts a proportion of successfully verified nodes greater than a preset value. The fault handling method provided by the application therefore realizes automatic cluster recovery and improves fault handling and cluster recovery efficiency. The application further discloses a fault handling apparatus, an electronic device, and a computer-readable storage medium that achieve the same technical effects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the embodiments or in the description of the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort. The accompanying drawings are included to provide a further understanding of the disclosure, are incorporated in and constitute a part of this specification, and together with the description serve to explain, but not to limit, the disclosure. In the drawings:
FIG. 1 is an architecture diagram of nodes in a distributed storage cluster, according to an example embodiment;
FIG. 2 is a flow chart illustrating a fault handling method according to an exemplary embodiment;
FIG. 3 is a flow chart illustrating another fault handling method according to an exemplary embodiment;
FIG. 4 is a block diagram of a fault handling apparatus according to an exemplary embodiment;
Fig. 5 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. In addition, in the embodiments of the present application, "first", "second", etc. are used to distinguish similar objects and are not necessarily used to describe a particular order or precedence.
The application is applied to a distributed storage cluster comprising a plurality of servers interconnected through a network, each storage server being provided with back-end disks or a back-end disk enclosure. The disks may be shared within the storage cluster, and multi-layer virtualization may be built on the back-end disks, including storage pools, RAID (Redundant Array of Independent Disks) groups, virtual disks, and the like, to provide data storage access services with higher performance, throughput, and availability than stand-alone storage. The storage server cluster and the front-end hosts access the same front-end network, through which the storage servers serve storage services (virtual disks) to the front-end hosts.
In a distributed storage cluster, critical state data needs to be shared within the cluster to avoid single point failures causing data to be inaccessible. Typical critical state data includes the mapping of virtual disks to physical disks, the mapping of hosts to virtual disks, whether a virtual disk is online, whether a host is online, and so forth. To achieve this goal, each node within the storage cluster runs the same cluster and application software. The architecture of each node is shown in fig. 1, and includes a consistency protocol layer, a service module control layer, and a service module application layer.
The consistency protocol layer is used for maintaining the existence of clusters, collecting and distributing cluster events and providing a cluster state space for the service module.
Specifically, when the network links between all nodes in the cluster are normal, the consistency protocol layer maintains a cluster heartbeat to confirm that all nodes in the cluster are alive. When a node's heartbeat is lost or a link fault occurs, the consistency protocol layer on each node calculates whether the number of nodes in its network partition exceeds half of the total number of nodes in the last stable cluster; if so, that network partition takes over the cluster.
The event collection of the consistency protocol layer faces each service module: every service module can send events to the consistency protocol layer, and the event distribution of the consistency protocol layer ensures that every node receives the same event sequence.
The consistency protocol layer provides a cluster state space for the service modules, which can read and write it through a fixed interface provided by the consistency protocol layer. The consistency protocol layer ensures that the initial cluster state of every node is the same and that the write actions of the service modules on the cluster state are completely consistent, thereby ensuring that the states on all nodes stay consistent. To ensure atomicity of the series of write operations triggered by a single event, the cluster state has two completely identical copies, a first cluster copy and a second cluster copy, and a service module must write the two copies serially to complete a state modification. Because there are two copies, if a node fails while writing either copy, the state of the other copy remains intact, so the cluster state can be restored to a consistent state by rolling back or rolling forward from the other copy.
Events in the cluster carry a level that increases monotonically from zero. Events are stored temporarily and persistently on the nodes of the cluster during distribution; once an event has finished executing, its effect is complete and its level becomes the latest level of the cluster state, so it no longer needs to occupy storage space. Event levels grow without bound, but there is not enough room to keep all events forever, so each node keeps only a limited number of them. These buffered events are called recent events, and their maximum number is denoted RENM (RECENT_EVENT_NUM_MAX). Recent events are persistently recorded and updated by rolling over, and the queue holding them is the recent event circular queue, sketched below. Both copies of the cluster state and the recent events are persisted and are not lost due to software or hardware failure or power loss.
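As a non-limiting illustration, the recent event circular queue can be sketched as follows in Go; the RENM value, the type names, and the omission of the persistence path are assumptions of this sketch, not part of the application:

```go
package cluster

// RENM (RECENT_EVENT_NUM_MAX) — the real value is implementation-specific;
// 128 is an assumption for this sketch.
const RENM = 128

// Event carries a level assigned by the consistency protocol layer; levels
// increase monotonically from zero.
type Event struct {
	Level   uint64
	Payload []byte
}

// RecentEvents is the persisted recent event circular queue.
type RecentEvents struct {
	buf  [RENM]Event
	head int // index of the oldest buffered event
	size int
}

// Push appends a committed event at the tail, rolling the oldest event out
// once the queue holds RENM entries.
func (q *RecentEvents) Push(e Event) {
	if q.size < RENM {
		q.buf[(q.head+q.size)%RENM] = e
		q.size++
		return
	}
	q.buf[q.head] = e // overwrite the oldest slot
	q.head = (q.head + 1) % RENM
}

// LevelRange reports the [lo, hi] levels currently buffered; it is used to
// decide whether a returning node can catch up from recent events alone.
func (q *RecentEvents) LevelRange() (lo, hi uint64, ok bool) {
	if q.size == 0 {
		return 0, 0, false
	}
	return q.buf[q.head].Level, q.buf[(q.head+q.size-1)%RENM].Level, true
}
```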
An event commits successfully only if at least a majority of the nodes in the cluster return success. A successfully committed event enters the tail of the recent event circular queue to wait for execution, and events are dequeued for execution from the head of the queue. After one event finishes executing, the cluster state changes, and then the next event is dequeued. In normal operation the execution order of events is the same on all nodes of the cluster, although simultaneous execution in time is not strictly guaranteed.
The consistency protocol layer judges that nodes belong to the same cluster under two conditions: the nodes share a common unique cluster identifier, and the nodes in the cluster have pairwise interconnecting network links. Nodes meeting these conditions are automatically pulled into the cluster, and a node that no longer meets both conditions is kicked out of the cluster. When nodes leave, the consistency protocol layer judges whether the number of remaining nodes in the cluster exceeds half of the number of nodes in the last stably running cluster; only when it does can the remaining nodes take over the cluster.
During cluster operation, if a node leaves briefly because of a network or software failure, its state is likely no longer up to date when it returns to the cluster, and therefore needs to be synchronized from other nodes. Depending on how outdated the node's state and recent events are, one of two recovery modes applies, as sketched below. If the recent events stored in the cluster cover the level range [N, N+RENM-1] and the node's state level on return still falls within this window, it suffices for a node holding the missing events to forward them to the returning node. If the state level of every node in the cluster already exceeds N+RENM, that is, the events up to N+RENM have taken effect on all nodes and been merged into the cluster state, the returning node can no longer catch up by synchronizing recent events; in that case it synchronizes a complete cluster state from another node in the cluster.
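The choice between the two catch-up modes reduces to comparing the returning node's state level with the lowest event level still buffered in the cluster; a minimal sketch with illustrative names, continuing the Go sketch above:

```go
// RecoveryMode selects how a node that rejoins the cluster synchronizes.
type RecoveryMode int

const (
	CatchUpFromRecentEvents RecoveryMode = iota // forward only the missing events
	FullStateSync                               // copy a complete cluster state
)

// chooseRecoveryMode: if the first event the node is missing (nodeLevel+1) is
// still buffered somewhere in the cluster, recent events suffice; otherwise
// those events have already been merged into the cluster state and a full
// state synchronization is required.
func chooseRecoveryMode(nodeLevel, lowestBufferedLevel uint64) RecoveryMode {
	if nodeLevel+1 >= lowestBufferedLevel {
		return CatchUpFromRecentEvents
	}
	return FullStateSync
}
```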
Each service module can add a sub-module to the service module control layer, the core of which is an event handling function. After receiving an event sent by the consistency protocol layer, the service module calls the corresponding logic to read and write the corresponding cluster state, and decides, according to the states before and after the read-write, which specific actions of the service module application layer to invoke.
Each service module can add several sub-modules to the service module application layer, all controlled by the same control layer service sub-module. The service module application layer may send an event to the consistency protocol layer as needed, and the event may carry a callback. The event is distributed by the consistency protocol layer to each node in the cluster, and within each node it is handed to the corresponding service module control layer. After the control layer completes writing the two copies of the cluster state, it invokes the action of the corresponding application sub-module; upon completion, control returns to the consistency protocol layer, which finds the application layer sub-module that initiated the event and invokes the callback carried by the event.
This architecture ensures that the cluster state is replicated uniformly to all nodes in the cluster. Modifications of the cluster state are triggered by events, and the consistency protocol layer ensures that each node in the cluster executes these events in the same order. The framework makes the difficult task coordination of a storage cluster feasible, but if a defect in the event handling code of the service module control layer causes a node crash — typically a value assertion failing in the control layer logic between steps 3-5 in Fig. 1 — then all nodes crash when the event is executed, because the consistency protocol layer distributes the event to every node of the cluster. Worse, because these events are persisted, they are re-executed after a node reboots, eventually causing all nodes in the cluster to crash repeatedly. A single software defect can thus bring down the whole cluster, reduce its availability to single-machine level, and severely affect service continuity.
Therefore, in the application, a cluster recovery reserved area is set aside on each user volume. When an error occurs in executing a target event, instead of having the node exit the cluster as in the related art, new logic is executed: execution of the target event is terminated, the cluster state is modified to the recovery state, each node initiates read-write verification on the cluster recovery reserved area of the user volumes whose home node is that node, the master node counts the proportion of nodes whose verification succeeded, and cluster recovery completes when that proportion is greater than a preset value. The fault handling method provided by the application thus realizes automatic cluster recovery and improves fault handling and cluster recovery efficiency.
The embodiment of the application discloses a fault handling method that improves fault handling and cluster recovery efficiency.
Referring to fig. 2, a flowchart of a fault handling method is shown according to an exemplary embodiment, as shown in fig. 2, including:
S101: when an error occurs in executing a target event, terminating the execution of the target event, and modifying the cluster state into a recovery state; the method comprises the steps that a master node submits a first cluster recovery event after executing the target event and generating an error;
This embodiment is executed by each node in the distributed storage cluster. The target event reaches the service module control layer, and when an error occurs while the first cluster copy is being modified, the service module terminates execution of the target event and modifies the cluster state from the normal state to the recovery state.
As a possible implementation, modifying the cluster state to the recovery state when an error occurs in executing the target event comprises: modifying the cluster state to the recovery state when a code assertion fails while the target event is executed on the first cluster copy. In a specific implementation, since the service module control layer modifies the first cluster copy and the second cluster copy in two stages of event execution that are completely identical, the termination always occurs while the first cluster copy is being accessed. A failed assertion is encountered while the target event is executed on the first cluster copy, at which point the cluster state is modified from the normal state to the recovery state.
As a possible implementation, before modifying the cluster state to the recovery state, the method further comprises: overwriting the first cluster copy with the second cluster copy to obtain a new first cluster copy. In a specific implementation, all nodes in the cluster perform a state rollback, that is, the second cluster copy is used to overwrite the first cluster copy.
As a possible implementation, modifying the cluster state to the recovery state comprises: modifying the cluster state of the new first cluster copy to the recovery state, then switching to the second cluster copy and modifying its cluster state to the recovery state. In a specific implementation, the cluster state of the new first cluster copy is modified from the normal state to the recovery state, after which the second cluster copy is switched to and its cluster state is modified from the normal state to the recovery state, as sketched below.
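A minimal sketch of this rollback-then-mark sequence, continuing the Go sketch above; the copy layout and state constants are assumptions (the application treats the copies as opaque persisted state):

```go
// ClusterState is the state recorded in each cluster copy.
type ClusterState int

const (
	StateNormal ClusterState = iota
	StateRecovery
)

// ClusterCopy is one of the two identical persisted copies of the cluster state.
type ClusterCopy struct {
	State ClusterState
	Data  []byte // opaque serialized cluster state
}

// rollbackAndEnterRecovery overwrites the damaged first copy with the intact
// second copy, marks the new first copy as recovering, then switches to the
// second copy and marks it as well.
func rollbackAndEnterRecovery(first, second *ClusterCopy) {
	first.Data = append([]byte(nil), second.Data...) // second copy covers the first
	first.State = StateRecovery                      // new first copy: normal -> recovery
	second.State = StateRecovery                     // then the second copy as well
}
```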
As a preferred embodiment, before modifying the cluster state to the recovery state, the method further comprises: recording error information of the target event in a diagnostic log and storing the error information in local storage. In implementations, the node records the error information of the target event in the diagnostic log and persists it to local storage; the error information may include the memory address of the code segment where the assertion is located (EVENT_FAILED_ADDRESS), the level of the target event (EVENT_FAILED_LEVEL), the content of the target event (EVENT_FAILED), and so on.
After the error occurs in executing the target event, the master node submits a first cluster recovery event whose content is the level of the target event.
As a preferred embodiment, after modifying the cluster state to the recovery state, the method further comprises: determining the target node that submitted the target event. In a specific implementation, the consistency protocol layer locates the service module application layer that sent the target event and the node where it resides, and records this information in the diagnostic log and persistent storage.
Further, after determining the target node that submitted the target event, the method further comprises: generating alarm information based on the node information of the target node and the error information of the target event. In a specific implementation, the consistency protocol layer instructs the service module application layer to report an alarm to the user.
As a preferred embodiment, after generating the alarm information based on the node information of the target node and the error information of the target event, the method further comprises: calling a preset callback and setting the error code of the preset callback to an event commit error, wherein the target node, after receiving the event commit error code, is prohibited from resubmitting the target event. In a specific implementation, the consistency protocol layer invokes the preset callback and sets the callback's error code to an event commit error (EVENT_FAILED_DO_NOT_RESEND); the service module that receives the event commit error must not resend the event, so as not to trigger the same error repeatedly. A sketch of this callback contract follows.
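The callback contract can be sketched as follows, continuing the Go sketch; the error-code enumeration and callback signature are illustrative assumptions whose names merely mirror the constants above:

```go
// ErrCode is the error code delivered through an event's preset callback.
type ErrCode int

const (
	ErrNone                    ErrCode = iota
	EVENT_FAILED_DO_NOT_RESEND         // commit error: the submitter must not resend
	EVENT_SKIP_RESEND                  // skipped during recovery: the submitter may decide to resend
)

// notifyCommitError invokes the preset callback of the failed target event so
// that the submitting node stops resubmitting it.
func notifyCommitError(presetCallback func(ErrCode)) {
	presetCallback(EVENT_FAILED_DO_NOT_RESEND)
}
```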
S102: when the first cluster recovery event submitted by the master node is received, initiating read-write verification to a cluster recovery reserved area of a user volume of which the home node is the node, if the verification is successful, submitting content is a second cluster recovery event of which the recovery of the node is successful, and if the verification is failed, submitting content is a second cluster recovery event of which the recovery of the node is failed;
In this step, after the first cluster recovery event submitted by the master node is successfully submitted, the execution stage is entered, and each node initiates a read-write check to the cluster recovery reserved area of the user volume of which the home node is the node. And submitting a second cluster recovery event after the verification is successful, wherein the second cluster recovery event comprises a flag of successful recovery of the node and an index of the node. And submitting a second cluster recovery event when the verification fails, wherein the second cluster recovery event comprises a flag of the recovery failure of the node and an index of the node.
As a possible implementation, initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node comprises: initiating the read-write verification on the cluster recovery reserved area through the interface by which the host issues read-write events. In a specific implementation, the service module control layer instructs the service module application layer of the present node to initiate host-like IO (Input/Output) to perform read-write verification on the cluster recovery reserved area of all user volumes of the present node; this IO calls the same entry issuing function as host IO, and the read-write completes asynchronously.
As a possible implementation, initiating the read-write verification through the interface by which the host issues read-write events comprises: sending a write request to the cluster recovery reserved area of the user volume whose home node is the present node, so as to write target data into the cluster recovery reserved area; sending a read request to the same cluster recovery reserved area, so as to read data back from it; and judging whether the read data is consistent with the target data, the verification succeeding if so. In a specific implementation, target data is written into the cluster recovery reserved area, data is then read back from the cluster recovery reserved area, and the read data is compared with the written target data; if they are consistent, the verification succeeds, which means the node's service is unaffected.
As a possible implementation, after sending the write request to the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events, the method further comprises: if the write waiting time exceeds a preset time, submitting a second cluster recovery event whose content indicates that recovery of the present node failed. In a specific implementation, if the write waits too long, recovery is considered unsuccessful, and the service module application layer submits a second cluster recovery event containing a flag indicating failed recovery of the node and the index of the node. A sketch of the whole verification, including this timeout path, follows.
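A sketch of the read-write verification including the timeout path, continuing the Go sketch; the Volume interface, the pattern written, and the timeout handling are assumptions (the application only specifies that the IO uses the host's entry function and completes asynchronously):

```go
import (
	"bytes"
	"context"
	"time"
)

// Volume abstracts the same entry issuing function the host uses for IO.
type Volume interface {
	WriteReserved(ctx context.Context, data []byte) error // write into the cluster recovery reserved area
	ReadReserved(ctx context.Context) ([]byte, error)     // read the area back
}

// verifyReservedArea writes target data into the cluster recovery reserved
// area of one user volume owned by this node, reads it back, and compares.
// A write that outlasts the preset time counts as a failed verification.
func verifyReservedArea(vol Volume, presetTime time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), presetTime)
	defer cancel()

	target := []byte("cluster-recovery-verification") // arbitrary target data
	if err := vol.WriteReserved(ctx, target); err != nil {
		return false // timeout or write error: submit a recovery-failed event
	}
	read, err := vol.ReadReserved(ctx)
	if err != nil {
		return false
	}
	return bytes.Equal(read, target) // consistent data means verification succeeded
}
```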
As a preferred embodiment, before initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node, the method further comprises: judging whether the level contained in the first cluster recovery event is consistent with the level recorded by the present node; if consistent, executing the step of initiating the read-write verification; if not consistent, submitting a second cluster recovery event whose content indicates that recovery of the present node failed. In a specific implementation, before executing the first cluster recovery event submitted by the master node, the service module control layer compares the level of the error event recorded in the first cluster recovery event with the level recorded by the present node; if they are consistent, the control layer instructs the service module application layer of the present node to perform the read-write verification, and if not, the application layer submits a second cluster recovery event containing a flag indicating failed recovery of the node and the index of the node.
S103: when a recovery completion event submitted by the master node is received, modifying the cluster state from a recovery state to a normal state; and if the proportion of the number of the nodes which are successfully recovered by the statistics of the master node to the number of the received second cluster recovery events is larger than a preset value, submitting the recovery completion event.
In a specific implementation, the second cluster recovery events sent by all the nodes are converged to the cluster master node, and the master node initiates the submission. The submitted event is submitted to the service module to control the layer to execute, which maintains the event count of the second cluster recovery event, the bit vector of the node which is successful in recovery, and the bit vector of the node which is failed in recovery, all of which are set to zero when the first cluster recovery event is received, and are increased or set when the second cluster recovery event is received. If the total number of N nodes in the cluster is assumed, N second cluster recovery events are executed, the event count of the second cluster recovery events is N, and if the set number of successful recovery is greater than or equal to N/2, the successful recovery is judged. And if the recovery is successful, the service module application layer of the master node sends a recovery completion event. When each node receives the recovery completion event, the cluster state is set to be in a normal state according to the general flow of event processing.
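A sketch of the master-side tally, continuing the Go sketch; the 64-node bit vectors and the majority rule shown (success count >= N/2) follow the description above, while the type and method names are assumptions:

```go
import "math/bits"

// RecoveryTally is kept by the service module control layer; it is zeroed
// when the first cluster recovery event is received.
type RecoveryTally struct {
	count   int    // second cluster recovery events executed so far
	okBits  uint64 // bit vector of nodes that recovered successfully (<= 64 nodes assumed)
	badBits uint64 // bit vector of nodes that failed to recover
}

// Reset zeroes the tally on receipt of the first cluster recovery event.
func (t *RecoveryTally) Reset() { *t = RecoveryTally{} }

// Record applies one second cluster recovery event carrying the node's index
// and its success flag.
func (t *RecoveryTally) Record(nodeIndex uint, recovered bool) {
	t.count++
	if recovered {
		t.okBits |= 1 << nodeIndex
	} else {
		t.badBits |= 1 << nodeIndex
	}
}

// RecoveredMajority reports whether all n events arrived and at least n/2
// nodes recovered; the master's application layer then submits the recovery
// completion event.
func (t *RecoveryTally) RecoveredMajority(n int) bool {
	return t.count == n && bits.OnesCount64(t.okBits)*2 >= n
}
```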
As a possible implementation, the master node controls the corresponding nodes to leave the distributed storage cluster according to the indexes contained in the second cluster recovery events that indicate recovery failure. In a specific implementation, the master node later schedules the nodes that failed to recover to leave the cluster, which triggers switching of the home node of the volumes they own, and cluster recovery ends.
In the embodiment of the application, a cluster recovery reserved area is set aside on each user volume. When an error occurs in executing a target event, instead of having the node exit the cluster as in the related art, new logic is executed: execution of the target event is terminated, the cluster state is modified to the recovery state, each node initiates read-write verification on the cluster recovery reserved area of the user volumes whose home node is that node, and cluster recovery completes when the master node counts a proportion of successfully verified nodes greater than a preset value. The fault handling method provided by the embodiment of the application therefore realizes automatic cluster recovery and improves fault handling and cluster recovery efficiency.
The embodiment of the application discloses a fault handling method; compared with the previous embodiment, this embodiment further describes and optimizes the technical scheme. Specifically:
Referring to fig. 3, a flowchart of another fault handling method is shown according to an exemplary embodiment, as shown in fig. 3, including:
S201: when an error occurs in executing a target event, terminating the execution of the target event, recording error information of the target event in a diagnosis log, and storing the error information into a local storage; the error information comprises any one or a combination of any of a code segment memory address where the assertion is located, the level of the target event and the content of the target event;
S202: covering the first cluster copy by using a second cluster copy to obtain a new first cluster copy, modifying the cluster state of the new first cluster copy into a recovery state, switching to the second cluster copy, and modifying the cluster state of the second cluster copy into the recovery state; the method comprises the steps that a master node submits a first cluster recovery event after executing the target event and generating an error;
S203: when an event to be executed is received, determining the cluster state;
In this embodiment, the cluster state is checked before each event is executed; when the cluster state is the recovery state, the flow proceeds to S204, and when the cluster state is the normal state, the flow proceeds to S209.
S204: if the cluster state is the recovery state, judging whether the event to be executed is the first cluster recovery event; if yes, proceeding to S205; if not, proceeding to S208;
S205: judging whether the level contained in the first cluster recovery event is consistent with the level recorded by the present node; if yes, proceeding to S206; if not, proceeding to S207;
S206: initiating read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node; if the verification succeeds, submitting a second cluster recovery event whose content indicates that recovery of the present node succeeded, and if the verification fails, submitting a second cluster recovery event whose content indicates that recovery of the present node failed;
S207: submitting a second cluster recovery event whose content indicates that recovery of the present node failed;
S208: skipping execution of the event to be executed;
In a specific implementation, when the cluster state is the recovery state, it is judged whether the event to be executed is the first cluster recovery event; if not, the event's specific logic is skipped, its callback is executed directly, and the error code is set to EVENT_SKIP_RESEND. Events that follow the problem event are thus skipped, avoiding cluster state modifications before the influence of the problem event has been completely eliminated. After receiving the callback, the corresponding service module application layer can decide, according to its own requirements, whether to resend the event. This per-event dispatch is sketched after S209 below.
S209: and if the cluster state is the normal state, directly executing the event to be executed.
The following describes a fault handling apparatus according to an embodiment of the present application, and a fault handling apparatus described below and a fault handling method described above may be referred to each other.
Referring to fig. 4, a structure diagram of a fault handling apparatus according to an exemplary embodiment is shown, as shown in fig. 4, including:
a termination module 100, configured to terminate execution of a target event when an error occurs in executing the target event, and to modify the cluster state to a recovery state, wherein the master node submits a first cluster recovery event after the error occurs in executing the target event;
a verification module 200, configured to initiate, when the first cluster recovery event submitted by the master node is received, read-write verification on the cluster recovery reserved area of each user volume whose home node is the present node, to submit a second cluster recovery event whose content indicates that recovery of the present node succeeded if the verification succeeds, and to submit a second cluster recovery event whose content indicates that recovery of the present node failed if the verification fails;
a modification module 300, configured to modify the cluster state from the recovery state to a normal state when a recovery completion event submitted by the master node is received, wherein the master node counts, among the received second cluster recovery events, the proportion of nodes that recovered successfully, and submits the recovery completion event if the proportion is greater than a preset value.
In the embodiment of the application, a cluster recovery reserved area is set aside on each user volume. When an error occurs in executing a target event, instead of having the node exit the cluster as in the related art, new logic is executed: execution of the target event is terminated, the cluster state is modified to the recovery state, each node initiates read-write verification on the cluster recovery reserved area of the user volumes whose home node is that node, and cluster recovery completes when the master node counts a proportion of successfully verified nodes greater than a preset value. The fault handling apparatus provided by the embodiment of the application therefore realizes automatic cluster recovery and improves fault handling and cluster recovery efficiency.
Based on the above embodiment, as a preferred implementation, the termination module 100 is specifically configured to: modify the cluster state to the recovery state when a code assertion fails while the target event is executed on the first cluster copy.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
an overwrite module, configured to overwrite the first cluster copy with the second cluster copy to obtain a new first cluster copy.
Based on the above embodiment, as a preferred implementation, the termination module 100 is specifically configured to: modify the cluster state of the new first cluster copy to the recovery state, then switch to the second cluster copy and modify its cluster state to the recovery state.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
a recording module, configured to record the error information of the target event in the diagnostic log and store the error information in local storage.
Based on the above embodiment, as a preferred implementation, the error information includes any one or any combination of the memory address of the code segment where the assertion is located, the level of the target event, and the content of the target event.
Based on the above embodiment, as a preferred implementation, after the error occurs in executing the target event, the master node submits a first cluster recovery event whose content is the level of the target event;
correspondingly, the apparatus further comprises:
a first judging module, configured to judge whether the level contained in the first cluster recovery event is consistent with the level recorded by the present node; if consistent, to start the workflow of the verification module 200; and if not consistent, to submit a second cluster recovery event whose content indicates that recovery of the present node failed.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
a first determining module, configured to determine the target node that submitted the target event.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
an alarm module, configured to generate alarm information based on the node information of the target node and the error information of the target event.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
a callback module, configured to call a preset callback and set the error code of the preset callback to an event commit error, wherein the target node, after receiving the event commit error code, is prohibited from resubmitting the target event.
On the basis of the above embodiment, as a preferred implementation manner, the method further includes:
a second determining module, configured to determine the cluster state when an event to be executed is received; if the cluster state is the recovery state, to start the workflow of the second judging module; and if the cluster state is the normal state, to directly execute the event to be executed;
a second judging module, configured to judge whether the event to be executed is the first cluster recovery event; if yes, to start the workflow of the verification module 200; and if not, to skip execution of the event to be executed.
Based on the above embodiment, as a preferred implementation, the verification module 200 is specifically configured to: initiate the read-write verification on the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events.
Based on the above embodiment, as a preferred implementation, the verification module 200 is specifically configured to: send a write request to the cluster recovery reserved area of the user volume whose home node is the present node through the interface by which the host issues read-write events, so as to write target data into the cluster recovery reserved area; send a read request to the same cluster recovery reserved area, so as to read data back from it; and judge whether the read data is consistent with the target data, the verification succeeding if so.
Based on the above embodiment, as a preferred implementation, the verification module 200 is further configured to: submit a second cluster recovery event whose content indicates that recovery of the present node failed if the write waiting time exceeds a preset time.
Based on the above embodiment, as a preferred implementation, the second cluster recovery event further includes the index of the node, and the master node controls the corresponding nodes to leave the distributed storage cluster according to the indexes contained in the second cluster recovery events that indicate recovery failure.
The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and will not be repeated here.
Based on the hardware implementation of the program modules, and in order to implement the method of the embodiment of the present application, an embodiment of the present application further provides an electronic device. Fig. 5 is a block diagram of an electronic device according to an exemplary embodiment; as shown in Fig. 5, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
a processor 2, connected to the communication interface 1 to realize information interaction with other devices, and configured to execute the fault handling method provided by one or more of the above technical schemes when running a computer program, the computer program being stored in the memory 3.
Of course, in practice, the various components in the electronic device are coupled together by a bus system 4. It will be appreciated that the bus system 4 is used to enable connected communications between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. But for clarity of illustration the various buses are labeled as bus system 4 in fig. 5.
The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be volatile or non-volatile memory, and may include both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to include, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the processor 2 or implemented by the processor 2. The processor 2 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 2 or by instructions in the form of software. The processor 2 described above may be a general purpose processor, DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the application can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 3 and the processor 2 reads the program in the memory 3 to perform the steps of the method described above in connection with its hardware.
The corresponding flow in each method of the embodiments of the present application is implemented when the processor 2 executes the program, and for brevity, will not be described in detail herein.
In an exemplary embodiment, the present application also provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program executable by the processor 2 for performing the steps of the method described above. The computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, CD-ROM, etc.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by program instructions and related hardware. The foregoing program may be stored in a computer-readable storage medium, and when executed, performs the steps of the above method embodiments. The foregoing storage medium includes a removable storage device, a ROM, a RAM, a magnetic disk, an optical disc, or any other medium capable of storing program code.
Alternatively, the above integrated unit of the present application may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as an independent product. Based on such an understanding, the technical solutions of the embodiments of the present application may be embodied essentially in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes a removable storage device, a ROM, a RAM, a magnetic disk, an optical disc, or any other medium capable of storing program code.
The foregoing is merely a specific embodiment of the present application, and the present application is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present application shall fall within the protection scope of the present application.

Claims (20)

1. A fault handling method, applied to a node in a distributed storage cluster, the method comprising:
When an error occurs in executing a target event, terminating the execution of the target event and modifying the cluster state to a recovery state; wherein a master node submits a first cluster recovery event after the error occurs in executing the target event;
When the first cluster recovery event submitted by the master node is received, initiating a read-write verification on the cluster recovery reserved area of a user volume whose home node is the current node; if the verification succeeds, submitting a second cluster recovery event whose content indicates that recovery of the current node succeeded, and if the verification fails, submitting a second cluster recovery event whose content indicates that recovery of the current node failed;
When a recovery completion event submitted by the master node is received, modifying the cluster state from the recovery state to a normal state; wherein the master node counts the proportion of the number of successfully recovered nodes to the number of received second cluster recovery events, and submits the recovery completion event if the proportion is greater than a preset value.
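For illustration only, the following minimal Go sketch models the node-side flow of claim 1. Every name in it (recovery, Node, ClusterState, the submit and verify callbacks, the event strings) is an assumption made for readability; the patent does not disclose an implementation.

```go
package recovery

// ClusterState mirrors the two cluster states the claims distinguish.
type ClusterState int

const (
	StateNormal ClusterState = iota
	StateRecovery
)

// Node models a cluster node; isMaster marks the master node.
type Node struct {
	state    ClusterState
	isMaster bool
}

// onTargetEventError terminates the failed target event by moving the
// cluster state to the recovery state; the master node additionally
// submits the first cluster recovery event.
func (n *Node) onTargetEventError(submit func(event string)) {
	n.state = StateRecovery
	if n.isMaster {
		submit("first-cluster-recovery")
	}
}

// onFirstClusterRecovery runs the read-write verification against the
// cluster recovery reserved area of the user volumes homed on this node
// and reports the outcome as a second cluster recovery event.
func (n *Node) onFirstClusterRecovery(verify func() bool, submit func(event string)) {
	if verify() {
		submit("second-cluster-recovery: success")
	} else {
		submit("second-cluster-recovery: failure")
	}
}

// onRecoveryComplete returns the cluster state to normal once the master
// node has submitted the recovery completion event.
func (n *Node) onRecoveryComplete() {
	n.state = StateNormal
}
```

The later sketches in this section extend this one and reuse its types.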
2. The fault handling method according to claim 1, wherein terminating the execution of the target event when an error occurs in executing the target event comprises:
terminating the execution of the target event when a code assertion occurs while the target event is executed on a first cluster copy.
3. The fault handling method according to claim 2, further comprising, before modifying the cluster state to the recovery state:
overwriting the first cluster copy with a second cluster copy to obtain a new first cluster copy.
4. The fault handling method according to claim 3, wherein modifying the cluster state to the recovery state comprises:
modifying the cluster state of the new first cluster copy to the recovery state, switching to the second cluster copy, and modifying the cluster state of the second cluster copy to the recovery state.
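Claims 3 and 4 can be pictured with the small sketch below, which extends the Go sketch after claim 1; ClusterCopy and its fields are illustrative assumptions, not the patent's data layout.

```go
// ClusterCopy models one of the two consistent copies of the cluster state.
type ClusterCopy struct {
	state ClusterState
	data  []byte
}

// recoverCopies overwrites the first copy (on which the assertion fired)
// with the second copy, then marks both copies as being in recovery.
func recoverCopies(first, second *ClusterCopy) {
	first.data = append([]byte(nil), second.data...) // claim 3: overwrite first copy
	first.state = StateRecovery                      // claim 4: new first copy recovers
	second.state = StateRecovery                     // claim 4: switch to second copy
}
```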
5. The fault handling method according to claim 2, further comprising, before modifying the cluster state to the recovery state:
recording error information of the target event in a diagnostic log, and saving the error information to a local storage.
6. The fault handling method according to claim 5, wherein the error information comprises any one of, or a combination of, the memory address of the code segment where the assertion is located, the level of the target event, and the content of the target event.
7. The fault handling method according to claim 5, wherein the master node submits, after the error occurs in executing the target event, a first cluster recovery event whose content is the level of the target event;
correspondingly, before initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node, the method further comprises:
judging whether the level contained in the first cluster recovery event is consistent with the level recorded by the current node;
if consistent, executing the step of initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node;
if inconsistent, submitting a second cluster recovery event whose content indicates that recovery of the current node failed.
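One possible reading of claim 7, continuing the sketch above; the integer levels and helper names are assumptions.

```go
// handleFirstRecoveryWithLevel runs the read-write verification only when
// the level carried by the first cluster recovery event matches the level
// this node recorded in its diagnostic log; otherwise it immediately
// reports a recovery failure for this node.
func (n *Node) handleFirstRecoveryWithLevel(eventLevel, recordedLevel int,
	verify func() bool, submit func(event string)) {
	if eventLevel != recordedLevel {
		submit("second-cluster-recovery: failure")
		return
	}
	n.onFirstClusterRecovery(verify, submit)
}
```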
8. The fault handling method according to claim 1, further comprising, after modifying the cluster state to the recovery state:
determining the target node that submitted the target event.
9. The fault handling method according to claim 8, further comprising, after determining the target node that submitted the target event:
generating alarm information based on the node information of the target node and the error information of the target event.
10. The fault handling method according to claim 9, further comprising, after generating the alarm information based on the node information of the target node and the error information of the target event:
calling a preset callback and setting the error code of the preset callback to an event submission error; wherein the target node, after receiving the error code of the event submission error, is prohibited from resubmitting the target event.
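A thin sketch of claim 10's callback, with assumed names; ErrEventSubmission is a hypothetical stand-in for whatever error code an implementation would reserve for "event submission error".

```go
// ErrEventSubmission is an assumed error code meaning "event submission error".
const ErrEventSubmission = 1

// notifyTargetNode invokes the preset callback with the event submission
// error code; on receipt, the target node must not resubmit the target event.
func notifyTargetNode(presetCallback func(errCode int)) {
	presetCallback(ErrEventSubmission)
}
```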
11. The fault handling method according to claim 1, further comprising:
when an event to be executed is received, determining the cluster state;
if the cluster state is the recovery state, judging whether the event to be executed is the first cluster recovery event;
if so, executing the step of initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node.
12. The fault handling method according to claim 11, further comprising, after determining the cluster state:
if the cluster state is the normal state, directly executing the event to be executed.
13. The fault handling method according to claim 11, further comprising, after judging whether the event to be executed is the first cluster recovery event:
if the event to be executed is not the first cluster recovery event, skipping the execution of the event to be executed.
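Claims 11 to 13 together define a dispatch rule for incoming events. The sketch below, continuing the earlier Go fragments, is one way to express it; the string-typed events and callback names remain assumptions.

```go
// dispatch applies the rule of claims 11-13: in the normal state an event
// executes directly; in the recovery state only the first cluster recovery
// event triggers the read-write verification, and every other event is
// skipped.
func (n *Node) dispatch(event string, execute func(string), startVerification func()) {
	switch n.state {
	case StateNormal:
		execute(event) // claim 12: execute directly in the normal state
	case StateRecovery:
		if event == "first-cluster-recovery" {
			startVerification() // claim 11: begin the read-write verification
		}
		// claim 13: any other event is skipped while the cluster recovers
	}
}
```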
14. The fault handling method according to claim 1, wherein initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node comprises:
initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node through an interface of the host for issuing read-write events.
15. The fault handling method according to claim 14, wherein initiating the read-write verification on the cluster recovery reserved area of the user volume whose home node is the current node through the interface of the host for issuing read-write events comprises:
sending, through the interface of the host for issuing read-write events, a write request to the cluster recovery reserved area of the user volume whose home node is the current node, so as to write target data into the cluster recovery reserved area;
sending, through the interface of the host for issuing read-write events, a read request to the cluster recovery reserved area of the user volume whose home node is the current node, so as to read data from the cluster recovery reserved area;
judging whether the read data is consistent with the target data; if so, the verification succeeds.
16. The fault handling method according to claim 15, further comprising, after sending the write request to the cluster recovery reserved area of the user volume whose home node is the current node through the interface of the host for issuing read-write events:
if the waiting time of the write exceeds a preset time, submitting a second cluster recovery event whose content indicates that recovery of the current node failed.
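Claims 15 and 16 describe the verification itself: write known data through the host's read-write interface, read it back, compare, and treat an overlong write wait as a failure. The sketch below, assumed to sit in the same illustrative package as the earlier fragments, invents a ReservedArea interface standing in for that host I/O path.

```go
import (
	"bytes"
	"time"
)

// ReservedArea stands in for the host interface used to issue read-write
// events to the cluster recovery reserved area of a user volume (assumed).
type ReservedArea interface {
	Write(p []byte) error
	Read(p []byte) error
}

// verifyReservedArea writes a known pattern, reads it back and compares
// (claim 15); a write that waits longer than the preset time fails (claim 16).
func verifyReservedArea(area ReservedArea, presetTime time.Duration) bool {
	target := []byte("cluster-recovery-probe")
	done := make(chan error, 1)
	go func() { done <- area.Write(target) }()
	select {
	case err := <-done:
		if err != nil {
			return false
		}
	case <-time.After(presetTime):
		return false // claim 16: write wait exceeded the preset time
	}
	got := make([]byte, len(target))
	if err := area.Read(got); err != nil {
		return false
	}
	return bytes.Equal(got, target)
}
```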
17. The fault handling method according to claim 1, wherein the second cluster recovery event further comprises the index of the current node, and the master node controls, according to the index contained in each second cluster recovery event indicating a recovery failure, the corresponding node to leave the distributed storage cluster.
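Finally, the master-node side of claim 1 combined with claim 17: tally the second cluster recovery events, submit the recovery completion event once the success ratio exceeds the preset value, and evict the nodes that reported failure. SecondRecoveryEvent and the callbacks are assumptions.

```go
// SecondRecoveryEvent models the per-node recovery report of claim 1,
// extended with the node index of claim 17.
type SecondRecoveryEvent struct {
	NodeIndex int
	Recovered bool
}

// tallyRecovery counts the proportion of successfully recovered nodes among
// the received second cluster recovery events, submits the recovery
// completion event when that proportion exceeds presetValue, and asks the
// cluster to evict every node whose event reported a failure.
func tallyRecovery(events []SecondRecoveryEvent, presetValue float64,
	submitCompletion func(), evict func(nodeIndex int)) {
	if len(events) == 0 {
		return
	}
	recovered := 0
	for _, e := range events {
		if e.Recovered {
			recovered++
		} else {
			evict(e.NodeIndex) // claim 17: remove unrecovered nodes
		}
	}
	if float64(recovered)/float64(len(events)) > presetValue {
		submitCompletion()
	}
}
```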
18. A fault handling apparatus, applied to a node in a distributed storage cluster, the apparatus comprising:
a termination module, configured to terminate the execution of a target event when an error occurs in executing the target event, and modify the cluster state to a recovery state; wherein a master node submits a first cluster recovery event after the error occurs in executing the target event;
a verification module, configured to initiate, when the first cluster recovery event submitted by the master node is received, a read-write verification on the cluster recovery reserved area of a user volume whose home node is the current node; if the verification succeeds, submit a second cluster recovery event whose content indicates that recovery of the current node succeeded, and if the verification fails, submit a second cluster recovery event whose content indicates that recovery of the current node failed;
a modification module, configured to modify the cluster state from the recovery state to a normal state when a recovery completion event submitted by the master node is received; wherein the master node counts the proportion of the number of successfully recovered nodes to the number of received second cluster recovery events, and submits the recovery completion event if the proportion is greater than a preset value.
19. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the fault handling method according to any one of claims 1 to 17 when executing the computer program.
20. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the fault handling method according to any one of claims 1 to 17.
CN202410504909.4A 2024-04-25 2024-04-25 Fault processing method and device, electronic equipment and storage medium Pending CN118093250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410504909.4A CN118093250A (en) 2024-04-25 2024-04-25 Fault processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118093250A true CN118093250A (en) 2024-05-28

Family

ID=91163290

Country Status (1)

Country Link
CN (1) CN118093250A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070006015A1 (en) * 2005-06-29 2007-01-04 Rao Sudhir G Fault-tolerance and fault-containment models for zoning clustered application silos into continuous availability and high availability zones in clustered systems during recovery and maintenance
CN111309524A (en) * 2020-02-14 2020-06-19 苏州浪潮智能科技有限公司 Distributed storage system fault recovery method, device, terminal and storage medium
CN113535474A (en) * 2021-06-30 2021-10-22 重庆紫光华山智安科技有限公司 Method, system, medium and terminal for automatically repairing heterogeneous cloud storage cluster fault

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination