CN113726553A - Node fault recovery method and device, electronic equipment and readable storage medium - Google Patents

Node fault recovery method and device, electronic equipment and readable storage medium

Info

Publication number
CN113726553A
CN113726553A (application CN202110864473.6A)
Authority
CN
China
Prior art keywords
target node
node
fault
service
failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110864473.6A
Other languages
Chinese (zh)
Inventor
孙俊逸
颜秉珩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202110864473.6A priority Critical patent/CN113726553A/en
Publication of CN113726553A publication Critical patent/CN113726553A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a node fault recovery method, a node fault recovery device, an electronic device and a computer-readable storage medium, wherein the method comprises the following steps: monitoring the process state of the containerized service process in each service node; if a target node whose process state indicates a failure is detected, locking the target node; identifying the fault type of the target node and performing fault recovery on the target node according to the fault type; and performing data synchronization on the target node by using the distributed message queue, and bringing the target node back online after the data synchronization is completed. The process state is monitored, and node locking, fault type identification and fault recovery are carried out when a target node is detected. After the fault is recovered, the target node is brought back online once data synchronization is finished, so the whole fault repair process is automatic: operation and maintenance personnel do not need to repair the fault manually, the fault repair efficiency is improved, and the operation and maintenance workload is reduced.

Description

Node fault recovery method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of distributed cluster technologies, and in particular, to a node failure recovery method, a node failure recovery apparatus, an electronic device, and a computer-readable storage medium.
Background
As an important component of the big data ecosystem, a distributed message queue system constructed from components such as Kafka (for example, a Kafka service cluster) generally carries a large number of data flow tasks, and its stability is crucial to the stable operation of the services that depend on it. Although such a cluster has a certain degree of high availability thanks to its distributed architecture, the failure of a service node in the cluster still affects the actual service. Because the cluster has no self-correction mechanism, a failed node must be repaired by manual intervention, which leads to a large operation and maintenance workload and increased operation and maintenance cost.
Therefore, how to reduce the large operation and maintenance workload in the related art is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, an object of the present application is to provide a node failure recovery method, a node failure recovery apparatus, an electronic device, and a computer-readable storage medium that improve failure recovery efficiency and reduce the operation and maintenance workload.
In order to solve the above technical problem, the present application provides a node failure recovery method, including:
monitoring the process state of the containerized service process in each service node;
if a target node whose process state indicates a failure is detected, locking the target node;
identifying the fault type of the target node, and performing fault recovery on the target node according to the fault type;
and performing data synchronization on the target node by using the distributed message queue, and bringing the target node back online after the data synchronization is completed.
Optionally, the identifying the fault type of the target node includes:
performing network test on the target node to obtain a first test result;
if the first test result shows that the network fails, determining that the failure type is the network failure;
and if the first test result shows that the network is normal, determining the fault type as a storage fault.
Optionally, if the fault type is a network fault, the performing fault recovery on the target node according to the fault type includes:
restarting the network card service of the target node, and performing network test to obtain a second test result;
and if the second test result shows that the network is normal, restarting the containerized service process and determining that the fault recovery is completed.
Optionally, if the failure type is a storage failure, performing failure recovery on the target node according to the failure type includes:
determining a storage fault reason of the target node;
if the storage failure reason is that the storage space is insufficient, expanding the storage space of the target node by using a first target disk, and restarting the containerized service process;
and if the storage failure reason is that the data disk is missing, replacing the missing data disk by using a second target disk, opening the cluster internal communication port of the target node, synchronizing the missing data with other service nodes by using the cluster internal communication port, and restarting the containerized service process.
Optionally, the performing data synchronization on the target node by using a distributed message queue includes:
opening a cluster internal communication port of the target node;
acquiring target data in the distributed message queue from other service nodes by using the cluster internal communication port to complete data synchronization; and the target data is newly added data in the fault time period of the target node.
Optionally, the locking the target node includes:
closing the cluster internal communication port and the external communication port of the target node;
broadcasting an offline notification for stopping data interaction with the target node to other service nodes in the cluster;
correspondingly, the bringing the target node back online after the data synchronization is completed includes:
and opening the external communication port and allowing the other service nodes to perform data interaction with the target node.
Optionally, the method further comprises:
and reporting the fault information and/or the fault recovery information.
The present application further provides a node failure recovery apparatus, including:
the monitoring module is used for monitoring the process state of the containerized service process in each service node;
the locking module is used for locking the target node if a target node whose process state indicates a failure is detected;
the recovery module is used for identifying the fault type of the target node and performing fault recovery on the target node according to the fault type;
and the unlocking module is used for performing data synchronization on the target node by using the distributed message queue and bringing the target node back online after the data synchronization is completed.
The present application further provides an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the above node failure recovery method.
The present application also provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the above-described node failure recovery method.
The node fault recovery method provided by the application monitors the process state of the containerized service process in each service node; locks a target node if a target node whose process state indicates a failure is detected; identifies the fault type of the target node and performs fault recovery on the target node according to the fault type; and performs data synchronization on the target node by using the distributed message queue, bringing the target node back online after the data synchronization is completed.
Therefore, in this method, a containerized service process that provides the distributed message queue service is deployed on each service node. Because the process is containerized, the service process is isolated from the operating system on the service node, so the service node can still be controlled for failure recovery after the service process fails. By monitoring the process state, a service node whose process state indicates a failure is identified as the target node and locked, which takes the target node offline from the cluster and avoids problems such as data inconsistency that would result if the target node kept working. Because different faults must be repaired in different ways, the fault type of the target node is identified first, and the fault is then repaired in the way corresponding to that type. During the repair, the rest of the cluster continues to provide service, so to ensure data consistency the distributed message queue is used to synchronize data to the target node, and the target node is brought back online only after the synchronization is completed, after which it can continue to provide service. In short, the process state is monitored, and node locking, fault type identification and fault recovery are carried out when a target node is detected; after the fault is recovered and the data is synchronized, the target node goes back online. The whole fault repair process is automatic, no manual repair by operation and maintenance personnel is needed, the fault repair efficiency is improved, the operation and maintenance workload is reduced, and the problem of the large operation and maintenance workload in the related art is solved.
In addition, the application also provides a node fault recovery device, an electronic device and a computer-readable storage medium, which have the same beneficial effects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a node failure recovery method according to an embodiment of the present application;
fig. 2 is a flowchart of a specific node failure recovery method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a node failure recovery apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart illustrating a node failure recovery method according to an embodiment of the present disclosure.
The method comprises the following steps:
s101: and monitoring the process state of the containerized service process in each service node.
A service node is a node in the cluster that can provide the distributed message queue service externally. It should be noted that some or all of the steps in this application may be executed by a master node in the distributed cluster. The master node may be a non-service node dedicated to failure recovery for the cluster, or a designated service node that, while providing service externally, also monitors and recovers itself and the other service nodes in the cluster.
A service process, specifically a distributed message queue service process, is deployed in each service node in the cluster. With the service process, the service node can provide the distributed message queue service externally. This embodiment does not limit the specific type of the service node; the types of the service node and the cluster differ when the cluster is built from different components. For example, the Kafka component is an open-source component used to construct a high-throughput distributed publish-subscribe messaging system. It is an important component for data transmission in the big data ecosystem; a distributed publish-subscribe message system built with it is collectively called a Kafka service cluster, and such a cluster generally has multiple service nodes, called brokers or Kafka brokers.
A Kafka broker is specifically used to store Kafka's temporary data and is also associated with the topics in the cluster. In a cluster built from Kafka components, a topic serves as the object through which data is subscribed to and consumed.
The containerized service process is the result of processing the service process with containerization technology. It should be noted that the service node does not deploy the service process directly; instead, it deploys the containerized service process obtained by containerizing the service process, which completes the deployment. Containerization technology allows developers to package an application as an independent container for deployment onto a device. By deploying a containerized service process, the service process that provides the service is isolated at the software level from the operating system processes that maintain the basic operation of the service node. As a result, even if the containerized service process fails and the service node can no longer provide service externally, the node itself keeps running: it can still respond to control instructions, which provides the basis for automatic node failure recovery.
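As an illustration only, the following minimal Python sketch shows how such a containerized service process could be started on a service node. It assumes Docker is installed, that a broker image tagged kafka-broker:latest (a hypothetical name) has been prepared with the desired configuration, and that ports 9092 and 9093 serve as the external service port and the cluster-internal communication port; these names and numbers are assumptions, not part of the original disclosure.

    import subprocess

    def start_containerized_broker(node_id: int, data_dir: str) -> None:
        """Run the broker service process inside a container, isolated from the host OS."""
        subprocess.run(
            [
                "docker", "run", "-d",
                "--name", f"kafka-broker-{node_id}",
                "--restart", "no",              # recovery is driven by the monitor, not by Docker
                "-p", "9092:9092",              # external service port (assumed)
                "-p", "9093:9093",              # cluster-internal communication port (assumed)
                "-v", f"{data_dir}:/var/lib/kafka/data",
                "kafka-broker:latest",          # hypothetical image tag
            ],
            check=True,
        )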
The process state indicates the running condition of the containerized service process. By monitoring the process state, the running condition of the containerized service process on each service node can be determined. Since a service node needs the containerized service process to provide service externally, the process state directly indicates whether the service node is able to provide service, and a service node that cannot provide service externally is a failed service node.
This embodiment does not limit the specific way the process state is monitored. In one implementation, the process state may be a state parameter in the service node. In this case, monitoring the process state means reading the current state parameter in the service node and comparing it with the set of state parameters observed when the process runs normally. If they match, the process state is determined to be normal; otherwise, it is determined to be faulty. In another implementation, the service node may automatically close a process that cannot run normally, so when the service process runs abnormally the service node closes it automatically. In this case, monitoring the process state means detecting whether the containerized service process still exists in the service node. Specifically, the process list of the service node may be matched against unique identity information (e.g., a process number) of the containerized service process; if the match succeeds, the containerized service process is running normally and the process state is normal, otherwise it is faulty.
It can be understood that each service node may be monitored according to a preset period, for example once per hour; alternatively, monitoring may be performed in real time, i.e., the next round of monitoring starts immediately after the previous round finishes.
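For illustration, a minimal monitoring sketch in Python is given below. It assumes each broker runs in a container named kafka-broker-<id> as in the previous sketch, so checking the container state stands in for matching the process list against the containerized service process; when driven from a master node the docker call would be wrapped in SSH. The names and the hourly period are assumptions.

    import subprocess
    import time

    def process_state_ok(node_id: int) -> bool:
        """Return True if the containerized service process on the node is running."""
        result = subprocess.run(
            ["docker", "inspect", "-f", "{{.State.Running}}", f"kafka-broker-{node_id}"],
            capture_output=True, text=True,
        )
        return result.returncode == 0 and result.stdout.strip() == "true"

    def monitor(node_ids, period_seconds=3600):
        """Poll every service node on a preset period and yield the ids of failed target nodes."""
        while True:
            for node_id in node_ids:
                if not process_state_ok(node_id):
                    yield node_id          # a target node whose process state is faulty
            time.sleep(period_seconds)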
S102: and if the process state is detected to be the failed target node, locking the target node.
And if the process state of a certain service node is a fault, determining the service node as a target node. And the target node refers to the service node with the fault. In order to ensure the data consistency of the cluster and avoid the target node from influencing the work of other normal service nodes, the target node needs to be locked and is taken off the cluster in a locking mode so as to carry out fault repair.
The specific way of locking the target node may differ depending on the specific type of the service node. In one embodiment, the service node is a Kafka broker, and locking the target node may include:
Step 11: Close the cluster-internal communication port and the external communication port of the target node.
Step 12: Broadcast, to the other service nodes in the cluster, an offline notification instructing them to stop data interaction with the target node.
The cluster-internal communication port is the port used for data interaction with other service nodes; data synchronization between service nodes is performed through this port. The external communication port is the port through which service is provided externally. Since the target node has failed, it cannot provide service normally and the data on it becomes untrustworthy. In this case, to avoid affecting the operation of the other service nodes in the cluster, its cluster-internal communication port and external communication port need to be closed. At the same time, an offline notification instructing the other nodes to stop data interaction with the target node is broadcast in the cluster, so that the other service nodes know the target node has failed and do not synchronize data with it.
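A minimal sketch of this locking step, for illustration: it assumes the external service port is 9092 and the cluster-internal communication port is 9093, that the ports are closed with iptables over SSH, and that the other service nodes expose a hypothetical /offline-notice HTTP endpoint through which the offline notification is broadcast; none of these details are specified by the original disclosure.

    import subprocess
    import urllib.request

    SERVICE_PORT, INTERNAL_PORT = 9092, 9093      # assumed port layout

    def lock_target_node(target_host: str, peer_hosts: list[str]) -> None:
        # Close both the cluster-internal communication port and the external service port.
        for port in (INTERNAL_PORT, SERVICE_PORT):
            subprocess.run(
                ["ssh", target_host, "iptables", "-A", "INPUT",
                 "-p", "tcp", "--dport", str(port), "-j", "DROP"],
                check=True,
            )
        # Broadcast an offline notification so the peers stop data interaction with the target.
        for peer in peer_hosts:
            req = urllib.request.Request(
                f"http://{peer}:8080/offline-notice",      # hypothetical control endpoint
                data=target_host.encode(), method="POST",
            )
            urllib.request.urlopen(req, timeout=5)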
S103: and identifying the fault type of the target node, and performing fault recovery on the target node according to the fault type.
It will be appreciated that there may be a variety of reasons for a service node failure, and that different types of failures may need to be recovered in different ways. Therefore, before the target node is failed to recover, the failure type of the target node needs to be identified first. For the specific identification mode of the fault type, in one implementation, the working log of the target node may be analyzed to obtain the fault type; in another embodiment, various detections may be performed on the target node, and the factors that the target node affects the containerized service process are identified by the detection.
In particular, for a Kafka cluster, a service node failure is usually caused by a network failure or a storage failure. Accordingly, identifying the failure type of the target node may include the following steps:
Step 21: Perform a network test on the target node to obtain a first test result.
Step 22: If the first test result indicates a network failure, determine that the failure type is a network failure.
Step 23: If the first test result indicates that the network is normal, determine that the failure type is a storage failure.
Since target-node failures generally fall into these two types, determining the failure type amounts to checking whether the network is normal or whether data storage is normal. In this embodiment, a network test may be performed on the target node to determine whether its network connection is normal; the specific form of the network test is not limited. The network test yields a first test result reflecting the network state of the target node. If the first test result indicates a network failure, the failure type is a network failure, i.e., the network failure caused the target node to fail; if the first test result indicates that the network is normal, the failure type is a storage failure, i.e., abnormal data storage caused the target node to fail.
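For illustration, the following sketch reduces the network test to a simple ping reachability check from the monitoring node, which is only one possible form of the test; anything other than a network failure is treated as a storage failure, mirroring steps 21 to 23.

    import subprocess

    def identify_fault_type(target_host: str) -> str:
        # First test: a basic reachability check; a failed ping is taken as a network failure.
        first_test = subprocess.run(
            ["ping", "-c", "3", "-W", "2", target_host],
            capture_output=True,
        )
        return "network" if first_test.returncode != 0 else "storage"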
Further, after the fault type is determined, if the fault type is a network fault, performing fault recovery on the target node according to the fault type may include the following steps:
Step 31: Restart the network card service of the target node, and perform a network test to obtain a second test result.
Step 32: If the second test result indicates that the network is normal, restart the containerized service process and determine that the fault recovery is completed.
Because network problems are usually caused by the network card, after the fault type is determined to be a network fault, the network state of the target node can be refreshed by restarting the network card service. Specifically, restarting the network card service may be a software-only restart, or the network card hardware may also be restarted during the software restart, i.e., the whole restart includes both a software restart and a hardware restart. After the network card service is restarted, the network test is performed again to judge whether the network connection of the restarted target node is normal; the result of this test is the second test result. If the second test result indicates that the network is normal, the containerized service process can be restarted to restore the state before the fault occurred, and the fault recovery is determined to be completed. This embodiment does not limit the operation performed when the second test result still indicates a network fault: for example, the fault may be reported and manual intervention requested, or the network card service may be restarted again, or the network test may be repeated after waiting a preset time.
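A sketch of this network-fault branch, assuming the node's network card service is managed by systemd under a unit named networking (the unit name is distribution dependent), that the node can still be reached over SSH or an out-of-band channel to issue the restart, and that the broker runs in the container from the earlier sketch.

    import subprocess
    import time

    def recover_network_fault(target_host: str, node_id: int, retries: int = 3) -> bool:
        for _ in range(retries):
            # Restart the network card service of the target node (software restart).
            subprocess.run(["ssh", target_host, "systemctl", "restart", "networking"], check=False)
            time.sleep(10)
            # Second network test: check whether the connection is back to normal.
            second_test = subprocess.run(["ping", "-c", "3", "-W", "2", target_host],
                                         capture_output=True)
            if second_test.returncode == 0:
                # Network is normal: restart the containerized service process.
                subprocess.run(["ssh", target_host, "docker", "restart",
                                f"kafka-broker-{node_id}"], check=True)
                return True          # fault recovery is considered complete
        return False                 # e.g. report the fault and request manual intervention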
Further, if the failure type is a storage failure, the process of performing failure recovery on the target node according to the failure type may include the following steps:
step 41: and determining the storage fault reason of the target node.
Step 42: and if the storage failure reason is that the storage space is insufficient, expanding the storage space of the target node by using the first target disk, and restarting the containerization service process.
Step 43: and if the storage failure is caused by the missing of the data disk, replacing the missing data disk by using the second target disk, opening the cluster internal communication port of the target node, synchronizing the missing data with other service nodes by using the cluster internal communication port, and restarting the containerization service process.
The storage failure refers to a failure caused by an abnormal data storage of a target node, and the specific reason is that new data cannot be stored or stored data is lost, and failure recovery modes corresponding to different failure reasons are different. Specifically, after it is determined that the target node has a storage failure, a storage failure cause is further determined, where the storage failure cause includes insufficient storage space or data disk missing.
As to a specific determination method of the storage failure reason, in a feasible implementation manner, the current remaining space may be obtained, and whether the current remaining space is in a preset interval is determined, where the preset interval is an interval used to indicate that the storage space is insufficient. And if the data disc is in the preset interval, determining that the storage failure reason is insufficient storage space, otherwise, determining that the data disc is missing. In another possible implementation manner, current data disk information and historical data disk information may be obtained, where the current data disk information is information corresponding to a data disk that can be currently read and written, and the historical data disk information is information corresponding to a data disk that can be read and written by a target node after a last data disk change occurs. If the current data disk information is matched with the historical data disk information, the storage failure reason is that the storage space is insufficient, otherwise, the data disk is missing.
And if the storage space is insufficient, expanding the storage space of the target node by using the first target disk, and restarting the containerization service process so that the target node can store new data. The first target disk may be an unused disk of another service node, or may be a spare disk dedicated to performing storage space expansion, and the specific number, model, and the like of the spare disk are not limited. If the data disk is missing, some data in the target node is lost, in this case, a new second target disk is first used to replace the missing disk, i.e. the missing data disk, and the intra-cluster communication port of the target node is opened so as to use it to perform missing data synchronization with other service nodes. And after the missing data is synchronized, restarting the containerization service process, and restoring the target node to the state before the fault occurs.
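For illustration, a sketch of the storage branch follows. The data directory list, the 95% usage threshold standing in for the "insufficient space" interval, and the two placeholder helpers are all assumptions; the real expansion and replacement steps depend on the node's disk layout and are therefore left unimplemented.

    import os
    import shutil
    import subprocess

    DATA_DIRS = ["/var/lib/kafka/data"]   # assumed data-disk mount points
    USAGE_LIMIT = 0.95                    # assumed "insufficient space" threshold

    def storage_failure_cause() -> str:
        # A missing mount point stands in for "current disks no longer match the historical set".
        if any(not os.path.isdir(d) for d in DATA_DIRS):
            return "data_disk_missing"
        worst = max(shutil.disk_usage(d).used / shutil.disk_usage(d).total for d in DATA_DIRS)
        return "space_insufficient" if worst >= USAGE_LIMIT else "unknown"

    def expand_with_first_target_disk() -> None:
        """Placeholder: mount the first target disk and add it to the broker's data directories."""
        raise NotImplementedError("depends on the node's disk layout")

    def replace_with_second_target_disk() -> None:
        """Placeholder: mount a second target disk in place of the missing one, then open the
        cluster-internal communication port so the missing data can be pulled from the peers."""
        raise NotImplementedError("depends on the node's disk layout")

    def recover_storage_fault(node_id: int) -> None:
        cause = storage_failure_cause()
        if cause == "space_insufficient":
            expand_with_first_target_disk()
        elif cause == "data_disk_missing":
            replace_with_second_target_disk()
        # In both branches the containerized service process is restarted afterwards.
        subprocess.run(["docker", "restart", f"kafka-broker-{node_id}"], check=True)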
S104: and carrying out data synchronization on the target node by using the distributed message queue, and re-uploading the target node after the data synchronization is finished.
After the failure is recovered, the target node can be recovered to the state before the failure occurs. However, during the failure recovery process, the cluster still provides service, so even if the target node recovers to the state before the failure occurs, the target node still has data out of synchronization with other service nodes. In order to enable the target node to provide services to the outside, data synchronization is required to be performed on the target node by using a distributed message queue, and data newly added in the queue in the failure period of the target node is synchronized to the target node. Specifically, when the cluster is a kafka cluster, the process of performing data synchronization on the target node by using the distributed message queue may include:
step 51: and opening the cluster internal communication port of the target node.
Step 52: and acquiring target data in the distributed message queue from other service nodes by using the cluster internal communication port so as to complete data synchronization.
And the target data is newly added data in the fault time period of the target node.
It should be noted that, when the failure of the target node is a storage failure caused by a data disk missing, data synchronization is required in the failure recovery process, and an internal communication port of the cluster needs to be opened. Therefore, when the data synchronization is carried out, the internal communication port of the cluster does not need to be opened again.
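A sketch of this synchronization step, assuming the standard Kafka command-line tools are installed under /opt/kafka/bin and that an empty output from kafka-topics.sh --describe --under-replicated-partitions is an acceptable signal that the target data added during the fault period has been pulled from the other brokers; the path and the polling interval are assumptions.

    import subprocess
    import time

    def open_internal_port(target_host: str, port: int = 9093) -> None:
        # Remove the DROP rule added when the node was locked, so replication traffic can flow.
        subprocess.run(["ssh", target_host, "iptables", "-D", "INPUT",
                        "-p", "tcp", "--dport", str(port), "-j", "DROP"], check=True)

    def wait_for_sync(bootstrap: str, poll_seconds: int = 30) -> None:
        while True:
            out = subprocess.run(
                ["/opt/kafka/bin/kafka-topics.sh", "--bootstrap-server", bootstrap,
                 "--describe", "--under-replicated-partitions"],
                capture_output=True, text=True, check=True,
            )
            if not out.stdout.strip():      # empty output: every replica has caught up
                return
            time.sleep(poll_seconds)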
After the data synchronization is finished, the target node is brought back online, and the way it is brought online corresponds to the way it was locked. Specifically, when the service node is a Kafka broker, bringing the target node back online after the data synchronization is completed includes:
Step 61: Open the external communication port and allow the other service nodes to perform data interaction with the target node.
In addition, while the target node is being brought online, or after it is back online, fault information and/or fault recovery information may be reported so that the user can learn how the cluster is operating. The fault information records the current fault condition and the fault recovery information records the current recovery condition; their specific form and content are not limited.
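For illustration, the reporting step could be as small as appending a record to a local log that the operation and maintenance staff already collect; the file path and field names below are assumptions, and any alerting channel could be substituted.

    import json
    import time

    def report(node_id: int, kind: str, detail: str,
               path: str = "/var/log/node-recovery.jsonl") -> None:
        # kind is "fault" for fault information and "recovery" for fault recovery information.
        record = {"time": time.strftime("%Y-%m-%dT%H:%M:%S"), "node": node_id,
                  "kind": kind, "detail": detail}
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")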
By applying the node fault recovery method provided by this embodiment of the application, a containerized service process that provides the distributed message queue service is deployed on each service node. Because the process is containerized, the service process is isolated from the operating system on the service node, so the service node can still be controlled for failure recovery after the service process fails. By monitoring the process state, a service node whose process state indicates a failure is identified as the target node and locked, which takes the target node offline from the cluster and avoids problems such as data inconsistency that would result if it kept working. Because different faults must be repaired in different ways, the fault type of the target node is identified first and the fault is repaired in the corresponding way. During the repair, the rest of the cluster continues to provide service, so to ensure data consistency the distributed message queue is used to synchronize data to the target node, and the target node is brought back online only after the synchronization is completed, after which it can continue to provide service. The whole fault repair process is automatic, no manual repair by operation and maintenance personnel is needed, the fault repair efficiency is improved, the operation and maintenance workload is reduced, and the problem of the large operation and maintenance workload in the related art is solved.
Based on the foregoing embodiments, please refer to fig. 2, which is a flowchart of a specific node failure recovery method according to an embodiment of the present application. Specifically, after detection starts, the service state is queried to determine whether the service (i.e., the containerized service process) on any service node has failed. If a failed service node is detected, that service node is locked and determined to be the target node, and the recovery mechanism is triggered. Specifically, it is judged whether the storage space of the target node has reached its upper limit; if so, the currently stored data is migrated to the first target disk, the first target disk is used to expand the storage space, and fault recovery is completed. If storage has not reached the upper limit, it is judged whether the network is restricted. If the network is restricted, the network card service is restarted and the method waits for the network to recover. If the network is not restricted, the containerized service process is restarted and it is judged whether the service has recovered.
Specifically, for a Kafka cluster, a network failure on a broker node blocks service communication, data synchronization and data distribution. After the recovery mechanism is triggered, network problem diagnosis is performed first. If the problem is a network problem, a restart of the network card service is attempted, a network connectivity test is performed after the restart, and the next step is taken once the network service is normal. If the broker's storage has a problem, it is handled according to the problem type: for a broker failure caused by insufficient disk space, an attempt is made to move the stored data to other disks; if a disk cannot be used for other reasons, an attempt is made to use a new disk as a data disk for the Kafka service and to initiate requests for the missing data to the other brokers, so that data is synchronized from the other broker nodes while no data operations are provided to them. After the above processing is completed, an attempt is made to start the broker process of the failed node.
If, after this processing, the broker node still cannot start normally because of disk, program or network problems, error information such as the broker start-up error, the disk condition and the network condition is collected, and the operation and maintenance administrator is notified to attempt manual intervention. If the node starts normally, the internal communication port is opened first and internal data processing is enabled; the method waits for the topic-related data produced during the offline period to finish synchronizing, and after the synchronization is completed the broker node's external data service is unlocked and the node goes back online in the Kafka service.
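Purely as an illustration, the sketch below strings the helpers from the earlier sketches together into the flow of Fig. 2. It assumes they are importable from a hypothetical recovery module and that an unlock_target_node helper mirrors lock_target_node by reopening the external port and notifying the peers; retry limits and the collection of error information for manual intervention are omitted.

    from recovery import (process_state_ok, lock_target_node, identify_fault_type,
                          recover_network_fault, recover_storage_fault,
                          open_internal_port, wait_for_sync, unlock_target_node, report)

    def handle_node(node_id: int, host: str, peers: list[str], bootstrap: str) -> None:
        if process_state_ok(node_id):
            return                                      # process state normal, nothing to do
        report(node_id, "fault", "containerized service process not running")
        lock_target_node(host, peers)                   # take the target node off the cluster
        if identify_fault_type(host) == "network":
            recover_network_fault(host, node_id)
        else:
            recover_storage_fault(node_id)
        open_internal_port(host)                        # allow cluster-internal replication traffic
        wait_for_sync(bootstrap)                        # catch up on data added while offline
        unlock_target_node(host, peers)                 # reopen the external port, notify the peers
        report(node_id, "recovery", "target node back online")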
In the following, the node failure recovery apparatus provided in the embodiment of the present application is introduced, and the node failure recovery apparatus described below and the node failure recovery method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a node failure recovery apparatus according to an embodiment of the present application, including:
a monitoring module 110, configured to monitor a process state of a containerized service process in each service node;
a locking module 120, configured to lock a target node if a target node whose process state indicates a failure is detected;
the recovery module 130 is configured to identify a fault type of the target node, and perform fault recovery on the target node according to the fault type;
and the unlocking module 140 is configured to perform data synchronization on the target node by using the distributed message queue, and to bring the target node back online after the data synchronization is completed.
Optionally, the recovery module 130 includes:
the network testing unit is used for carrying out network testing on the target node to obtain a first testing result;
the network fault determining unit is used for determining the fault type as the network fault if the first test result shows the network fault;
and the storage fault determining unit is used for determining the fault type as a storage fault if the first test result shows that the network is normal.
Optionally, the recovery module 130 includes:
the network card service restarting unit is used for restarting the network card service of the target node and carrying out network test to obtain a second test result;
and the service process restarting unit is used for restarting the containerized service process and determining that the fault recovery is finished if the second test result indicates that the network is normal.
Optionally, the recovery module 130 includes:
the failure cause determining unit is used for determining the storage failure cause of the target node;
the storage space expansion unit is used for expanding the storage space of the target node by using the first target disk and restarting the containerized service process if the storage failure reason is that the storage space is insufficient;
and the data disk replacing unit is used for, if the storage failure reason is that the data disk is missing, replacing the missing data disk with the second target disk, opening the cluster-internal communication port of the target node, synchronizing the missing data with the other service nodes through the cluster-internal communication port, and restarting the containerized service process.
Optionally, the unlocking module 140 comprises:
an internal port opening unit, configured to open a cluster internal communication port of a target node;
the data synchronization unit is used for acquiring target data in the distributed message queue from other service nodes by utilizing the cluster internal communication port so as to complete data synchronization; and the target data is newly added data in the fault time period of the target node.
Optionally, the locking module 120 comprises:
the port closing unit is used for closing the cluster internal communication port and the external communication port of the target node;
the broadcasting unit is used for broadcasting offline notification for stopping data interaction with the target node to other service nodes in the cluster;
accordingly, the unlocking module 140 includes:
and the opening and broadcasting unit is used for opening the external communication port and allowing other service nodes to perform data interaction with the target node.
Optionally, the method further comprises:
and the reporting module is used for reporting the fault information and/or the fault recovery information.
In the following, the electronic device provided by the embodiment of the present application is introduced, and the electronic device described below and the node failure recovery method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100, so as to complete all or part of the steps in the above-mentioned node failure recovery method; the memory 102 is used to store various types of data to support operation at the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse and buttons, where the buttons may be virtual or physical. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G or 4G, or a combination of one or more of them, so the corresponding communication component 105 may include a Wi-Fi part, a Bluetooth part and an NFC part.
The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, and is configured to perform the node failure recovery method according to the above embodiments.
In the following, a computer-readable storage medium provided by the embodiments of the present application is introduced, and the computer-readable storage medium described below and the node failure recovery method described above may be referred to correspondingly.
The present application further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the above-mentioned node failure recovery method.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between such entities or operations. Also, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article or apparatus.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A node failure recovery method, comprising:
monitoring the process state of the containerized service process in each service node;
if a target node whose process state indicates a failure is detected, locking the target node;
identifying the fault type of the target node, and performing fault recovery on the target node according to the fault type;
and performing data synchronization on the target node by using the distributed message queue, and bringing the target node back online after the data synchronization is completed.
2. The node failure recovery method of claim 1, wherein the identifying the type of failure of the target node comprises:
performing network test on the target node to obtain a first test result;
if the first test result shows that the network fails, determining that the failure type is the network failure;
and if the first test result shows that the network is normal, determining the fault type as a storage fault.
3. The node failure recovery method according to claim 2, wherein if the failure type is a network failure, the performing failure recovery on the target node according to the failure type includes:
restarting the network card service of the target node, and performing network test to obtain a second test result;
and if the second test result shows that the network is normal, restarting the containerized service process and determining that the fault recovery is completed.
4. The node failure recovery method according to claim 2, wherein if the failure type is a storage failure, the performing failure recovery on the target node according to the failure type includes:
determining a storage fault reason of the target node;
if the storage failure reason is that the storage space is insufficient, expanding the storage space of the target node by using a first target disk, and restarting the containerized service process;
and if the storage failure reason is that the data disk is missing, replacing the missing data disk by using a second target disk, opening the cluster internal communication port of the target node, synchronizing the missing data with other service nodes by using the cluster internal communication port, and restarting the containerized service process.
5. The node failure recovery method of claim 1, wherein the synchronizing the data of the target node using the distributed message queue comprises:
opening a cluster internal communication port of the target node;
acquiring target data in the distributed message queue from other service nodes by using the cluster internal communication port to complete data synchronization; and the target data is newly added data in the fault time period of the target node.
6. The node failure recovery method of claim 1, wherein the locking the target node comprises:
closing the cluster internal communication port and the external communication port of the target node;
broadcasting an offline notification for stopping data interaction with the target node to other service nodes in the cluster;
correspondingly, the bringing the target node back online after the data synchronization is completed includes:
and opening the external communication port and allowing the other service nodes to perform data interaction with the target node.
7. The node failure recovery method of claim 1, further comprising:
and reporting the fault information and/or the fault recovery information.
8. A node failure recovery apparatus, comprising:
the monitoring module is used for monitoring the process state of the containerized service process in each service node;
the locking module is used for locking the target node if a target node whose process state indicates a failure is detected;
the recovery module is used for identifying the fault type of the target node and performing fault recovery on the target node according to the fault type;
and the unlocking module is used for performing data synchronization on the target node by using the distributed message queue and bringing the target node back online after the data synchronization is completed.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor configured to execute the computer program to implement the node failure recovery method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the node failure recovery method of any of claims 1 to 7.
CN202110864473.6A 2021-07-29 2021-07-29 Node fault recovery method and device, electronic equipment and readable storage medium Pending CN113726553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110864473.6A CN113726553A (en) 2021-07-29 2021-07-29 Node fault recovery method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110864473.6A CN113726553A (en) 2021-07-29 2021-07-29 Node fault recovery method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113726553A true CN113726553A (en) 2021-11-30

Family

ID=78674283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110864473.6A Pending CN113726553A (en) 2021-07-29 2021-07-29 Node fault recovery method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113726553A (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189573A1 (en) * 2007-02-02 2008-08-07 Darrington David L Fault recovery on a massively parallel computer system to handle node failures without ending an executing job
US20150149814A1 (en) * 2013-11-27 2015-05-28 Futurewei Technologies, Inc. Failure recovery resolution in transplanting high performance data intensive algorithms from cluster to cloud
CN104915263A (en) * 2015-06-30 2015-09-16 北京奇虎科技有限公司 Process fault processing method and device based on container technology
CN104915285A (en) * 2015-06-30 2015-09-16 北京奇虎科技有限公司 Container process monitoring method, device and system
CN106330576A (en) * 2016-11-18 2017-01-11 北京红马传媒文化发展有限公司 Automatic scaling and migration scheduling method, system and device for containerization micro-service
US20200051419A1 (en) * 2017-10-11 2020-02-13 Analog Devices Global Unlimited Company Cloud-based machine health monitoring
CN108173911A (en) * 2017-12-18 2018-06-15 中国科学院声学研究所 A kind of micro services fault detect processing method and processing device
CN108521339A (en) * 2018-03-13 2018-09-11 广州西麦科技股份有限公司 A kind of reaction type node failure processing method and system based on cluster daily record
CN109062655A (en) * 2018-06-05 2018-12-21 腾讯科技(深圳)有限公司 A kind of containerization cloud platform and server
US20200026625A1 (en) * 2018-07-20 2020-01-23 Nutanix, Inc. Two node clusters recovery on a failure
CN109167690A (en) * 2018-09-25 2019-01-08 郑州云海信息技术有限公司 A kind of restoration methods, device and the relevant device of the service of distributed system interior joint
CN110119325A (en) * 2019-05-10 2019-08-13 深圳前海微众银行股份有限公司 Server failure processing method, device, equipment and computer readable storage medium
CN110300026A (en) * 2019-06-28 2019-10-01 北京金山云网络技术有限公司 A kind of network connectivity fai_lure processing method and processing device
CN110688202A (en) * 2019-10-09 2020-01-14 腾讯科技(深圳)有限公司 Service process scheduling method, device, equipment and storage medium
CN111124755A (en) * 2019-12-06 2020-05-08 中国联合网络通信集团有限公司 Cluster node fault recovery method and device, electronic equipment and storage medium
CN111782432A (en) * 2020-06-29 2020-10-16 中国工商银行股份有限公司 Method and device for acquiring data for container abnormity analysis
CN112269694A (en) * 2020-10-23 2021-01-26 北京浪潮数据技术有限公司 Management node determination method and device, electronic equipment and readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277379A (en) * 2022-07-08 2022-11-01 北京城市网邻信息技术有限公司 Distributed lock disaster tolerance processing method and device, electronic equipment and storage medium
CN115277379B (en) * 2022-07-08 2023-08-01 北京城市网邻信息技术有限公司 Distributed lock disaster recovery processing method and device, electronic equipment and storage medium
CN115242812A (en) * 2022-07-25 2022-10-25 济南浪潮数据技术有限公司 Node data synchronization method and device and computer readable storage medium
CN116208705A (en) * 2023-04-24 2023-06-02 荣耀终端有限公司 Equipment abnormality recovery method and electronic equipment
CN116208705B (en) * 2023-04-24 2023-09-05 荣耀终端有限公司 Equipment abnormality recovery method and electronic equipment

Similar Documents

Publication Publication Date Title
CN110798375B (en) Monitoring method, system and terminal equipment for enhancing high availability of container cluster
CN113726553A (en) Node fault recovery method and device, electronic equipment and readable storage medium
US10592330B2 (en) Systems and methods for automatic replacement and repair of communications network devices
CN107660289B (en) Automatic network control
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN108710673B (en) Method, system, computer device and storage medium for realizing high availability of database
CN109144789B (en) Method, device and system for restarting OSD
US9665452B2 (en) Systems and methods for smart diagnoses and triage of failures with identity continuity
CN109726046B (en) Machine room switching method and device
Panda et al. {IASO}: A {Fail-Slow} Detection and Mitigation Framework for Distributed Storage Services
US9798606B2 (en) Systems and methods for smart diagnosis using hosted resources with intelligent altering of boot order
Pourmajidi et al. On challenges of cloud monitoring
US11347601B1 (en) Managing data center failure events
CN110659159A (en) Service process operation monitoring method, device, equipment and storage medium
CN113312153B (en) Cluster deployment method and device, electronic equipment and storage medium
US9798625B2 (en) Agentless and/or pre-boot support, and field replaceable unit (FRU) isolation
WO2019061364A1 (en) Failure analyzing method and related device
US7373542B2 (en) Automatic startup of a cluster system after occurrence of a recoverable error
CN110119325A (en) Server failure processing method, device, equipment and computer readable storage medium
Tola et al. On the resilience of the NFV-MANO: An availability model of a cloud-native architecture
CN115964142A (en) Application service management method, device and storage medium
CN113300913B (en) Equipment testing method and device, testing equipment and storage medium
CN111737130B (en) Public cloud multi-tenant authentication service testing method, device, equipment and storage medium
CN112269693A (en) Node self-coordination method, device and computer readable storage medium
JP2008181432A (en) Health check device, health check method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination