CN117439867A - Cluster upgrading fault processing method, device, equipment and storage medium - Google Patents

Cluster upgrading fault processing method, device, equipment and storage medium

Info

Publication number
CN117439867A
Authority
CN
China
Prior art keywords
node
fault
cluster
open source platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311566570.2A
Other languages
Chinese (zh)
Inventor
何倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Cloud Computing Technology Co ltd
Original Assignee
Zhongdian Cloud Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Cloud Computing Technology Co ltd filed Critical Zhongdian Cloud Computing Technology Co ltd
Priority to CN202311566570.2A priority Critical patent/CN117439867A/en
Publication of CN117439867A publication Critical patent/CN117439867A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663: Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H04L41/08: Configuration management of networks or network elements
    • H04L41/0803: Configuration setting
    • H04L41/0813: Configuration setting characterised by the conditions triggering a change of settings
    • H04L41/082: Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/104: Peer-to-peer [P2P] networks
    • H04L67/1044: Group management mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a cluster upgrade fault processing method, apparatus, device and storage medium. The method comprises the following steps: judging whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes; when a failed node is detected, determining, according to the role information of the failed node in the cluster, whether it needs to be replaced with a new node; if so, deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade. By smoothly migrating the services of the failed node to the replacement node, the method avoids the increase in upgrade time caused by conventional clusters having to wait for fault repair; meanwhile, the failed node does not need to be removed from the storage cluster during the upgrade, preventing the data migration that would otherwise keep the cluster in an unstable upgrade state for a long time.

Description

Cluster upgrading fault processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of cluster upgrade technologies, and in particular, to a cluster upgrade fault processing method, apparatus, device, and storage medium.
Background
The open source platform Kubernetes is an open source container orchestration platform that can automatically deploy, scale, and manage containerized applications. When a cluster is in an upgrade state, some of its components run the new version while others still run the old version. This is an unstable state for the cluster, and the shorter it lasts the better; in other words, the cluster upgrade time should be as short as possible.
Traditional storage clusters are deployed on bare-metal servers. When a node fails during an upgrade, the handling depends on the role of the failed node in the storage cluster: if the failed node does not need to be replaced by a new node, it can simply be deleted from the cluster and the upgrade continues; if a new node is needed to replace the failed node, there is no simple, smooth way to achieve this. In the latter case the upgrade can generally only continue after the fault has been repaired, so the failed node blocks the upgrade and greatly increases the upgrade time. In addition, both cases require the failed node to be deleted from the storage cluster, which triggers data migration and likewise significantly increases the upgrade time.
Moreover, deleting nodes from the storage cluster when the cluster is not in an upgrade state, i.e. scaling the cluster in, is a normal requirement. However, scaling in because a node has failed while the cluster is in an upgrade state amounts to performing a scale-in operation on an already faulty cluster, which makes its operation even more unstable.
Therefore, how to avoid a substantial increase in upgrade time caused by a failed node during a cluster upgrade is a technical problem that currently needs to be solved.
Disclosure of Invention
The main purpose of the present invention is to provide a cluster upgrade fault processing method, apparatus, device and storage medium. By smoothly migrating the services of a failed node to a replacement node, the method avoids the large increase in upgrade time caused by conventional clusters having to wait for fault repair; meanwhile, the failed node does not need to be removed from the storage cluster during the upgrade, preventing the data migration that would otherwise keep the cluster in an unstable upgrade state for a long time.
In a first aspect, the present application provides a cluster upgrade fault processing method, comprising the following steps:
judging whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes;
when a failed node is detected, determining, according to the role information of the failed node in the cluster, whether it needs to be replaced with a new node;
if so, deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
With reference to the first aspect, as an optional implementation, when it is determined that the failed node does not need to be replaced, the failed node is deleted from the open source platform cluster but retained in the storage cluster embedded in the open source platform cluster;
after the upgrade of the open source platform cluster is finished, the storage cluster is upgraded, and the failed node is removed from the storage cluster after the storage cluster upgrade is finished.
With reference to the first aspect, as an optional implementation, when a failed node is detected during the storage cluster upgrade, the detected failed node is filtered out.
With reference to the first aspect, as an optional implementation, the open source platform deletes the labels marked on the failed node one by one and, after deleting each label, marks it on a newly expanded node in the open source platform cluster;
when the open source platform detects the newly added label on the new node, the corresponding service is automatically pulled up on the new node, so that the services related to the failed node are smoothly migrated to the new node.
With reference to the first aspect, as an optional implementation, whether an upgrade fault occurs is determined according to the upgrade time and the error information of the storage cluster deployed on the open source platform Kubernetes;
when it is determined that an upgrade fault has occurred, it is determined that a failed node exists in the storage cluster deployed on the open source platform.
With reference to the first aspect, as an optional implementation, when it is determined that a failed node exists, fault troubleshooting is performed on all nodes based on the connection relationship between each node and the server, or based on fault signals, so as to locate the failed node, where the fault signals include motherboard fault signals, CPU fault signals, and network card fault signals.
With reference to the first aspect, as an optional implementation, it is determined whether the failed node is a master node in the cluster;
when it is determined to be a master node, it is determined that a new node is needed to replace the failed node;
when it is determined to be a non-master node, it is determined that the failed node does not need to be replaced with a new node.
In a second aspect, the present application provides a cluster upgrade fault processing apparatus, comprising:
a judging module, configured to judge whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes;
a determining module, configured to, when a failed node is detected, determine whether it needs to be replaced with a new node according to the role information of the failed node in the cluster;
and an execution module, configured to, if so, delete the label marked on the failed node through the open source platform and mark the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
With reference to the second aspect, as an alternative implementation manner,
in a third aspect, the present application further provides an electronic device, including: a processor; a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of the first aspects.
In a fourth aspect, the present application also provides a computer readable storage medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first aspects.
The application provides a cluster upgrade fault processing method, apparatus, device and storage medium. The method comprises the following steps: judging whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes; when a failed node is detected, determining, according to the role information of the failed node in the cluster, whether it needs to be replaced with a new node; if so, deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade. By smoothly migrating the services of the failed node to the replacement node, the application avoids the increase in upgrade time caused by conventional clusters having to wait for fault repair; meanwhile, the failed node does not need to be removed from the storage cluster during the upgrade, preventing the data migration that would otherwise keep the cluster in an unstable upgrade state for a long time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a cluster upgrade fault handling method provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a cluster upgrade fault handling apparatus provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of replacing a failed node provided in an embodiment of the present application;
fig. 4 is a schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a computer readable program medium provided in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a cluster upgrade fault processing method provided by the present invention, and as shown in fig. 1, the method includes the steps of:
and step S101, judging whether a fault node occurs in the upgrading process of the storage cluster deployed by the open source platform Kubernetes.
Specifically, judging whether an upgrade fault occurs according to the upgrade time and the error reporting information of a storage cluster deployed by the open source platform Kubernetes; and when the upgrading fault is determined to occur, determining that a fault node exists in the storage cluster deployed by the open source platform.
The method is convenient to understand and exemplify, when the upgrading progress is unchanged for a long time in the upgrading process, and the fault information of the upgrading process is combined, whether the upgrading is faulty or not is determined, if the upgrading is faulty, the faulty node exists in the upgrading process, and if the upgrading is faulty, the faulty node is judged by the communication relation between each node and the server, if the A node is found to be in a disconnection state with the server, the A node fault is determined, when a main board fault signal is detected, the faulty node exists in the upgrading process, the faulty node is detected, when a CPU fault signal is detected, the faulty node exists in the upgrading process, and when a network card fault signal is currently detected, the faulty node is determined. It should be noted that, the troubleshooting node may interact with the server through the node, for example, cannot ping through or cannot connect to the server.
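As an illustration of this detection step, the following is a minimal sketch using the official Python kubernetes client; the Ready-condition check, the TCP reachability probe on port 22 and its timeout are assumptions standing in for the "cannot ping / cannot connect to the server" checks described above, not part of the disclosed method.

```python
# Minimal sketch of failed-node detection during an upgrade, assuming the
# official Python "kubernetes" client and a simple TCP reachability probe.
# The probe port and the thresholds are illustrative assumptions.
import socket
from kubernetes import client, config

def node_is_ready(node):
    # Treat a node as healthy only if its Ready condition is "True".
    for cond in node.status.conditions or []:
        if cond.type == "Ready":
            return cond.status == "True"
    return False

def node_is_reachable(address, port=22, timeout=3):
    # Reachability probe standing in for "cannot ping / cannot connect".
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

def find_failed_nodes():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    failed = []
    for node in v1.list_node().items:
        addresses = {a.type: a.address for a in node.status.addresses or []}
        internal_ip = addresses.get("InternalIP")
        if not node_is_ready(node) or (internal_ip and not node_is_reachable(internal_ip)):
            failed.append(node.metadata.name)
    return failed
```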
Step S102: when it is determined that a failed node has occurred, determining whether it needs to be replaced with a new node according to the role information of the failed node in the cluster.
Specifically, it is determined whether the detected failed node is a master node. When it is a master node, it is determined that a new node is needed to replace it; when it is a non-master node, it is determined that the failed node does not need to be replaced with a new node.
For ease of understanding, a storage cluster upgrade based on the open source platform Kubernetes can be regarded as consisting of two parts: the Kubernetes cluster upgrade and the storage cluster upgrade. When a node fails during the upgrade, the decision is made according to the role information of the failed node in the cluster: if it is a master node, a new node is needed to replace it; if it is a non-master node, no replacement is needed. It should be noted that the Kubernetes cluster can be regarded as the foundation on which the storage cluster is deployed, that is, the storage cluster is embedded in, and is a component of, the Kubernetes cluster. Kubernetes clusters are expandable, and embedding the storage cluster in Kubernetes makes its label mechanism available: when a new node A is marked with the same label as a failed node B, the services of B automatically migrate to A.
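The role check in step S102 could look like the minimal sketch below, which assumes the master role can be read from the standard Kubernetes role labels node-role.kubernetes.io/master or node-role.kubernetes.io/control-plane; an actual storage cluster may record role information in its own configuration instead.

```python
# Minimal sketch of the role check in step S102, assuming the failed node's
# role is reflected in the standard Kubernetes role labels; this is an
# assumption, not the only possible source of role information.
from kubernetes import client, config

MASTER_ROLE_LABELS = (
    "node-role.kubernetes.io/master",
    "node-role.kubernetes.io/control-plane",
)

def needs_replacement(node_name):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    labels = v1.read_node(node_name).metadata.labels or {}
    # Only a master node is replaced with a new node; a failed non-master
    # node is simply removed from the Kubernetes cluster.
    return any(label in labels for label in MASTER_ROLE_LABELS)
```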
Step S103: if so, deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
Specifically, when a failed node is found and it is determined not to be a master node, it does not need to be replaced with a new node. The failed node is first deleted only from the Kubernetes cluster, so that it no longer carries any services and the upgrade can proceed smoothly; at the same time, the failed node remains in the storage cluster, so no data migration is triggered and the upgrade time is not increased. After the upgrade is completed, the failed node is deleted from the storage cluster.
When the failed node is determined to be a master node, it needs to be replaced with a new node. The Kubernetes cluster is expanded with a new node, the new node is marked with the labels of the failed node, and the services carried by the failed node automatically migrate to the newly expanded node. The failed node is then deleted from the Kubernetes cluster while still being kept in the storage cluster, and is finally deleted from the storage cluster after the upgrade is finished.
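For the non-master case, the handling can be sketched as follows with the official Python kubernetes client; the function only removes the node from the Kubernetes cluster and deliberately leaves the storage cluster configuration untouched.

```python
# Minimal sketch of the non-master case, assuming the official Python
# "kubernetes" client: the failed node is deleted from the Kubernetes
# cluster only, so no storage-cluster data migration is triggered.
from kubernetes import client, config

def remove_failed_node_from_kubernetes(node_name: str) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # Deleting the Node object removes the node (and its labels) from the
    # Kubernetes cluster, so it no longer carries any services and the
    # Kubernetes upgrade can proceed.
    v1.delete_node(node_name)
    # Nothing is done to the storage cluster here; the failed node is only
    # removed from the storage cluster after the whole upgrade has finished.
```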
It should be noted that both Kubernetes and the storage cluster need to be upgraded. In general Kubernetes is upgraded first and the storage cluster afterwards, but there are special cases: for example, some components of the Kubernetes cluster may rely on features of the storage cluster, and those components are upgraded after the storage cluster.
In both cases the failed node needs to be deleted from the Kubernetes cluster, so that no information about the fault remains there and the Kubernetes cluster upgrade is guaranteed to succeed. The failed node is still retained in the storage cluster configuration information, so it does not need to be removed from the storage cluster, and the large increase in upgrade time caused by data migration is avoided.
During the storage cluster upgrade, the failed node no longer has the corresponding labels and therefore does not run any storage-cluster services, so the upgrade does not need to upgrade or check those services on it. For the upgrade of the storage cluster itself and of the configuration related to cluster node information, the failed node requires special handling because it is still in the storage cluster: the processing of the failed node is filtered out, which ensures that the storage cluster upgrade succeeds.
It should be noted that the special handling means filtering the detected failed node when it is encountered during the storage cluster upgrade. The upgrade process contains many checks, and because the failed node is still in the storage cluster, the checks would never pass if everything related to the failed node were not ignored; filtering the failed node therefore ensures that the storage cluster is also upgraded successfully.
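This filtering can be illustrated with the short sketch below; all_nodes, failed_nodes and the check_node callback are hypothetical placeholders for the storage cluster's own node list and upgrade checks.

```python
# Minimal sketch of the "special handling" described above: upgrade checks
# skip the retained failed nodes so the storage cluster upgrade can pass.
# The inputs and the check_node callback are illustrative assumptions.
def run_upgrade_checks(all_nodes, failed_nodes, check_node):
    results = {}
    for node in all_nodes:
        if node in failed_nodes:
            # The failed node is still present in the storage cluster, so its
            # checks must be skipped; otherwise the verification never passes
            # and the storage cluster upgrade cannot succeed.
            continue
        results[node] = check_node(node)
    return results
```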
It can be appreciated that Kubernetes is an open source container orchestration platform that can automatically deploy, scale and manage containerized applications. For a storage cluster deployed on Kubernetes, scaling out only requires adding a node to the Kubernetes cluster and marking it with the corresponding labels, after which the corresponding services automatically migrate to the newly added node; when a node has to be deleted from the Kubernetes cluster but must remain in the storage cluster, the procedure is likewise simple and mature, and the whole process is smoother.
In this application, if a failed node needs to be replaced with a new node during the upgrade, the Kubernetes label mechanism is used to smoothly migrate the services related to the failed node to the replacement node, which solves the problem that a conventional cluster can only wait for fault repair, greatly increasing the upgrade time.
When the cluster is not in an upgrade state, deleting nodes from the storage cluster, i.e. scaling the cluster in, is a normal requirement. However, scaling in because a node has failed while the cluster is in an upgrade state amounts to performing a scale-in operation on an already faulty cluster, which makes its operation even more unstable.
Therefore, in this application the failed node does not need to be removed from the storage cluster during the upgrade, avoiding the unstable cluster state caused by data migration. At the same time, scaling the cluster in because one of its existing nodes has failed is avoided, preventing the unstable state of the cluster from being aggravated.
Referring to fig. 2, fig. 2 is a schematic diagram of a cluster upgrade fault processing apparatus provided by the present invention. As shown in fig. 2, the apparatus includes:
The judging module 201: configured to judge whether a failed node occurs during the upgrade of the storage cluster deployed on the open source platform Kubernetes.
The determining module 202: configured to, when a failed node is detected, determine whether it needs to be replaced with a new node according to the role information of the failed node in the cluster.
The execution module 203: configured to, if so, delete the label marked on the failed node through the open source platform and mark the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
Further, in a possible implementation, the execution module is further configured to, when it is determined that the failed node does not need to be replaced, delete the failed node from the open source platform cluster and retain it in the storage cluster embedded in the open source platform cluster;
and, after the upgrade of the open source platform cluster is finished, upgrade the storage cluster and remove the failed node from the storage cluster after the storage cluster upgrade is finished.
Further, in a possible implementation, a filtering module is configured to filter the detected failed node when it is detected during the storage cluster upgrade.
Further, in a possible implementation, the execution module is further configured to use the open source platform to delete the labels marked on the failed node one by one and, after deleting each label, mark it on a newly expanded node in the open source platform cluster;
when the open source platform detects the newly added label on the new node, the corresponding service is automatically pulled up on the new node, so that the services related to the failed node are smoothly migrated to the new node.
Further, in a possible implementation, the judging module is further configured to judge whether an upgrade fault occurs according to the upgrade time and the error information of the storage cluster deployed on the open source platform Kubernetes;
and, when it is determined that an upgrade fault has occurred, determine that a failed node exists in the storage cluster deployed on the open source platform.
Further, in a possible implementation, the determining module is further configured to, when it is determined that a failed node exists, perform fault troubleshooting on all nodes based on the connection relationship between each node and the server, or based on fault signals, so as to locate the failed node, where the fault signals include motherboard fault signals, CPU fault signals, and network card fault signals.
Further, in a possible implementation, the judging module is further configured to judge whether the failed node is a master node in the cluster;
when it is determined to be a master node, determine that a new node is needed to replace the failed node;
when it is determined to be a non-master node, determine that the failed node does not need to be replaced with a new node.
Referring to fig. 3, fig. 3 is a schematic diagram of replacing a failed node provided by the present invention. As shown in fig. 3:
The original node can be regarded as the failed node. It should be noted that each label on a node in Kubernetes corresponds to a background service. After Kubernetes expands the cluster with a new node, the labels on the original node are deleted and the newly expanded node is marked with the same labels. When the Kubernetes cluster detects a newly added label on a node, it automatically pulls up the corresponding service on that node, ensuring that the replacement node runs the corresponding service; when it detects that a label has been removed from the failed node, it automatically stops the corresponding service there. In this way the corresponding services migrate from the failed node to a normal node.
In addition, the labels are replaced one by one rather than all at once: when label 1 is deleted, the deleted label 1 is marked on the new node, and the subsequent labels are handled in the same way until all labels have been replaced. It should be noted that the next service is switched only after the service corresponding to the previous label has switched successfully; otherwise, if there are dependencies between the services corresponding to the labels, the switch may fail.
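The sequential label replacement can be sketched as follows with the official Python kubernetes client; the label keys, the wait_ready readiness callback and the polling interval are illustrative assumptions rather than part of the disclosure.

```python
# Minimal sketch of the one-by-one label replacement shown in Fig. 3,
# assuming the official Python "kubernetes" client.
import time
from kubernetes import client, config

def migrate_labels(failed_node, new_node, label_keys, wait_ready, interval=10):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    old_labels = v1.read_node(failed_node).metadata.labels or {}
    for key in label_keys:
        value = old_labels.get(key)
        if value is None:
            continue
        # Delete the label from the failed node first; with the client's
        # default strategic-merge patch, a None value removes the label.
        v1.patch_node(failed_node, {"metadata": {"labels": {key: None}}})
        # Then mark the same label on the newly expanded node, so that the
        # corresponding service is pulled up there.
        v1.patch_node(new_node, {"metadata": {"labels": {key: value}}})
        # Move on to the next label only after this service has switched
        # successfully, to avoid failures caused by dependencies between
        # the services behind different labels.
        while not wait_ready(key, new_node):
            time.sleep(interval)
```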
An electronic device 400 according to such an embodiment of the invention is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 4, the electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: the at least one processing unit 410, the at least one memory unit 420, and a bus 430 connecting the various system components, including the memory unit 420 and the processing unit 410.
Wherein the storage unit stores program code that is executable by the processing unit 410 such that the processing unit 410 performs steps according to various exemplary embodiments of the present invention described in the above-described "example methods" section of the present specification.
The storage unit 420 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423.
The storage unit 420 may also include a program/utility 424 having a set (at least one) of program modules 425, such program modules 425 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 430 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 400 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 450. Also, electronic device 400 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 460. As shown, the network adapter 460 communicates with other modules of the electronic device 400 over the bus 430. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 400, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
Referring to fig. 5, a program product 500 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A cluster upgrade fault processing method, comprising:
judging whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes;
when a failed node is detected, determining, according to the role information of the failed node in the cluster, whether it needs to be replaced with a new node;
if so, deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
2. The method according to claim 1, further comprising:
when it is determined that the failed node does not need to be replaced, deleting the failed node from the open source platform cluster and retaining it in the storage cluster embedded in the open source platform cluster;
after the upgrade of the open source platform cluster is finished, upgrading the storage cluster, and removing the failed node from the storage cluster after the storage cluster upgrade is finished.
3. The method according to claim 2, further comprising:
during the storage cluster upgrade, when the failed node is detected, filtering the detected failed node.
4. The method according to claim 1, wherein deleting the label marked on the failed node through the open source platform and marking the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade, comprises:
the open source platform deleting the labels marked on the failed node one by one and, after deleting each label, marking it on a newly expanded node in the open source platform cluster;
when the open source platform detects the newly added label on the new node, automatically pulling up the corresponding service on the new node, so that the services related to the failed node are smoothly migrated to the new node.
5. The method according to claim 1, wherein judging whether a failed node occurs during the upgrade of the storage cluster deployed on the open source platform Kubernetes comprises:
judging whether an upgrade fault occurs according to the upgrade time and the error information of the storage cluster deployed on the open source platform Kubernetes;
when it is determined that an upgrade fault has occurred, determining that a failed node exists in the storage cluster deployed on the open source platform.
6. The method according to claim 5, comprising:
when it is determined that a failed node exists, performing fault troubleshooting on all nodes based on the connection relationship between each node and the server, or based on fault signals, so as to locate the failed node, wherein the fault signals comprise motherboard fault signals, CPU fault signals, and network card fault signals.
7. The method according to claim 1, wherein determining whether the failed node needs to be replaced with a new node according to the role information of the failed node in the cluster comprises:
judging whether the failed node is a master node in the cluster;
when it is determined to be a master node, determining that a new node is needed to replace the failed node;
when it is determined to be a non-master node, determining that the failed node does not need to be replaced with a new node.
8. A cluster upgrade fault processing apparatus, comprising:
a judging module, configured to judge whether a failed node occurs during the upgrade of a storage cluster deployed on the open source platform Kubernetes;
a determining module, configured to, when a failed node is detected, determine whether it needs to be replaced with a new node according to the role information of the failed node in the cluster;
and an execution module, configured to, if so, delete the label marked on the failed node through the open source platform and mark the deleted label on the new node, so as to complete service replacement for the failed node during the cluster upgrade.
9. An electronic device, the electronic device comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that it stores computer program instructions, which when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
CN202311566570.2A 2023-11-21 2023-11-21 Cluster upgrading fault processing method, device, equipment and storage medium Pending CN117439867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311566570.2A CN117439867A (en) 2023-11-21 2023-11-21 Cluster upgrading fault processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311566570.2A CN117439867A (en) 2023-11-21 2023-11-21 Cluster upgrading fault processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117439867A true CN117439867A (en) 2024-01-23

Family

ID=89549820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311566570.2A Pending CN117439867A (en) 2023-11-21 2023-11-21 Cluster upgrading fault processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117439867A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034750A (en) * 2024-04-12 2024-05-14 成都赛力斯科技有限公司 Upgrade control instruction response method and device, mobile terminal and readable storage medium


Similar Documents

Publication Publication Date Title
JP4716637B2 (en) System and method for automating management of computer services and programmable devices
CN100421071C (en) Updating method for distance equipment system software
US20140376362A1 (en) Dynamic client fail-over during a rolling patch installation based on temporal server conditions
US9436557B2 (en) Method and computation node for processing application data
CN102334100A (en) Program update device, program update method, and information processing device
CN117439867A (en) Cluster upgrading fault processing method, device, equipment and storage medium
CN1936844A (en) Method and system for updating software
WO2021031889A1 (en) Upgrade method, communication device and computer-readable storage medium
CN102739451B (en) Method and device for updating master-slave switchover condition, server and system
CN101356499A (en) Method for secure in-service software upgrades
CN110377456A (en) A kind of management method and device of virtual platform disaster tolerance
CN104915226A (en) Network device software starting method, device and network device
CN104918114A (en) Method and device for upgrading operation system
CN105159727A (en) Firmware upgrade processing method, apparatus and system
CN112835604A (en) System gray scale version release management method, system, device and medium
CN113051104A (en) Method and related device for recovering data between disks based on erasure codes
US20050108704A1 (en) Software distribution application supporting verification of external installation programs
KR102106449B1 (en) Method, device and server for checking a defective function
CN105700903A (en) User terminal upgrading method and user terminal
CN115543393B (en) Upgrade method, electronic device and storage medium
CN106708541A (en) Version upgrading processing method and apparatus
US10027535B1 (en) Systems and methods for managing device configurations at various levels of abstraction
CN103167545B (en) Be correlated with the method for IP cutover and device in a kind of base station
CN113312003B (en) Go language-based physical bare computer disk storage management method
CN104239497A (en) Clean-up method and device for upgraded files

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination