WO2024131366A1 - Cluster repair method and apparatus - Google Patents

Cluster repair method and apparatus

Info

Publication number
WO2024131366A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
cluster
normal
reorganization
nodes
Prior art date
Application number
PCT/CN2023/130370
Other languages
English (en)
Chinese (zh)
Inventor
孟凡辉
张基峰
Original Assignee
中科信息安全共性技术国家工程研究中心有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科信息安全共性技术国家工程研究中心有限公司
Publication of WO2024131366A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/18 Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring

Definitions

  • the embodiments of the present application relate to the field of cluster technology, and in particular to a cluster repair method and device.
  • a normal node in the cluster is designated to add the repaired node to the reorganized cluster.
  • the present application also provides a cluster repair device, the device comprising: a monitoring module, a deletion module, a reorganization module and an adding module; wherein,
  • the monitoring module is used to monitor the operating status of each node in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node; the cluster includes one or more nodes;
  • the deletion module is used to delete the faulty node from the cluster if the cluster cannot operate normally;
  • the reorganization module is used to designate a normal node in the cluster to reorganize the cluster and repair the faulty node;
  • the adding module is used to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
  • the embodiment of the present application proposes a cluster repair method and device, which monitors the operating status of each node in the cluster; and monitors whether the cluster can operate normally according to the operating status of each node; if the cluster cannot operate normally, the faulty node is deleted from the cluster; then a normal node in the cluster is designated to reorganize the cluster and repair the faulty node; and then a normal node in the cluster is designated to add the repaired node to the reorganized cluster.
  • the operating status of each node in the cluster can be monitored in real time; when the cluster cannot operate normally, a normal node is first designated to complete the cluster reorganization to ensure that the cluster does not stop working, and then the faulty node is analyzed, repaired, and added to the reorganized cluster.
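As a minimal sketch of this flow, and only for illustration, the sequence might be expressed as follows in Python; the Node record, the largest-IP rule for choosing the executor, and the placeholder repair step are assumptions, not details disclosed by the publication.

```python
# Illustrative sketch only: monitor -> delete faulty nodes -> reorganize -> repair -> re-add.
import ipaddress
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    ip: str
    healthy: bool = True
    db_version: int = 0          # stand-in for "latest database information"

def repair_cluster(nodes: list[Node]) -> list[Node]:
    faulty = [n for n in nodes if not n.healthy]
    if not faulty:                                   # the cluster operates normally
        return nodes
    cluster = [n for n in nodes if n.healthy]        # delete the faulty nodes
    # designate a normal node (largest IP in this sketch) to reorganize the cluster
    executor = max(cluster, key=lambda n: ipaddress.ip_address(n.ip))
    print(f"node {executor.name} reorganizes the cluster: {[n.name for n in cluster]}")
    for n in faulty:                                 # repair, then add back to the cluster
        n.healthy = True                             # placeholder for the real repair step
        cluster.append(n)
    return cluster

if __name__ == "__main__":
    nodes = [Node("A", "10.0.0.1", db_version=3),
             Node("B", "10.0.0.2", db_version=3),
             Node("C", "10.0.0.3", healthy=False)]
    print([n.name for n in repair_cluster(nodes)])   # ['A', 'B', 'C']
```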
  • when the cluster cannot operate normally due to a fault during operation, especially when multiple nodes in the cluster fail or lose power, the cluster will stop working.
  • the cluster repair method and device proposed in the embodiment of the present application can automatically repair the cluster to avoid the impact on the business to the greatest extent; and the technical solution of the embodiment of the present application is simple and convenient to implement, easy to popularize, and has a wider range of application.
  • FIG1 is a schematic diagram of a first process flow of a cluster repair method provided in an embodiment of the present application
  • FIG2 is a schematic diagram of a second process of the cluster repair method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of a third process flow of the cluster repair method provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a cluster system architecture provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a cluster repair device provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG1 is a first flow chart of a cluster repair method provided in an embodiment of the present application.
  • the method can be performed by a cluster repair device or electronic device.
  • the device or electronic device can be implemented by software and/or hardware.
  • the device or electronic device can be integrated into any smart device with network communication function.
  • the cluster repair method may include the following steps:
  • each node in the cluster can monitor its own operating status and the operating status of other nodes in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node.
  • the operating status of each node in the cluster is monitored by a monitoring function module; wherein the monitoring function module is located on each node of the cluster, and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the monitoring function module is not limited to the above functions in actual applications, and the module can also realize functions such as cluster reorganization and node addition.
  • Each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: Node A, Node B and Node C; suppose that when Node C fails, the daemon process of Node A and the daemon process of Node B can detect that Node C fails. At this time, the daemon process of Node A and the daemon process of Node B can delete Node C from the cluster. Alternatively, custom rules can be used to designate one of the nodes to complete the deletion of Node C.
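For illustration only, a daemon's detection-and-deletion step might look like the following sketch; the health map and the smallest-IP deletion rule are assumptions used to make the example concrete.

```python
# Illustrative sketch only: each daemon detects failed peers, and a custom rule
# (smallest IP among the normal nodes here) designates exactly one node to perform
# the deletion so the removal is not executed twice.
import ipaddress

def detect_and_delete(self_ip: str, members: dict[str, str],
                      alive: dict[str, bool]) -> dict[str, str]:
    """members: node name -> IP address; alive: result of a hypothetical health probe."""
    failed = {name for name, ok in alive.items() if not ok}
    normal = {name: ip for name, ip in members.items() if name not in failed}
    # custom rule: only the normal node with the smallest IP carries out the deletion
    deleter_ip = min(normal.values(), key=ipaddress.ip_address)
    if self_ip == deleter_ip:
        return normal             # this daemon removes the failed nodes from the cluster
    return dict(members)          # the other daemons leave the change to the designated node

# e.g. seen from Node A's daemon, with Node C detected as failed:
members = {"A": "10.0.0.1", "B": "10.0.0.2", "C": "10.0.0.3"}
alive = {"A": True, "B": True, "C": False}
print(detect_and_delete("10.0.0.1", members, alive))   # Node C is dropped
```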
  • S103 Designate a normal node in the cluster to reorganize the cluster and repair the faulty node.
  • if the cluster has only one normal node, the cluster configuration is modified through that node and the cluster operation mode is converted to stand-alone mode to avoid the impact of the cluster failure.
  • the nodes in the cluster can specify a node as the reorganization execution node through custom rules; then set a node in the cluster as the master node through the reorganization execution node, and reorganize the cluster through the master node.
  • the selected reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operations, obtain information about other normal nodes in the current cluster, and determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node, specifically including the master node modifying the cluster configuration information and the master node synchronizing the database information of this node to other nodes in the reorganized cluster.
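A minimal sketch of this master selection follows; the numeric "db_position" is an assumed stand-in for "latest database information" (for example a replication log position), and the returned configuration layout is invented for the example.

```python
# Illustrative sketch only: the reorganization execution node collects the state of the
# other normal nodes and sets the node with the most recent database information as master.
from dataclasses import dataclass

@dataclass
class NodeState:
    name: str
    ip: str
    db_position: int

def reorganize(normal_nodes: list[NodeState]) -> dict:
    master = max(normal_nodes, key=lambda n: n.db_position)
    return {
        "master": master.name,
        "members": [n.name for n in normal_nodes],
        "sync_source": master.ip,   # the master's database is synchronized to the others
    }

print(reorganize([NodeState("A", "10.0.0.1", 120),
                  NodeState("B", "10.0.0.2", 118)]))  # node A becomes the master
```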
  • the designated reorganization execution node needs to complete the analysis and repair of the faulty node problems and handle different fault situations.
  • the repaired node is added to the reorganized cluster through the designated normal node in the cluster.
  • if the cluster has only one normal node, that node can be designated as the add execution node to add the repaired node to the cluster; if the cluster has more than one normal node, one of the normal nodes can be designated as the add execution node through a custom rule to add the repaired node to the cluster.
  • Usually, the add execution node in this step is the reorganization execution node determined in S103.
  • The add execution node adds the repaired node to the reorganized cluster.
  • the specific operation process is as follows: elect a node as the add execution node according to the IP election mechanism, or directly use the reorganization execution node of step S103 as the add execution node, and modify the configuration files of each node in the cluster through the add execution node; based on the modified configuration files of each node in the cluster, the repaired node is started, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete the database synchronization.
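The following sketch illustrates such an add execution node under assumed file names and a JSON configuration layout; none of these details come from the publication, and the node start-up itself is left as a placeholder.

```python
# Illustrative sketch only: the add execution node rewrites the member list in every
# node's configuration file and pushes its periodic database backup to the repaired node.
import ipaddress, json, shutil, tempfile
from pathlib import Path

def add_repaired_node(normal_ips: list[str], repaired_ip: str,
                      conf_dir: Path, backup_file: Path) -> str:
    # elect the add execution node by the IP rule (or reuse the reorganization node)
    executor = max(normal_ips, key=ipaddress.ip_address)
    members = sorted(normal_ips + [repaired_ip], key=ipaddress.ip_address)
    for ip in members:                               # rewrite each node's configuration file
        (conf_dir / f"{ip}.json").write_text(json.dumps({"members": members}))
    # start the repaired node with the new configuration (placeholder) and send it the
    # database backup kept by the executor's monitoring module; no cluster lock is taken
    shutil.copy(backup_file, conf_dir / f"{repaired_ip}.restore.db")
    return executor

with tempfile.TemporaryDirectory() as d:
    conf = Path(d)
    backup = conf / "backup.db"
    backup.write_text("periodic backup contents")
    print(add_repaired_node(["10.0.0.1", "10.0.0.2"], "10.0.0.3", conf, backup))
```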
  • This operation does not affect the normal operation of the original cluster node, and there is no need to perform a locking operation.
  • the add execution node can also complete synchronization verification and integrity verification to ensure the integrity and consistency of the data.
  • the repaired node can be added to the reorganized cluster in at least the following two scenarios: 1) One or more nodes in the cluster fail, and after the failed node is repaired, the repaired node is added to the cluster; 2) The original cluster needs to be expanded, and the new node needs to be added to the original cluster.
  • a normal node can be selected for full data synchronization according to the Paxos algorithm combined with custom rules, such as IP operation rules.
  • the problem solved by the Paxos algorithm is how a distributed system can reach a consensus on a certain value or a certain resolution.
  • a typical scenario is that in a distributed database system, if the initial state of each node is consistent and each node executes the same sequence of operations, then they can finally get a consistent state.
  • a "consistency algorithm” needs to be executed on each instruction to ensure that the instructions seen by each node are consistent.
  • a general consistency algorithm can be applied in many scenarios and is an important issue in distributed computing.
  • the consistency algorithm can be implemented through shared memory or message passing, and the Paxos algorithm uses the latter.
  • scenarios in which the Paxos algorithm is applicable include: multiple processes/threads in a machine reaching data consistency; multiple clients concurrently reading and writing data in a distributed file system or distributed database; and the consistency of multiple replicas in distributed storage responding to read and write requests.
  • FIG2 is a second flow diagram of the cluster repair method provided in the embodiment of the present application. Based on the above technical solution, it is further optimized and expanded, and can be combined with the above optional implementation methods. As shown in FIG2, the cluster repair method may include the following steps:
  • S201 Monitor the operating status of each node in the cluster through a monitoring function module; wherein the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the operating status of each node in the cluster can be monitored by the monitoring function module; wherein the monitoring function module is located on each node of the cluster, and is used to monitor the operating status of the node and other nodes in the cluster, perform fault analysis and repair, and back up the database.
  • the monitoring function module in the embodiment of the present application can have the following functions: 1) monitor the operating status of the node database and the node status; 2) monitor the operating status of other node databases in the cluster and the node status; 3) use custom rules to specify a node to be responsible for the repair of the faulty node; 4) regularly back up the data information of this node, and be responsible for the full backup of the newly added nodes and the formation of a new cluster.
  • the present application realizes automatic cluster reorganization through the monitoring function module.
  • the monitoring function module on each node monitors its own operating status and the operating status of other node servers in the cluster.
  • when a fault is detected, the custom mechanism is started, which mainly specifies one of the fault-free nodes to complete the automatic cluster reorganization.
  • In this step, if the cluster cannot operate normally, the faulty node is deleted from the cluster. If the cluster has only one normal node (fault-free node), the monitoring function module on the normal node can be used to modify the configuration of the cluster and convert the cluster's operating mode to stand-alone mode. Specifically, if there is only one normal node left in the cluster, the cluster stops working and the cluster database stops updating. When the monitoring function module of the remaining normal node detects this situation, it modifies the relevant configuration and automatically converts the cluster mode to stand-alone mode, so that normal reading and writing of the new cluster database can be achieved through this normal node.
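A minimal sketch of this stand-alone conversion, with invented configuration keys (the publication does not name a concrete database or configuration format):

```python
# Illustrative sketch only: when a single normal node remains, its monitoring module
# switches the configuration to stand-alone mode so the database can still be read and written.
def to_standalone(config: dict, normal_nodes: list[str]) -> dict:
    if len(normal_nodes) == 1:
        # no quorum is possible with one node, so run it as a stand-alone instance
        return dict(config, mode="standalone", members=list(normal_nodes))
    return config

print(to_standalone({"mode": "cluster", "members": ["A", "B", "C"]}, ["A"]))
# {'mode': 'standalone', 'members': ['A']}
```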
  • each node has a daemon process that can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B, and node C; suppose that when node C fails, the daemon of node A and the daemon of node B can detect the failure of node C respectively. At this time, the daemon of node A and the daemon of node B can use custom rules to select a normal node to delete node C from the cluster.
  • a node is designated as a reorganization execution node through a custom rule.
  • the custom rule in the embodiment of the present application may be: take the normal node with the largest or smallest IP address in the current cluster as the reorganization execution node.
  • the embodiment of the present application adopts a custom IP election mechanism, which selects the node with the largest IP address among all the current normal nodes in the cluster as the cluster reorganization executor.
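For illustration, the largest-IP rule might be implemented as follows; the function name is hypothetical.

```python
# Illustrative sketch of the custom IP election rule described above: the normal node
# with the numerically largest IP address becomes the reorganization execution node.
import ipaddress

def elect_reorganizer(normal_node_ips: list[str]) -> str:
    # compare addresses numerically, not as strings, so "10.0.0.10" beats "10.0.0.9"
    return max(normal_node_ips, key=ipaddress.ip_address)

print(elect_reorganizer(["10.0.0.9", "10.0.0.10", "10.0.0.2"]))   # -> 10.0.0.10
```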
  • the monitoring process on node C is designated to trigger the cluster reorganization task, and node C is the cluster reorganization execution node.
  • Node C completes the cluster reorganization, including designating the master node of the reorganized cluster, and the master node completes the startup of each node of the cluster to achieve the normal operation of the reorganized cluster.
  • the startup in the embodiment of the present application refers to the startup of the cluster service, not the startup of the hardware device, which can be understood here as the startup of the software.
  • a normal node in the cluster is set as the master node through the reorganization execution node, the cluster is reorganized through the master node, and the faulty node is repaired.
  • the designated reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operations, obtain information of other normal nodes, and determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node.
  • the current cluster contains three normal nodes A, B, and C, among which node C has the largest IP address.
  • the monitoring process on node C triggers the cluster reorganization task.
  • Node C is the cluster reorganization execution node.
  • Node C completes the cluster reorganization, including specifying the master node of the reorganized cluster.
  • the master node starts each node in the cluster to achieve normal operation of the reorganized cluster.
  • the monitoring function module on node C sends relevant instructions to node A, sets node A as the master node, modifies the relevant configuration, and starts node A first, and then sends relevant instructions to node B and node C to complete the startup, or the monitoring function module adds other normal nodes to the cluster one after another according to this rule to complete the cluster reorganization.
  • the startup in the embodiment of the present application refers to the startup of the cluster service, not the startup of the hardware device, which can be understood here as the startup of the software.
  • the monitoring function module located on the normal node can delete the faulty node in time and repair the faulty node; the faulty node is repaired by analyzing the fault and taking corresponding repair measures.
  • the faults of the faulty node usually include a MySQL crash, a process deadlock, a network abnormality, or a case where power is restored but the node cannot start automatically.
  • if the cluster has only one normal node, the monitoring function module on that node is set by default to complete the removal and repair of the faulty node; if there is more than one normal node in the cluster at this time, the node specified by the custom rule can be used as the fault repair execution node to repair the faulty node.
  • the specified fault repair execution node is the reorganization execution node of S203.
  • the reasons for repair in the embodiment of the present application may include but are not limited to: service damage, file loss, service crash, error code, etc.
  • the specific repair method can be to restore the failed node through the original file.
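Purely as an illustration of such a dispatch by fault type, the repair execution node might map each fault class to an action; the fault labels below follow the examples in the text, while the actions themselves are placeholders and not procedures disclosed by the publication.

```python
# Illustrative sketch only: choosing a repair action by fault type.
REPAIR_ACTIONS = {
    "mysql_crash":   "restart the database service and restore damaged files from the original files",
    "deadlock":      "terminate the deadlocked process and restart the service",
    "network_error": "reset the network configuration and rejoin the cluster",
    "no_autostart":  "start the cluster service manually after power is restored",
}

def repair(fault_type: str) -> str:
    action = REPAIR_ACTIONS.get(fault_type, "collect logs for manual analysis")
    return f"fault repair execution node applies: {action}"

print(repair("mysql_crash"))
```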
  • After the faulty node is repaired, or when the cluster needs to add a new node to improve its processing capacity, the node must be added to the cluster. In the existing technology, when a node is added, the add execution node of the cluster is locked: its database can only be read but not written, and the add execution node can only be unlocked after data synchronization with the repaired node or other new nodes is completed, which affects the use of the add execution node and also causes certain data synchronization problems. In this embodiment, after the repair is completed, a normal node in the cluster designated in the above steps is used to add the repaired node to the reorganized cluster.
  • if the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, a node can be designated as the add execution node through a custom rule, and the repaired node is added to the cluster. If the custom rule adopted in this operation is the same as that in the aforementioned step S203, then the node designated in this operation is the same node as the reorganization execution node determined in S203; otherwise they are different.
  • a node is selected as the add execution node according to the custom rule, and the configuration files of each node in the cluster are modified by the add execution node; the repaired node is started based on the modified configuration files of each node in the cluster, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete database synchronization.
  • the custom rule in the embodiment of the present application is: take the node with the largest or smallest IP address in the cluster as the reorganization execution node.
  • FIG3 is a third flow chart of the cluster repair method provided in the embodiment of the present application. Based on the above technical solution, it is further optimized and expanded, and can be combined with the above optional implementation methods. As shown in FIG3, the cluster repair method may include the following steps:
  • S301 Monitor the operating status of each node in the cluster through a monitoring function module; wherein the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the operating status of each node in the cluster is monitored through the monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database. For example, if there are three nodes in the cluster, namely node A, node B, and node C, then a monitoring function module can be set on each of node A, node B, and node C.
  • monitoring function module A is used for monitoring the operating status of node A, node B and node C, fault analysis and repair, and database backup
  • monitoring function module B is used for monitoring the operating status of node B, node A and node C, fault analysis and repair, and database backup
  • monitoring function module C is used for monitoring the operating status of node C, node A and node B, fault analysis and repair, and database backup.
  • each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B and node C; assuming that when node C fails, the daemon process of node A and the daemon process of node B can detect that node C fails.
  • the daemon process of node A and the daemon process of node B can delete node C from the cluster respectively, or use custom rules to specify one of the nodes to complete the deletion of the faulty node C.
  • a node is designated as a reorganization execution node through a custom rule.
  • designate a node as the reorganization execution node through a custom rule. For example, suppose there are three nodes in the cluster, namely node A, node B, and node C; suppose node C fails, then node C can be deleted from the cluster; then suppose node A is the node with the largest IP address among the remaining normal nodes, then node A can be used as the reorganization execution node.
  • S304 Trigger the monitoring function module at the reorganization execution node to perform the cluster reorganization operation, obtain information of other normal nodes, and determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node.
  • the faulty node is repaired through the reorganization execution node selected in S303.
  • the reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operations, obtain information of other normal nodes, and determine the node with the latest database information through the information of other normal nodes.
  • the node with the latest database information is set as the master node, and the cluster information configuration and data synchronization are completed through the master node.
  • each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B and node C; assuming that when node C fails, the daemon process of node A and the daemon process of node B can respectively detect that node C has failed. At this time, the daemon process of node A and the daemon process of node B can respectively delete node C from the cluster, or specify one of the nodes to perform the deletion operation of node C through custom rules such as the maximum or minimum IP rule.
  • if node A is the node with the largest IP address, node A can be used as the reorganization execution node; at this time, the monitoring function module of node A can be triggered to perform the cluster reorganization operation, obtain the information of node B, and determine the node with the latest database information through the information of node B; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node. If there are multiple nodes with the latest database information, it is necessary to use the custom rules again to determine one of the nodes as the master node to complete the reorganization cluster information configuration and data synchronization.
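A minimal sketch of this tie-break, assuming numeric database positions as the freshness measure and the largest-IP rule as the secondary criterion:

```python
# Illustrative sketch only: selecting the master when several normal nodes hold equally
# fresh database information, reusing the IP rule as the tie-breaker.
import ipaddress

def choose_master(db_positions: dict[str, int]) -> str:
    """db_positions maps a node's IP to its database freshness value (assumed metric)."""
    latest = max(db_positions.values())
    newest = [ip for ip, pos in db_positions.items() if pos == latest]
    # several nodes are equally up to date: apply the custom rule (largest IP) again
    return max(newest, key=ipaddress.ip_address)

print(choose_master({"10.0.0.1": 7, "10.0.0.2": 7, "10.0.0.3": 5}))   # -> 10.0.0.2
```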
  • if the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, a node can be designated as the add execution node through custom rules, and the repaired node is added to the cluster.
  • FIG 4 is a schematic diagram of the cluster system architecture provided in an embodiment of the present application.
  • the system may include: a management unit, a storage unit, a scheduling unit and a computing unit; wherein the administrator sends an http request to the management unit through a browser to implement management operations on the cluster.
  • the management unit may include: system management services, business management services, system monitoring services and upgrade services.
  • the storage unit may include N units, namely storage unit 1, storage unit 2, ..., storage unit N; wherein N is a natural number greater than 1; the dispatcher may send an http request to the scheduling unit to implement scheduling operations on the cluster.
  • the computing unit may include: signature and encryption services.
  • FIG5 is a schematic diagram of the structure of a cluster repair device provided in an embodiment of the present application.
  • the cluster repair device includes: a monitoring module 501, a deletion module 502, a reorganization module 503 and an addition module 504;
  • the monitoring module 501 is used to monitor the operating status of each node in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node; the cluster includes one or more nodes;
  • the deletion module 502 is used to delete the faulty node from the cluster if the cluster cannot operate normally;
  • the reorganization module 503 is used to designate a normal node in the cluster to reorganize the cluster and repair the faulty node;
  • the adding module 504 is used to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
  • the above cluster repair device can execute the method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method.
  • the cluster repair method provided by any embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG6 shows a block diagram of an exemplary electronic device suitable for implementing an embodiment of the present application.
  • the electronic device can be any node in a cluster.
  • the electronic device 12 shown in FIG6 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
  • the electronic device 12 is in the form of a general purpose computing device.
  • the components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor or a local bus using any of a variety of bus architectures.
  • these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • the electronic device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media, removable and non-removable media.
  • the system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • the electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 6 , commonly referred to as a “hard drive”).
  • a disk drive for reading and writing a removable non-volatile disk (such as a "floppy disk") and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media) may also be provided.
  • each drive may be connected to the bus 18 via one or more data medium interfaces.
  • the memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to perform the functions of the various embodiments of the present application.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination may include an implementation of a network environment.
  • the program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • the electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboards, pointing devices, displays 24, etc.), may also communicate with one or more devices that enable a user to interact with the electronic device 12, and/or communicate with any device that enables the electronic device 12 to communicate with one or more other computing devices (e.g., network cards, modems, etc.). Such communication may be performed via an input/output (I/O) interface 22.
  • the electronic device 12 may also communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and/or public networks, such as the Internet) via a network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via a bus 18.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the cluster repair method provided in the embodiment of the present application.
  • An embodiment of the present application provides a computer storage medium.
  • the computer-readable storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • a computer readable storage medium may be a computer programmable memory device, a computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, which carry computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operation of the present application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partially on the user's computer, as an independent software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., using an Internet service provider to connect through the Internet).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

A cluster repair method and apparatus are provided. The method comprises: monitoring the operating status of each node in a cluster, and monitoring, according to the operating status of each node, whether the cluster can operate normally (101); if the cluster cannot operate normally, deleting a faulty node from the cluster (102); designating a normal node in the cluster to reorganize the cluster, so as to repair the faulty node (103); and designating a normal node in the cluster to add the repaired node to the reorganized cluster (104). The cluster can be repaired automatically, thereby avoiding, to the greatest extent, the impact on a service.
PCT/CN2023/130370 2022-12-21 2023-11-08 Cluster repair method and apparatus WO2024131366A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211651715.4 2022-12-21
CN202211651715.4A CN115904822A (zh) 2022-12-21 2022-12-21 一种集群修复方法及装置

Publications (1)

Publication Number Publication Date
WO2024131366A1 (fr)

Family

ID=86493371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/130370 WO2024131366A1 (fr) Cluster repair method and apparatus

Country Status (2)

Country Link
CN (1) CN115904822A (fr)
WO (1) WO2024131366A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115904822A (zh) * 2022-12-21 2023-04-04 长春吉大正元信息技术股份有限公司 一种集群修复方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059805A1 (en) * 2002-09-23 2004-03-25 Darpan Dinker System and method for reforming a distributed data system cluster after temporary node failures or restarts
CN106933693A (zh) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 一种数据库集群节点故障自动修复方法及系统
CN111124755A (zh) * 2019-12-06 2020-05-08 中国联合网络通信集团有限公司 集群节点的故障恢复方法、装置、电子设备及存储介质
CN112650624A (zh) * 2020-12-25 2021-04-13 浪潮(北京)电子信息产业有限公司 一种集群升级方法、装置、设备及计算机可读存储介质
CN112866408A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 一种集群中业务切换方法、装置、设备及存储介质
CN115904822A (zh) * 2022-12-21 2023-04-04 长春吉大正元信息技术股份有限公司 一种集群修复方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100264896B1 (ko) * 1998-07-27 2000-09-01 윤종용 다중 클러스터 시스템의 클러스터 노드 고장 감지 장치 및 방법
US7953860B2 (en) * 2003-08-14 2011-05-31 Oracle International Corporation Fast reorganization of connections in response to an event in a clustered computing system
CN105915405A (zh) * 2016-03-29 2016-08-31 深圳市中博科创信息技术有限公司 一种大型集群节点性能监控系统
CN111901422B (zh) * 2020-07-28 2022-11-11 浪潮电子信息产业股份有限公司 一种集群中节点的管理方法、系统及装置
CN113326100B (zh) * 2021-06-29 2024-04-09 深信服科技股份有限公司 一种集群管理方法、装置、设备及计算机存储介质
CN114301802A (zh) * 2021-12-27 2022-04-08 北京吉大正元信息技术有限公司 密评检测方法、装置和电子设备
CN114816820A (zh) * 2022-04-26 2022-07-29 平安普惠企业管理有限公司 chproxy集群故障修复方法、装置、设备及存储介质
CN115499447A (zh) * 2022-09-15 2022-12-20 北京天融信网络安全技术有限公司 一种集群主节点确认方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059805A1 (en) * 2002-09-23 2004-03-25 Darpan Dinker System and method for reforming a distributed data system cluster after temporary node failures or restarts
CN106933693A (zh) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 一种数据库集群节点故障自动修复方法及系统
CN111124755A (zh) * 2019-12-06 2020-05-08 中国联合网络通信集团有限公司 集群节点的故障恢复方法、装置、电子设备及存储介质
CN112650624A (zh) * 2020-12-25 2021-04-13 浪潮(北京)电子信息产业有限公司 一种集群升级方法、装置、设备及计算机可读存储介质
CN112866408A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 一种集群中业务切换方法、装置、设备及存储介质
CN115904822A (zh) * 2022-12-21 2023-04-04 长春吉大正元信息技术股份有限公司 一种集群修复方法及装置

Also Published As

Publication number Publication date
CN115904822A (zh) 2023-04-04

Similar Documents

Publication Publication Date Title
KR102268355B1 (ko) 클라우드 배치 기반구조 검증 엔진
US20190073258A1 (en) Predicting, diagnosing, and recovering from application failures based on resource access patterns
US8255653B2 (en) System and method for adding a storage device to a cluster as a shared resource
US8966318B1 (en) Method to validate availability of applications within a backup image
US9678682B2 (en) Backup storage of vital debug information
CN109325016B (zh) 数据迁移方法、装置、介质及电子设备
US9189338B2 (en) Disaster recovery failback
US7624309B2 (en) Automated client recovery and service ticketing
US11144405B2 (en) Optimizing database migration in high availability and disaster recovery computing environments
WO2024131366A1 (fr) Procédé et appareil de réparation de grappe
US9436539B2 (en) Synchronized debug information generation
WO2012053085A1 (fr) Dispositif et procédé de commande de stockage
JP4239989B2 (ja) 障害復旧システム、障害復旧装置、ルール作成方法、および障害復旧プログラム
CN111522703A (zh) 监控访问请求的方法、设备和计算机程序产品
CN108833164B (zh) 服务器控制方法、装置、电子设备及存储介质
CN111581021B (zh) 应用程序启动异常的修复方法、装置、设备及存储介质
WO2015043155A1 (fr) Procédé et dispositif de sauvegarde et de récupération d'élément de réseau sur la base d'un ensemble de commandes
CN112261114A (zh) 一种数据备份系统及方法
US20140201566A1 (en) Automatic computer storage medium diagnostics
WO2020233001A1 (fr) Système de stockage distribué comprenant une architecture à double commande, procédé et dispositif de lecture de données, et support de stockage
JP5352027B2 (ja) 計算機システムの管理方法及び管理装置
CN111381770B (zh) 一种数据存储切换方法、装置、设备及存储介质
CN105760456A (zh) 一种保持数据一致性的方法和装置
US20220129446A1 (en) Distributed Ledger Management Method, Distributed Ledger System, And Node
US20120191645A1 (en) Information processing apparatus and database system