WO2015088324A2 - System and method for managing a faulty node in a distributed computing system - Google Patents

System and method for managing a faulty node in a distributed computing system

Info

Publication number
WO2015088324A2
WO2015088324A2 PCT/MY2014/000206 MY2014000206W WO2015088324A2 WO 2015088324 A2 WO2015088324 A2 WO 2015088324A2 MY 2014000206 W MY2014000206 W MY 2014000206W WO 2015088324 A2 WO2015088324 A2 WO 2015088324A2
Authority
WO
WIPO (PCT)
Prior art keywords
faulty node
network
faulty
node
mode
Prior art date
Application number
PCT/MY2014/000206
Other languages
English (en)
Other versions
WO2015088324A3 (fr)
Inventor
Fairus Bin Khalid MOHAMMAD
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015088324A2
Publication of WO2015088324A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793: Remedial or corrective actions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems

Definitions

  • The present invention relates to distributed computing, and more particularly to a system and method for managing a faulty node in a clustered and distributed computing system.
  • Cluster computing has become an increasingly preferred type of distributed computing for deploying service applications in organizations, particularly for e-commerce-based systems.
  • Clustering provides higher availability, reliability, and scalability, in addition to its failover and load balancing capacity.
  • The failover and load balancing feature permits redistribution of workload between operating nodes and thus continuous availability of applications and data, even when one of the connected devices or nodes stops working or a faulty node occurs.
  • This feature prevents undesirable behavior of the faulty node from disrupting the other operating nodes within the cluster.
  • a great majority of these clustered and distributed systems are built as
  • FIG 1 shows a typical virtual clustering system, in which a plurality of virtual machines (VMs) can be distributed across multiple sites.
  • Each site has a service model management module, i.e. an Infrastructure as a Service (IaaS) management module, connected via a management network module, which is typically used for communications between the IaaS management module and the hosts.
  • The system further includes a production network module for communications between VMs and users.
  • the faulty VM might misbehave and disrupt other VM operations within the cluster.
  • the user may need to acquire permission from multiple hosts or administrators within the distributed system.
  • The present invention relates to a system for managing a faulty node, the system having a plurality of nodes interconnected on a network, the system comprising: a user interface module configured for communication with the faulty node; an out-of-band interface module configured to act as a console to access the faulty node and to permit the isolation, diagnosis and reinsertion of the faulty node into the network to be performed in an out-of-band manner; a network topology discovery module configured to provide network topology discovery including all network linkages which may be related to the faulty node; a switch registry module configured to extract and store all data related to switches within the network, including switch access information and credentials; a maintenance database module configured to store information associated with the faulty node; a switch gateway module configured to act as an interface to the switches within the network and to send commands to the switches connected to the faulty node; and a service model gateway module configured to act as a management interface for the system to the network.
  • The system is configured to establish a connection between a user and the faulty node and to perform the following functionalities: a fencing-on mode to isolate the faulty node from the network; a maintenance-on mode to diagnose the faulty node through the out-of-band interface module; a maintenance-off mode to take the faulty node out of the maintenance-on mode once it has been diagnosed and fixed; and a fencing-off mode to insert the node back into service and the network.
  • The system is further configured to receive information identifying the faulty node; query the validity and authorization of the user; extract information related to the faulty node; and send commands to disable or enable the ingress and egress of traffic associated with the faulty node.
  • The system is used for managing a faulty node in a virtual cluster system.
  • The present invention relates to a method for managing a faulty node in a system comprising a plurality of nodes interconnected on a network, the method comprising the steps of: establishing a connection between a user and the faulty node; performing a fencing-on mode to isolate the faulty node from the network; performing a maintenance-on mode to diagnose the faulty node through the out-of-band interface module; performing a maintenance-off mode to take the faulty node out of the maintenance-on mode once it has been diagnosed and fixed; and performing a fencing-off mode to insert the node back into service and the network.
  • The fencing-on mode comprises: receiving a request associated with the identification of the faulty node to be isolated; querying the validity and authorization of the request to isolate the faulty node; validating and authorizing the request; extracting information associated with the faulty node; scanning for and finding switches connected to the faulty node; storing information on the switches found; querying the switch access information and credentials of the network; sending a command to all switches connected to the faulty node, based on the information obtained in the previous step, to disable the ingress and egress of traffic of the faulty node; and marking the faulty node as being in an "isolated" state.
  • The maintenance-on mode comprises: receiving a request to diagnose the faulty node; querying the validity and authorization of the request; validating and authorizing the request; extracting information associated with the faulty node; establishing console access to the faulty node; querying the switch access information; sending a command to all switches connected to the faulty node, based on the information obtained in the previous step, to disable the ingress and egress of traffic of the faulty node; providing out-of-band access to the faulty node; and marking the faulty node as being in a "diagnose" state.
  • The maintenance-off mode comprises: receiving a request to remove the faulty node from the diagnose state; querying the validity and authorization of the request; validating and authorizing the request; querying information related to the faulty node and the switch access information; disabling console access to the faulty node; querying the switch information; sending a command to all switches connected to the faulty node, based on the information obtained in the previous step, to disable the ingress and egress of traffic of the faulty node; and marking the faulty node as being in an isolated state.
  • The fencing-off mode comprises: receiving a request to insert the faulty node back into service and the network; querying the validity and authorization of the request; querying information related to the faulty node and the switches connected to it; querying the switch information; sending commands to the switches to enable the ingress and egress of traffic to and from the node; and marking the faulty node as in service.
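  • The four modes above imply a small set of node states. The following is a minimal Python sketch of that lifecycle, not the patented implementation: the names `NodeState`, `TRANSITIONS` and `apply_mode` are hypothetical, and rejecting a mode that is requested out of order is an assumption that the text does not state.

```python
from enum import Enum


class NodeState(Enum):
    IN_SERVICE = "in service"
    ISOLATED = "isolated"
    DIAGNOSE = "diagnose"


# (current state, mode) -> state the node is marked with when the mode completes.
TRANSITIONS = {
    (NodeState.IN_SERVICE, "fencing-on"): NodeState.ISOLATED,
    (NodeState.ISOLATED, "maintenance-on"): NodeState.DIAGNOSE,
    (NodeState.DIAGNOSE, "maintenance-off"): NodeState.ISOLATED,
    (NodeState.ISOLATED, "fencing-off"): NodeState.IN_SERVICE,
}


def apply_mode(state: NodeState, mode: str) -> NodeState:
    """Return the next state, or raise if the mode does not apply (assumed rule)."""
    try:
        return TRANSITIONS[(state, mode)]
    except KeyError:
        raise ValueError(
            f"mode {mode!r} is not valid from state {state.value!r}"
        ) from None
```

  • Running the modes in the order fencing-on, maintenance-on, maintenance-off, fencing-off returns a node to the in-service state.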
  • FIG 1 shows an example of an overall architecture of a typical virtual clustering system.
  • FIG 2 illustrates the modules of the maintenance system in accordance with an embodiment of the present invention.
  • FIG 3 shows the virtual clustering system installed with the maintenance system in accordance with an embodiment of the present invention.
  • FIG 4 shows the main processes and functionalities of the maintenance system in accordance with an embodiment of the present invention.
  • FIG 5 illustrates a flowchart containing the steps involved in the fencing-on mode in accordance with an embodiment of the present invention.
  • FIG 6 illustrates a flowchart containing the steps involved in the maintenance-on mode in accordance with an embodiment of the present invention.
  • FIG 7 illustrates a flowchart containing the steps involved in the maintenance off mode in accordance with an embodiment of the present invention.
  • FIG 8 illustrates a flowchart containing the steps involved in the fencing off mode in accordance with an embodiment of the present invention.
  • the present invention generally relates to a maintenance system configured for managing a faulty node detected within a clustered and distributed system and a method for managing a faulty node within a clustered and distributed system.
  • FIG 2 illustrates one of the sites that employs a maintenance system in accordance with one embodiment of the present invention.
  • the maintenance system is adapted and configured to assist a user in managing a faulty node identified in a clustered and distributed system, the nodes being interconnected by a network.
  • The maintenance system 50 comprises a web service module 100, an out-of-band interface module 200, a network topology discovery module 300, a switch registry module 400, a maintenance database module 500, a switch gateway module 600 and a service model gateway module 700.
  • The web service module 100 can take the form of a web service or a web-based interface adapted for communication with a user's device in order to manage and fix the faulty node within the system.
  • The out-of-band interface module 200 is configured to provide or act as a console to access the faulty node within the system, while the network topology discovery module 300 is configured to provide network topology discovery including all network linkages which may be related to the faulty node.
  • The switch registry module 400 is configured to store all data related to switches; such information may include switch information, access information, capabilities and network segments.
  • The maintenance database module 500 is configured to store information associated with the faulty node.
  • The interface to the switches can take the form of a switch gateway module 600, which sends commands to the switches found to be connected to the faulty node to enable or disable the ingress and egress of traffic to and from the faulty node, while the service model gateway module 700 is configured to act as a management interface to the service model management within the system.
  • The preferred service model gateway depends on the type of service model adopted for the system, for example Infrastructure as a Service (IaaS).
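  • To make the relationship between the switch registry module 400 and the switch gateway module 600 concrete, here is a minimal Python sketch. All class names, field names and the command tuple format are hypothetical; the gateway only records the commands it would send, because the text does not specify how switches are actually contacted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchRecord:
    """One registry entry: switch information, access information, segments."""
    switch_id: str
    mgmt_address: str
    username: str
    password: str
    segments: tuple = ()


class SwitchRegistry:
    """Stores data related to switches, keyed by a switch identifier."""

    def __init__(self):
        self._records = {}

    def register(self, record: SwitchRecord) -> None:
        self._records[record.switch_id] = record

    def lookup(self, switch_id: str) -> SwitchRecord:
        return self._records[switch_id]


class SwitchGateway:
    """Interface to the switches; records commands instead of sending them."""

    def __init__(self, registry: SwitchRegistry):
        self.registry = registry
        self.sent = []

    def set_traffic(self, switch_id: str, node_ip: str, enabled: bool) -> None:
        # The registry supplies the access information needed to reach the switch.
        record = self.registry.lookup(switch_id)
        action = "enable" if enabled else "disable"
        self.sent.append((record.mgmt_address, action, node_ip))
```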
  • FIG 3 illustrates a virtual clustering system that includes the maintenance system 50 being interconnected within a network in accordance with one embodiment of the present invention.
  • The virtual clusters are interconnected by a network, and may be adapted to host at least one application or a site on the network. Each host has an IP address that corresponds to the applications hosted within the cluster.
  • The virtual clustered system includes a user device 410, a plurality of VMs 420 disposed within one host 440 within one site 430, a plurality of hosts 440 for each site, a management network means 460, an IaaS management component 470, a host switch 475, a switch 476 connected to the maintenance system 50, a default production network 485 for communication between the VMs 420, and a maintenance network 490 for connecting the user 410 to the site containing the faulty VM 480.
  • users can access the site, which contains a faulty VM 480, the faulty node's network, IaaS and maintenance services network.
  • the maintenance system 50 of the present invention provides access to the network by establishing communication with the management network module 460 within the cluster.
  • The communication between the faulty VM 480 and the user 410 is therefore established via the maintenance system 50 using a maintenance network link 490, which may be scanned and discovered to be linked to the faulty VM 480.
  • Once installed within the system, the maintenance system carries out a process for managing the faulty VM 480 in accordance with an embodiment of the present invention.
  • the process may be performed automatically while being administered by a user device 410 upon installation of the maintenance system 50 within the system.
  • The maintenance system 50 in accordance with an embodiment of the present invention is configured to provide fencing-on, maintenance-on, maintenance-off and fencing-off modes or functionalities.
  • Fencing ON mode at step 401 enables the owner or user of the faulty VM 480 to isolate the faulty VM 480 at step 402 from a respective network so as to prevent disruption to other nodes.
  • The maintenance mode is switched ON at step 403, whereby the faulty node may be diagnosed or terminated at step 404; during this process, the maintenance system 50 enables the owner or user to access and diagnose the faulty VM 480 through the out-of-band interface 200.
  • FIG 5 illustrates a flowchart containing the steps involved in the fencing-on mode 401 from FIG 4, in accordance with an embodiment of the present invention.
  • The maintenance system 50 receives a request from a user 410 associated with the identification of the faulty VM 480 to be isolated at step 501.
  • The maintenance system 50 queries the IaaS management module 450 within the system via the IaaS gateway module 700 at step 502 to check the validity and authorization of the request from the user 410. If the IaaS user, being the user device 410, and the request are validated and authorized at step 503, the maintenance system 50 proceeds to extract information associated with the faulty VM 480, also via the IaaS gateway module 700; such information includes the number of network interfaces, IP addresses and host information, at step 504. If the user 410 is not validated or authorized, the request to isolate the faulty VM 480 is rejected and the fencing-on mode is halted at step 510.
  • the network topology module 300 of the maintenance system 50 finds or scans for the switches connected to the faulty VM's 480 IP addresses.
  • the extracted information from the previous step is stored in a maintenance database 500 at step 506.
  • The maintenance system 50 queries the switch access information and credentials of the network from the switch registry module 400 at step 507, and then sends a command, via the switch gateway module 600, to all switches connected to the IP addresses of the faulty VM 480, based on the information obtained in the previous step, to disable the ingress and egress of traffic of those IP addresses, at step 508.
  • the VM is marked as "isolated" 402 by the maintenance system 50.
  • the process is terminated at step 510.
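  • The fencing-on flow of FIG 5 can be sketched as a single Python function. This is an illustrative sketch under stated assumptions: the function name, every parameter name and the dictionary layout of the maintenance database are hypothetical, and the IaaS gateway, topology discovery and switch gateway are passed in as plain callables rather than real services.

```python
def fencing_on(user, vm_id, *, is_authorized, get_vm_info, find_switches,
               registry, maintenance_db, send_command):
    """Isolate a faulty VM; returns True if isolated, False if rejected."""
    # Steps 502-503: check the user and the request with the IaaS management.
    if not is_authorized(user, vm_id):
        return False  # request rejected, mode halted

    # Step 504: extract VM information (IP addresses, host) via the IaaS gateway.
    info = get_vm_info(vm_id)

    # Topology discovery: find the switches connected to the VM's IP addresses.
    switches = find_switches(info["ip_addresses"])

    # Step 506: store the extracted information in the maintenance database.
    maintenance_db[vm_id] = {
        "ip_addresses": list(info["ip_addresses"]),
        "host": info["host"],
        "switches": list(switches),
        "state": "in service",
    }

    # Steps 507-508: fetch access details, then block ingress and egress traffic.
    for switch in switches:
        credentials = registry[switch]
        for ip in info["ip_addresses"]:
            send_command(switch, credentials, "disable", ip)

    # Finally the VM is marked as "isolated".
    maintenance_db[vm_id]["state"] = "isolated"
    return True
```

  • Passing the collaborators as callables keeps the sketch independent of any particular IaaS or switch vendor interface, which the text leaves open.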
  • Upon isolation of the faulty VM 480, the maintenance system 50 continues to a maintenance-on mode 403 from FIG 4, as illustrated in FIG 6.
  • This mode starts at step 601, where the maintenance system 50 receives a request from the user 410 to diagnose the faulty VM 480. It then queries the IaaS management 450, which may be done via the IaaS gateway module 700, at step 602 to check the validity and authorization of the user 410 and of the request to conduct the operation.
  • The maintenance system 50 continues to query the maintenance database 500 for the faulty VM 480 information and connected switches at step 604, also via the IaaS gateway module 700.
  • The maintenance system 50 then proceeds to establish console access via the out-of-band module 200 to the faulty VM 480 at step 605, whereby the console access is established from the maintenance system 50 to the host holding the faulty VM 480.
  • the maintenance system 50 queries the switches access information and credentials from the switch registry 400 at step 606.
  • The maintenance system 50 then sends commands at step 607 to all switches which are connected to the faulty VM 480, based on the information obtained in the previous step, via the switch gateway module 600, so as to enable the ingress and egress of console traffic between the user device 410 and the maintenance system 50.
  • The out-of-band interface module 200 then enables the user device 410 at step 608 to access the network via the maintenance system 50.
  • The user 410 can then begin the diagnosis process at step 609, where the faulty VM 480 is marked as being in a "Diagnose" 404 status.
  • the maintenance ON mode ends at step 610.
  • FIG 7 illustrates the maintenance-off 405 mode from FIG 4 provided by the maintenance system 50 in accordance with an embodiment of the present invention.
  • The mode starts at step 701, where the maintenance system 50 receives a request from the user 410 indicating that the faulty VM 480 is now fixed and is to be taken out of the "diagnose" 404 state. Proceeding to step 702, the maintenance system 50 queries the IaaS management 450 to check the validity and authorization of the operation via the IaaS gateway module 700. If the user 410 is not validated or authorized, the mode is terminated at step 709.
  • Once the user is validated and authorized at step 703, the maintenance system 50 queries the maintenance database module 500 for information related to the VM 480 and its connected switches at step 704, again via the IaaS gateway module 700. Then, at step 705, the system 50 disables console access to the VM 480 through the respective host 440 holding the VM 480. In the next step 706, the maintenance system 50 queries the switch access information and credentials from the switch registry module 400. The maintenance system 50 then sends commands via the switch gateway module 600 to all switches connected to the IP address of the VM 480 so as to disable the ingress and egress traffic of the IP addresses associated with the VM 480, at step 707. Next, the VM 480 is marked "isolated" 402 at step 708 by the maintenance system 50. The maintenance-off mode ends at step 709.
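  • The maintenance-on and maintenance-off modes bracket a diagnosis session, which maps naturally onto a Python context manager. The sketch below is a simplification under stated assumptions: `maintenance_session`, `open_console` and `close_console` are hypothetical names, the authorization checks and switch commands of FIG 6 and FIG 7 are omitted for brevity, and the requirement that the VM already be isolated is inferred from the order of the modes rather than stated explicitly.

```python
from contextlib import contextmanager


@contextmanager
def maintenance_session(vm_id, maintenance_db, open_console, close_console):
    """Open console access for diagnosis and always close it afterwards."""
    record = maintenance_db[vm_id]
    if record["state"] != "isolated":
        # Assumed precondition: fencing-on has already isolated the VM.
        raise RuntimeError("VM must be isolated before maintenance-on")

    # Maintenance-on: console access to the host holding the VM (step 605),
    # then the VM is marked as being diagnosed (step 609).
    console = open_console(record["host"])
    record["state"] = "diagnose"
    try:
        yield console
    finally:
        # Maintenance-off: console access is disabled (step 705) and the VM
        # is marked "isolated" again (step 708).
        close_console(console)
        record["state"] = "isolated"
```

  • Using `try`/`finally` means the VM returns to the isolated state even if the diagnosis code raises, so a failed session cannot leave console access open.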
  • FIG 8 illustrates the fencing-off mode 406 from FIG 4, provided by the maintenance system 50 in accordance with an embodiment of the present invention.
  • The mode starts at step 801, where the maintenance system 50 receives a request from a user 410 indicating that the faulty VM 480 is now fixed and is to be inserted back into service. Proceeding to step 802, the system 50 queries the IaaS management 450 to check the validity and authorization of the user 410 and the request via the IaaS gateway module 700. Once validated and authorized at step 803, the maintenance system 50 queries the maintenance database module 500 for information related to the VM 480 and its connected switches at step 804. At step 805, the switch access information and credentials for the faulty VM 480 are extracted from the switch registry module 400.
  • the maintenance system 50 then sends commands to all switches connected to the VM 480 IP address via the switch gateway 600 so as to enable the ingress and egress traffic of the IP addresses associated to the VM 480, at step 806.
  • The VM 480 is unmarked as "isolated" at step 807 by the system 50.
  • the fencing-off 406 mode ends at step 808.
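  • The fencing-off flow of FIG 8 differs from fencing-on mainly in that it reuses the switch list already stored in the maintenance database instead of scanning the topology again. A minimal Python sketch, with hypothetical function, parameter and dictionary key names, and with "in service" as an assumed label for the state after the "isolated" mark is removed:

```python
def fencing_off(user, vm_id, *, is_authorized, maintenance_db, registry,
                send_command):
    """Reinsert a fixed VM into service; returns False if the request is rejected."""
    # Steps 802-803: check the user and the request with the IaaS management.
    if not is_authorized(user, vm_id):
        return False

    # Step 804: read the stored VM information and its connected switches.
    record = maintenance_db[vm_id]

    # Steps 805-806: fetch access details, then re-enable ingress and egress.
    for switch in record["switches"]:
        credentials = registry[switch]
        for ip in record["ip_addresses"]:
            send_command(switch, credentials, "enable", ip)

    # Step 807: the "isolated" mark is removed.
    record["state"] = "in service"
    return True
```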

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to a system (50) for managing a faulty node (480) in a system comprising a plurality of nodes interconnected on a network. The system (50) comprises: a user interface module (100); an out-of-band interface module (200); a network topology discovery module (300); a switch registry module (400); a maintenance database module (500); a switch gateway module (600); and a service model gateway module (700) configured to act as a management interface for the system (50) to the network.
PCT/MY2014/000206 2013-12-09 2014-06-27 System and method for managing a faulty node in a distributed computing system WO2015088324A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2013004416A MY177535A (en) 2013-12-09 2013-12-09 System and method for managing a faulty node in a distributed computing system
MYPI2013004416 2013-12-09

Publications (2)

Publication Number Publication Date
WO2015088324A2 (fr) 2015-06-18
WO2015088324A3 (fr) 2015-09-03

Family

ID=51703373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000206 WO2015088324A2 (fr) 2013-12-09 2014-06-27 System and method for managing a faulty node in a distributed computing system

Country Status (2)

Country Link
MY (1) MY177535A (fr)
WO (1) WO2015088324A2 (fr)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7355966B2 (en) * 2003-07-16 2008-04-08 Qlogic, Corporation Method and system for minimizing disruption in common-access networks
US7478152B2 (en) * 2004-06-29 2009-01-13 Avocent Fremont Corp. System and method for consolidating, securing and automating out-of-band access to nodes in a data network
US8380828B1 (en) * 2010-01-21 2013-02-19 Adtran, Inc. System and method for locating offending network device and maintaining network integrity

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10754720B2 (en) 2018-09-26 2020-08-25 International Business Machines Corporation Health check diagnostics of resources by instantiating workloads in disaggregated data centers
US10761915B2 (en) 2018-09-26 2020-09-01 International Business Machines Corporation Preemptive deep diagnostics and health checking of resources in disaggregated data centers
US10831580B2 (en) 2018-09-26 2020-11-10 International Business Machines Corporation Diagnostic health checking and replacement of resources in disaggregated data centers
US10838803B2 (en) 2018-09-26 2020-11-17 International Business Machines Corporation Resource provisioning and replacement according to a resource failure analysis in disaggregated data centers
US11050637B2 (en) 2018-09-26 2021-06-29 International Business Machines Corporation Resource lifecycle optimization in disaggregated data centers
US11188408B2 (en) 2018-09-26 2021-11-30 International Business Machines Corporation Preemptive resource replacement according to failure pattern analysis in disaggregated data centers
CN112214466A (zh) * 2019-07-12 2021-01-12 海能达通信股份有限公司 分布式集群系统及数据写入方法、电子设备、存储装置
CN112214466B (zh) * 2019-07-12 2024-05-14 海能达通信股份有限公司 分布式集群系统及数据写入方法、电子设备、存储装置
CN115134213A (zh) * 2021-03-25 2022-09-30 中国移动通信集团安徽有限公司 一种容灾方法、装置、设备及存储介质
CN115134213B (zh) * 2021-03-25 2023-09-05 中国移动通信集团安徽有限公司 一种容灾方法、装置、设备及存储介质

Also Published As

Publication number Publication date
WO2015088324A3 (fr) 2015-09-03
MY177535A (en) 2020-09-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14784125

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14784125

Country of ref document: EP

Kind code of ref document: A2