WO2016058307A1 - Resource fault handling method and apparatus - Google Patents

Resource fault handling method and apparatus

Info

Publication number
WO2016058307A1
WO2016058307A1 (application PCT/CN2015/072923)
Authority
WO
WIPO (PCT)
Prior art keywords
resource
specified resource
node
specified
service
Prior art date
Application number
PCT/CN2015/072923
Other languages
English (en)
French (fr)
Inventor
陈重文
宋亚东
谢型果
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2016058307A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663Performing the actions predefined by failover planning, e.g. switching to standby network elements

Definitions

  • the present invention relates to the field of communications, and in particular to a method and apparatus for processing a fault of a resource.
  • Network attached storage systems are widely used in enterprise management platforms. The security and reliability of their performance can be directly related to the daily operations of enterprises. Therefore, network attached storage systems need to ensure stable and high availability.
  • According to statistics compiled by Gartner, the causes of abnormal system operation fall mainly into the following categories: application problems (40%), operation problems (40%), operating system failures (10%), and hardware failures (10%).
  • For a network attached storage cluster system, in many cases the abnormality lies only in the software or hardware of a single front-end access network port or a single back-end storage resource. In such a scenario, every module on the node other than the failed one continues to run normally, yet the technical solution adopted in the prior art isolates the entire node and transfers its services to other normally running nodes. This makes the whole takeover process complicated, and the probability of error increases accordingly; the entire takeover also takes a long time, and after it succeeds the load of the takeover node increases correspondingly, putting pressure on the entire storage service process.
  • In addition, in current network storage clusters the fault management module mainly manages the storage resources on its local node; an abnormality in the module itself is handled by re-electing among the nodes to produce a new takeover node. The best-known election algorithm is Paxos, which is used in several open-source projects, but single-instance election among basic node objects cannot solve the election of multiple specific object resources within a node.
  • In the related art, although a resource failure on a node is in many cases only a partial failure, the node is still isolated and its services are transferred to other takeover nodes. This makes the takeover process complicated and error-prone and also increases the load of the takeover node; no effective solution to this problem has yet been proposed.
  • the present invention provides a resource fault processing method and apparatus.
  • According to an embodiment of the present invention, a method for processing a fault of a resource includes: monitoring whether a specified resource of a node in a network storage cluster system is faulty, where the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system; and, when the specified resource fails, selecting a target object that takes over the specified resource according to a preset policy.
  • Preferably, monitoring whether the specified resource of a node in the network storage cluster system is faulty includes: dividing the resources of all nodes in the network storage cluster system by resource type; configuring the resources of the same resource type across all nodes as one service group; and determining whether the specified resource is faulty by detecting the status of the specified resource in the service group.
  • Preferably, the specified resource is determined to be faulty when its physical network port status changes from the active state to the standby state.
  • Preferably, selecting the target object that takes over the specified resource includes: selecting, in the service group where the specified resource is located, a service unit that takes over the specified resource; and using the node where that service unit is located as the target object.
  • Preferably, the service unit that takes over the specified resource is selected in the service group where the resource is located in one of the following ways: selecting the service unit from the service group according to a preset priority; or selecting the service unit according to the value of the IP addresses of the service units in the service group.
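  • For illustration only, the two selection policies above (preset priority, or smallest IP address value) can be sketched as follows. The `ServiceUnit` structure, its field names, and the helper functions are assumptions for this sketch, not part of the disclosure.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

@dataclass
class ServiceUnit:
    name: str
    ip: str        # hypothetical management IP of the node hosting the unit
    priority: int  # hypothetical preset priority; lower value = preferred

def select_by_priority(standby_units):
    # Policy 1: pick the STANDBY service unit with the best preset priority.
    return min(standby_units, key=lambda u: u.priority)

def select_by_ip(standby_units):
    # Policy 2: pick the STANDBY service unit whose IP address has the
    # smallest numeric value; comparing IPv4Address objects avoids the
    # pitfall of lexicographic string comparison ("20" < "4").
    return min(standby_units, key=lambda u: IPv4Address(u.ip))

units = [ServiceUnit("su1", "10.0.0.12", 1), ServiceUnit("su2", "10.0.0.3", 2)]
assert select_by_priority(units).name == "su1"
assert select_by_ip(units).name == "su2"
```

  • Either policy yields a deterministic choice across nodes, which matters because every surviving node must agree on the same takeover target without extra coordination.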
  • Preferably, after the target object takes over the failed specified resource, the method further includes: saving the switching information of the specified resource, where the switching information includes at least one of the following: the original node information of the specified resource and the resource type corresponding to the specified resource; and, when the original node where the specified resource is located recovers from the fault, switching the specified resource back to the original node according to the switching information.
  • According to another embodiment of the present invention, a fault processing apparatus for a resource is provided, including: a monitoring module, configured to monitor whether a specified resource of a node in the network storage cluster system is faulty, where the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system; and a selecting module, configured to select, when the specified resource fails, a target object that takes over the specified resource according to a preset policy.
  • Optionally, the monitoring module includes: a dividing unit, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit, configured to configure the resources of the same resource type across all nodes as one service group; and a determining unit, configured to determine whether the specified resource is faulty by detecting the status of the specified resource in the service group.
  • Optionally, the determining unit is configured to determine that the specified resource is faulty when the physical network port status of the specified resource changes from the active state to the standby state.
  • Optionally, the selecting module includes: a selecting unit, configured to select, in the service group where the specified resource is located, a service unit that takes over the specified resource; and a determining unit, configured to use the node where the service unit is located as the target object.
  • Through the present invention, after the resources on a node are classified, only the failed resource is transferred to another node when a specified resource fails. This solves the problem in the related art that, although a resource failure on a node is in many cases only a partial failure, the node is still isolated and its services are transferred to other takeover nodes, making the takeover process complicated and error-prone and increasing the load of the takeover node. The takeover process is thus simplified, the error rate is reduced, and the load burden of the takeover node is also reduced.
  • FIG. 1 is a flowchart of a method for processing a fault of a resource according to an embodiment of the present invention
  • FIG. 2 is a structural block diagram of a fault processing apparatus for a resource according to an embodiment of the present invention
  • FIG. 3 is a block diagram showing another structure of a fault processing apparatus for resources according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a resource protection group model in accordance with a preferred embodiment of the present invention.
  • FIG. 5 is a flowchart of fault processing of resources according to a preferred embodiment of the present invention.
  • FIG. 6 is a flowchart of a resource switchback according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for processing a fault of a resource according to an embodiment of the present invention. As shown in FIG. 1 , the process includes the following steps:
  • Step S102: monitor whether the specified resource of a node in the network storage cluster system is faulty, where the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system;
  • Step S104: when the specified resource fails, select a target object that takes over the specified resource according to a preset policy.
  • Through the above steps, the problem in the related art is solved that a resource failure on a node, though only partial, still leads to isolating the node and transferring its services to other takeover nodes, which makes the takeover process complicated and error-prone and increases the load of the takeover node. The takeover process is simplified, the error rate is reduced, and the load burden of the takeover node is also reduced. Because the takeover node takes over only the problematic part of the resources and the node where the fault is located is not isolated, multi-end loading of resources must be avoided so that service consistency is guaranteed and services are provided continuously.
  • The foregoing step S102 may be implemented in multiple manners. In an optional implementation, the following technical solution may be adopted: divide the resources of all nodes in the network storage cluster system by resource type; configure the resources of the same resource type across the nodes as one service group; and determine whether the specified resource is faulty by detecting the status of the specified resource in the service group. That is, the resources of the same resource type in all nodes of the network storage cluster system are logically grouped, and for each service group it is detected whether resources of that resource type are faulty. Because one service group corresponds to one resource type, the fault type can be detected conveniently and quickly, and the resources are conveniently managed.
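  • For illustration only, the grouping of cluster resources into per-type service groups can be sketched as follows. The cluster layout, resource names, and type labels are assumptions for this sketch, not part of the disclosure.

```python
from collections import defaultdict

def build_service_groups(cluster):
    """Group the resources of all nodes by resource type: one service
    group per resource type, spanning every node in the cluster."""
    groups = defaultdict(list)
    for node, resources in cluster.items():
        for resource_name, resource_type in resources:
            groups[resource_type].append((node, resource_name))
    return dict(groups)

# Hypothetical two-node cluster with front-end ports and back-end disks.
cluster = {
    "node1": [("eth0", "net_port"), ("vdisk0", "virtual_disk")],
    "node2": [("eth0", "net_port"), ("vdisk1", "virtual_disk")],
}
groups = build_service_groups(cluster)
assert set(groups) == {"net_port", "virtual_disk"}
assert ("node2", "eth0") in groups["net_port"]
```

  • Because each group holds exactly one resource type, a fault detector walking one group knows the fault type without further classification.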
  • the physical network port of the specified resource has two states: the active state (ACTIVE) and the standby state (STANDBY).
  • Optionally, the foregoing step S104 may be implemented by selecting, in the service group where the specified resource is located, a service unit that takes over the specified resource, and using the node where the service unit is located as the target object. Because resources are grouped by type, the service unit corresponding to a resource of the same resource type can be found within the same service group; after the service unit is determined, the node where it is located is the target object (which can also be understood as the takeover node).
  • In order to switch the specified resource back after its original node recovers, the embodiment of the present invention further provides the following technical solution: after the target object takes over the specified resource, the switching information of the specified resource is saved, where the switching information includes at least one of the following: the original node information of the specified resource and the resource type corresponding to the specified resource; when the fault on the original node where the specified resource is located is recovered, the specified resource is switched back to the original node according to the switching information.
  • The following preferred embodiments of the present invention provide a high availability mechanism for a network attached storage cluster, which addresses the problems of data loss, network load, and multi-end loading of resources on network attached storage nodes.
  • Service instance: the basic unit of a protected resource (which can be understood as a resource in the above service group).
  • In the network attached storage cluster, a service instance corresponds to a collection of network virtual network ports and virtual disk objects. The virtual network port is an abstraction of the aggregation of several physical network ports that provide network connections and is unique within the entire cluster. The virtual network port is bound to the physical network port in the ACTIVE state, and that physical network port carries all services on the external virtual network port. When it fails, the configuration policy is used to select a target object from the set of protected resources in the STANDBY state to take over, ensuring that the virtual network port does not interrupt external service.
  • Service unit: a fully functional entity deployed on each node in the cluster to carry service instance assignments. Each node in the storage cluster system carries two service instances, consisting of a front-end network port and a back-end virtual disk object. Assuming there are N nodes in the current network attached storage cluster system, one service unit can undertake at most N ACTIVE service instance assignments and N STANDBY service instance assignments.
  • Service group: a collection of objects of the same resource type on one or more service units; the corresponding specific objects in multiple service groups form a service unit. For example, the set of all physical network ports that carry a virtual network port's service constitutes the service group of that virtual network port. The active/standby policies of the service groups are completely independent of each other and do not affect each other. Each service group has its own unique identifier, which is specified at creation time and is unique within the network attached storage cluster system.
  • Home node: the storage front-end and back-end virtual resources are specified at creation time, and the same virtual resource can belong to only one node, its home node. Under normal circumstances the service unit object on the home node is preferentially selected for the ACTIVE service instance assignment.
  • Configuration policy: the front-end and back-end virtual resources are specified at creation time, and when an abnormality occurs the takeover service unit object is selected according to the policy. By default, the service unit corresponding to the smaller IP address value takes over. An interface is also provided to support manual intervention: different weights can be configured for the service unit objects, and the service unit with the larger weight preferentially takes over the failed resource.
  • Takeover node (which can be understood as the target object of the foregoing embodiment): when the ACTIVE service unit of a front-end or back-end resource is abnormal, an election is initiated among the STANDBY nodes according to the configuration policy and a new ACTIVE service unit object is generated; the node of that service unit object is called the takeover node.
  • Main decision node: the node where the ACTIVE service instance of the fault management module is elected when the module is powered on. When the fault management module itself becomes abnormal, the election is re-initiated and a new ACTIVE fault management service instance assignment is generated; the node where the new service instance resides is the new main decision node.
  • The technical solution provided by the preferred embodiment of the present invention can be summarized as follows: by defining a protected resource model and a fault management framework, the network attached storage front-end network and back-end storage resources are managed to achieve high availability of the entire storage cluster's resources.
  • Heartbeat monitoring is performed on the protected resources. Once the monitoring module senses an abnormality, it raises an alarm to the fault management module. When the fault management module receives the alarm, it determines the resource to take over according to the priority configured for the protected resource and performs the takeover, ensuring continuity of the external service; the switching information of the abnormal resource is recorded at the same time, and the state of the faulty module is automatically synchronized to the protected resource group. After the fault is repaired, the monitoring module senses the recovery and issues a fault recovery request to the fault management module, which performs the corresponding switchback operation according to the recorded switching information of the abnormal resource.
  • The resource protection group model may be roughly described as follows: a monitoring module resident on each node is responsible for heartbeat monitoring management, and when an abnormality occurs an election is held within the service group according to the configuration policy. The module resides on each node in the form of a daemon thread. The earliest powered-on node becomes the main decision node; if multiple nodes are powered on at the same time, the node with the smaller IP address value is elected as the main decision node by comparing IP addresses.
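  • For illustration only, the election rule above (earliest power-on wins, ties broken by the smaller IP address value) can be sketched as follows. The candidate tuple layout is an assumption for this sketch, not part of the disclosure.

```python
from ipaddress import IPv4Address

def elect_main_decision_node(candidates):
    """candidates: list of (node_ip, power_on_time). The earliest
    powered-on node wins; a simultaneous power-on is broken by the
    smaller IP address value."""
    return min(candidates, key=lambda c: (c[1], IPv4Address(c[0])))[0]

assert elect_main_decision_node([("10.0.0.9", 100), ("10.0.0.2", 50)]) == "10.0.0.2"
# Simultaneous power-on: the smaller IP address value wins.
assert elect_main_decision_node([("10.0.0.9", 50), ("10.0.0.2", 50)]) == "10.0.0.2"
```

  • The tuple key makes the tie-break explicit: power-on time is compared first, and the IP value is consulted only when the times are equal.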
  • The nodes communicate with each other through Remote Procedure Call (RPC) messages. Normally, the main decision node initiates a heartbeat check and collects service unit status information from the other nodes according to the service group identifier.
  • The other nodes determine whether to send a beacon for a new election based on at least one of the following events: 1. whether the timed heartbeat check has exceeded the maximum check time; 2. whether the service unit currently in the ACTIVE state is abnormal. When one of these conditions is met, beacons are sent to the sites in all clusters to initiate an election for the ACTIVE service unit.
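  • For illustration only, the two re-election triggers above can be sketched as a single predicate. The parameter names are assumptions for this sketch, not part of the disclosure.

```python
def should_start_election(last_heartbeat, now, max_check_time, active_unit_ok):
    """A node sends a beacon to start a new ACTIVE election when either
    condition holds: the timed heartbeat check exceeded its maximum
    check time, or the current ACTIVE service unit is abnormal."""
    heartbeat_timed_out = (now - last_heartbeat) > max_check_time
    return heartbeat_timed_out or not active_unit_ok

assert should_start_election(0, 31, 30, active_unit_ok=True)      # heartbeat timeout
assert should_start_election(0, 10, 30, active_unit_ok=False)     # ACTIVE abnormal
assert not should_start_election(0, 10, 30, active_unit_ok=True)  # healthy, no election
```

  • Keeping the triggers in one predicate means every node evaluates the same condition against its local observations, so elections start consistently cluster-wide.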
  • The fault management module of the main decision node, elected by the fault management service identifier, manages the front-end and back-end resources of the entire storage system. The work of a service instance is performed by its ACTIVE service unit, on which all services are carried; the other service units are in the STANDBY state for that service instance. After an abnormality of the ACTIVE service unit is monitored, the fault management module is responsible for coordinating the entire takeover.
  • the specific process collaboration is implemented by the following process:
  • Step 1: configure a virtual network port service group and a virtual disk shared storage service group on each node, where the front-end virtual network service group is used for user storage network access and the back-end virtual disk storage service group is used to store shared storage data resources;
  • Step 2: designate a home node for every virtual resource and register the configured resources into the resource service units. Under normal circumstances, a virtual resource actually runs in the service unit on its home node, and that service unit is in the ACTIVE state;
  • Step 3: the monitoring module performs real-time heartbeat monitoring on all protected resource group resources and sends an alarm when it finds that a running resource in a protected resource group is abnormal;
  • Step 4: the fault management module receives the abnormality alarm and takes the currently abnormal running service unit resources in the service group offline;
  • Step 5: according to the current node and the service group identifier, select the target takeover service unit object according to the configuration policy, migrate the resource, record and save the migration, and set the new service unit to the ACTIVE state;
  • Step 6: after the abnormal front-end or back-end resource returns to normal, the resource service group is automatically updated and the fault management module is notified;
  • Step 7: the fault management module switches the running resource back according to the migration record made at the time of the abnormality, completing fault recovery while adjusting the status of the two service unit objects.
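  • For illustration only, the takeover portion of this process (steps 4 and 5) can be sketched as follows. The state labels and the pluggable `pick_target` policy are assumptions for this sketch, not part of the disclosure.

```python
def take_over(service_groups, group_id, failed_unit, pick_target):
    """Sketch of steps 4-5: take the abnormal ACTIVE unit offline,
    pick a STANDBY unit via the configured policy, promote it to
    ACTIVE, and return a migration record for the later switchback."""
    group = service_groups[group_id]
    group[failed_unit] = "OFFLINE"
    standby = [unit for unit, state in group.items() if state == "STANDBY"]
    target = pick_target(standby)
    group[target] = "ACTIVE"
    return {"group": group_id, "from": failed_unit, "to": target}

groups = {"net_port": {"su1": "ACTIVE", "su2": "STANDBY", "su3": "STANDBY"}}
record = take_over(groups, "net_port", "su1", pick_target=min)
assert groups["net_port"] == {"su1": "OFFLINE", "su2": "ACTIVE", "su3": "STANDBY"}
assert record == {"group": "net_port", "from": "su1", "to": "su2"}
```

  • Taking the failed unit offline before promoting a standby mirrors the offline-then-online ordering of the flow, which is what prevents two units from carrying the same resource at once.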
  • Through the above description, the preferred embodiment of the present invention achieves the following technical effects: through the resource protection group model, the cluster nodes are refined into front-end network resources and back-end storage resources, and in the scenario where only part of a node is abnormal, only the abnormal part is taken over while the normally running part of the node is retained. The fault management module itself joins the protected resource group for hot standby, which simplifies the system implementation and effectively handles an abnormality of the fault management module on the main decision node. The hot standby is deployed within the cluster system, making full use of the hosts' own computing power, improving the response speed of the takeover, and reducing cost.
  • In this embodiment, a fault processing apparatus for a resource is further provided, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated here. As used below, the term "module" may be a combination of software and/or hardware implementing a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 2 is a structural block diagram of a fault processing apparatus for a resource according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:
  • the monitoring module 20 is configured to monitor whether a specified resource of the node in the network storage cluster system is faulty, where the specified resource is a resource corresponding to the specified resource type in the pre-divided resource type in the network storage cluster system;
  • the selection module 22 is connected to the monitoring module 20, and is configured to select a target object that takes over the specified resource according to a preset policy when the specified resource fails.
  • Through the above apparatus, the problem in the related art is solved that a resource failure on a node, though only partial, still leads to isolating the node and transferring its services to other takeover nodes, which makes the takeover process complicated and error-prone and increases the load of the takeover node. The takeover process is simplified, the error rate is reduced, and the load on the takeover node is also reduced.
  • FIG. 3 is another structural block diagram of a fault processing apparatus for resources according to an embodiment of the present invention, as shown in FIG. 3:
  • In order to implement the above function of monitoring whether the specified resource of a node in the network storage cluster system is faulty, the monitoring module 20 may include the following units: a dividing unit 200, configured to divide the resources of all nodes in the network storage cluster system by resource type; a configuration unit 202, connected to the dividing unit 200 and configured to configure the resources of the same resource type across all nodes as one service group; and a determining unit 204, connected to the configuration unit 202 and configured to determine whether the specified resource is faulty by detecting the status of the specified resource in the service group. Specifically, the determining unit 204 is configured to determine that the specified resource is faulty when the physical network port status of the specified resource changes from the active state to the standby state.
  • Optionally, the selecting module 22 may further include: a selecting unit 220, configured to select, in the service group where the specified resource is located, a service unit that takes over the specified resource; and a determining unit 222, connected to the selecting unit 220 and configured to use the node where the service unit is located as the target object.
  • the target object in the selection module 22 can be understood as the takeover node of the above embodiment.
  • FIG. 4 is a schematic diagram of a resource protection group model according to a preferred embodiment of the present invention.
  • As shown in FIG. 4, the virtual network port service instance is protected and executed by the virtual network port service group, and the virtual disk service instance is protected and executed by the virtual disk service group. A solid arrow points to the ACTIVE service unit object, which actually carries the service; a dotted arrow points to a STANDBY service unit object, from which the new ACTIVE unit takes over the object when an exception occurs.
  • the service unit 3 is arranged to perform the ACTIVE work of the virtual network port service instance, and the service unit 1 and the service unit 2 perform the STANDBY work of the virtual network port service service instance.
  • The solid lines between the virtual network port service instance or the virtual disk service instance and a service unit represent ACTIVE assignments; the dotted connections represent STANDBY assignments.
  • the service unit 2 is arranged to perform the ACTIVE work of the virtual disk service instance, and the service unit 1 and the service unit 3 perform the STANDBY work of the virtual disk service instance.
  • FIG. 5 is a flowchart of fault processing of resources according to a preferred embodiment of the present invention, as shown in FIG. 5:
  • Step S502: the status of a protected resource on a node changes (triggered by a device fault or a man-machine command) from ACTIVE to STANDBY, and the monitoring agent module on the node is notified;
  • Step S504: the main decision node's monitoring module, communicating with each node's monitoring agent through the timed heartbeat, senses that a protected resource of the corresponding type is abnormal and sends a switching request to the fault management module of the local node;
  • Step S506: the fault management module notifies the proxy module of the abnormal home node to take the affected resources offline; after the resources are cleaned up, a resource-offline response is returned to the fault management module of the main decision node;
  • Step S508: the main decision node's fault management module receives the resource-offline response, elects the takeover node for the abnormal resource according to the configuration policy, and sends a resource-online request to the takeover node's proxy module;
  • Step S510: the takeover node's proxy module receives the resource-online request and, after performing the resource-online operation with the service module, replies to the main decision node's fault management module with a resource-online response;
  • Step S512: the main decision node's fault management module receives the resource-online response, considers the switchover complete, returns a switchover response to the monitoring module of the node, and the process ends.
  • FIG. 6 is a flowchart of a resource switchback according to a preferred embodiment of the present invention, as shown in FIG. 6:
  • Step S602: the status of a protected resource on a node changes (triggered by device fault recovery or a man-machine command) from STANDBY to ACTIVE, and the monitoring agent module on the node is notified;
  • Step S604: the main decision node's monitoring module, communicating with each node's monitoring agent through the timed heartbeat, senses that the protected resource of the corresponding type has recovered and sends a switchback request to the fault management module of the local node;
  • Step S606: the fault management module notifies the takeover node's proxy module to take the resource offline; after the resource is cleaned up, a resource-offline response is returned to the fault management module of the main decision node;
  • Step S608: the main decision node's fault management module receives the resource-offline response and sends a resource-online request to the proxy module of the original home node;
  • Step S610: the resource home node's proxy module receives the resource-online request and, after performing the resource-online operation with the service module, returns a resource-online response to the main decision node's fault management module;
  • Step S612: the main decision node's fault management module receives the resource-online response, considers the switchback complete, returns a switchback response to the monitoring module of the node, and the process ends.
  • In another embodiment, a storage medium storing the above-mentioned software is further provided, the storage medium including but not limited to: an optical disk, a floppy disk, a hard disk, an erasable memory, and the like.
  • In summary, the embodiment of the present invention achieves the following technical effects: the takeover process is simplified, the error rate is reduced, and the load burden of the takeover node is also reduced. That is, with the technical solution of the embodiment of the present invention, the takeover node takes over only the problematic part of the resources; because the node where the fault is located is not isolated, multi-end loading of resources must be avoided so that service consistency is ensured and services are provided continuously.
  • Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that herein, or they may be separately fabricated into individual integrated circuit modules, or multiple of these modules or steps may be fabricated into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
  • As described above, after the resources on a node are classified, only the failed resource is transferred to another node when a designated resource fails. This solves the problem in the related art that a resource failure on a node, though in many cases only partial, still leads to isolating the node and transferring its services to other takeover nodes, making the takeover process complicated and error-prone and increasing the load of the takeover node. The takeover process is simplified, the error rate is reduced, and the load on the takeover node is reduced.

Abstract

The present invention provides a resource fault processing method and apparatus. The fault processing method includes: monitoring whether a specified resource of a node in a network storage cluster system is faulty, where the specified resource is the resource corresponding to a specified resource type among resource types pre-divided in the network storage cluster system; and, when the specified resource fails, selecting a target object that takes over the specified resource according to a preset policy. The above technical solution solves the problem in the related art that, although a resource failure on a node is in many cases only a partial failure, the node is still isolated and its services are transferred to other takeover nodes, making the takeover process complicated and error-prone and increasing the load of the takeover node; the takeover process is simplified, the error rate is reduced, and the load burden of the takeover node is also reduced.

Description

Resource Fault Handling Method and Apparatus

Technical Field

The present invention relates to the field of communications, and in particular to a resource fault handling method and apparatus.

Background Art

Network attached storage systems are widely used in enterprise management platforms; the security and reliability of their performance bear directly on the daily operations of an enterprise, so network attached storage systems must guarantee stability and high availability.

According to statistics compiled by Gartner, the causes of abnormal system operation fall mainly into the following categories: application problems (40%), operation problems (40%), operating-system failures (10%), and hardware failures (10%). For a network attached storage cluster system, in many cases the abnormality may also lie in the software or hardware of a single front-end access network port or a single back-end storage resource. In such a scenario, every module on the node other than the abnormal one is still running normally, yet the solution adopted in the prior art is to isolate the entire node and transfer its services to other nodes that can run normally. That solution makes the whole takeover process complex, correspondingly raises the probability of error, takes a long time, and, after a successful takeover, correspondingly increases the load on the takeover node, putting pressure on the entire storage service.

Moreover, in current network storage clusters, the fault management module mainly manages the storage resources on its own node; abnormalities of the module itself are handled by re-electing among the nodes to produce a new takeover node. The best-known election algorithm is Paxos, used in many open-source projects, but its single-instance election of basic node objects cannot handle the election of multiple concrete resource objects within a node.

For the problem in the related art that, although in many cases a resource fault on a node is only a partial fault, the node is nevertheless isolated and its services are transferred to other takeover nodes, making the takeover process complex and error-prone and increasing the load on the takeover node, no effective solution has yet been proposed.
Summary of the Invention

To solve the above technical problem, the present invention provides a resource fault handling method and apparatus.

According to an embodiment of the present invention, a resource fault handling method is provided, including: monitoring whether a specified resource of a node in a network storage cluster system fails, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-partitioned in the network storage cluster system; and, when the specified resource fails, selecting, according to a preset policy, a target object to take over the specified resource.

Preferably, monitoring whether a specified resource of a node in the network storage cluster system fails includes: partitioning the resources of all nodes in the network storage cluster system by resource type; configuring the resources of the same resource type across all the nodes as one service group; and determining whether the specified resource has failed by detecting the state of the specified resource in the service group.

Preferably, the specified resource is determined to have failed in the following case: when the state of the physical network port of the specified resource changes from the active state to the standby state, the specified resource is determined to have failed.

Preferably, selecting, according to the preset policy, the target object to take over the specified resource includes: selecting, in the service group where the specified resource resides, a service unit to take over the specified resource; and taking the node where that service unit resides as the target object.

Preferably, the service unit to take over the specified resource is selected, in the service group where the resource resides, in one of the following ways: selecting the service unit from the service group according to a preset priority; or selecting the service unit according to the IP-address values of the service units in the service group.

Preferably, after the target object takes over the failed specified resource, the method further includes: saving switchover information of the specified resource, where the switchover information includes at least one of the following: information about the original node where the specified resource resided, and the resource type corresponding to the specified resource; and, when the original node of the specified resource recovers from its fault, switching the specified resource back to the original node according to the switchover information.
According to another embodiment of the present invention, a resource fault handling apparatus is further provided, including: a monitoring module, configured to monitor whether a specified resource of a node in a network storage cluster system fails, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-partitioned in the network storage cluster system; and a selection module, configured to select, when the specified resource fails, a target object to take over the specified resource according to a preset policy.

Preferably, the monitoring module includes: a partition unit, configured to partition the resources of all nodes in the network storage cluster system by resource type; a configuration unit, configured to configure the resources of the same resource type across all the nodes as one service group; and a determination unit, configured to determine whether the specified resource has failed by detecting the state of the specified resource in the service group.

Preferably, the determination unit is configured to determine that the specified resource has failed when the state of the physical network port of the specified resource changes from the active state to the standby state.

Preferably, the selection module includes: a selection unit, configured to select, in the service group where the specified resource resides, a service unit to take over the specified resource; and a determination unit, configured to take the node where the service unit resides as the target object.

Through the present invention, by classifying the resources on a node and, when a specified resource fails, transferring only the failed resource to another node, the technical solution solves the problem in the related art that, although in many cases a resource fault on a node is only a partial fault, the node is nevertheless isolated and its services are transferred to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node; it simplifies the takeover process, reduces the error rate, and also lightens the load burden on the takeover node.
Brief Description of the Drawings

The drawings described here are provided for a further understanding of the present invention and form a part of this application; the schematic embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a resource fault handling method according to an embodiment of the present invention;

Fig. 2 is a structural block diagram of a resource fault handling apparatus according to an embodiment of the present invention;

Fig. 3 is another structural block diagram of a resource fault handling apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the protected resource group model according to a preferred embodiment of the present invention;

Fig. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention;

Fig. 6 is a flowchart of resource switch-back according to a preferred embodiment of the present invention.
Detailed Description of the Embodiments

The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, provided they do not conflict, the embodiments of this application and the features in the embodiments may be combined with one another.

This embodiment provides a resource fault handling method. Fig. 1 is a flowchart of a resource fault handling method according to an embodiment of the present invention; as shown in Fig. 1, the flow includes the following steps:

Step S102: monitor whether a specified resource of a node in the network storage cluster system fails, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-partitioned in the network storage cluster system;

Step S104: when the specified resource fails, select, according to a preset policy, a target object to take over the specified resource.

Through the above steps, the technical solution of classifying the resources on a node and, when a specified resource of one of the classified types fails, transferring only that failed specified resource to another node solves the problem in the related art that, although in many cases a resource fault on a node is only a partial fault, the node is nevertheless isolated and its services are transferred to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node. It simplifies the takeover process, reduces the error rate, and also lightens the load burden on the takeover node. In other words, with the technical solution of the embodiment of the present invention, the takeover node takes over only the problematic part of the resources; since the node where the fault lies is not isolated, multi-end loading of the resources must be avoided so as to ensure consistency of the services and to keep providing services externally without interruption.
Optionally, the above step S102 can be implemented in several ways. In one example of the embodiment of the present invention, it can be implemented with the following technical solution: partition the resources of all nodes in the network storage cluster system by resource type; configure the resources of the same resource type across all the nodes as one service group; and determine whether the specified resource has failed by detecting the state of the specified resource in the service group. That is, the resources of the same resource type on all nodes of the network storage cluster system are logically grouped into one service group, and the resources in the service group of that resource type are checked for faults. Since one service group corresponds to one resource type, the fault type can be detected conveniently and quickly, and the resources are easier to manage.

Since every physical network port has two states, the running state (ACTIVE) and the standby state (STANDBY), the specified resource can be judged to have failed when the state of its physical network port changes from the running state to the standby state.
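Purely as an illustration (not part of the patent text), the ACTIVE-to-STANDBY fault judgment described above might look like the following Python sketch; the names `PortState` and `is_fault_transition` are hypothetical:

```python
from enum import Enum

class PortState(Enum):
    ACTIVE = "active"    # running state: the port carries the virtual-port services
    STANDBY = "standby"  # standby state: the port is a hot-standby protection resource

def is_fault_transition(old: PortState, new: PortState) -> bool:
    """A specified resource is judged faulty exactly when its physical
    network port drops from the running (ACTIVE) state to STANDBY."""
    return old is PortState.ACTIVE and new is PortState.STANDBY
```

Any other transition (STANDBY back to ACTIVE, or no change) is not treated as a fault in this reading of the text.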
In another optional embodiment of the present invention, the above step S104 can be implemented as follows: select, in the service group where the specified resource resides, a service unit to take over the specified resource; and take the node where that service unit resides as the target object. While the resources in a service group are being monitored, when the specified resource is detected to have failed, a service unit corresponding to a resource of the same resource type as the failed specified resource can be looked up within the same service group; once the service unit is determined, the node where it resides is the target object (which can also be understood as the takeover node).

To keep the services of the nodes in the system consistent, after the target object takes over the specified resource, the embodiment of the present invention further provides the following technical solution: after the target object takes over the specified resource, save the switchover information of the specified resource, where the switchover information includes at least one of the following: information about the original node where the specified resource resided, and the resource type corresponding to the specified resource; when the original node of the specified resource recovers from its fault, switch the specified resource back to the original node according to the switchover information.
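As a minimal sketch of the switchover-information bookkeeping just described (the class and field names are assumptions for illustration, not defined by the patent):

```python
from dataclasses import dataclass

@dataclass
class SwitchRecord:
    resource_id: str     # which specified resource was taken over
    home_node: str       # original node the resource belonged to
    resource_type: str   # e.g. "virtual_port" or "virtual_disk"

class SwitchLog:
    """Remembers where a failed resource came from so that it can be
    switched back once the original (home) node recovers."""
    def __init__(self):
        self._records = {}

    def save(self, record: SwitchRecord) -> None:
        # Saved at takeover time, per the text above.
        self._records[record.resource_id] = record

    def switch_back_target(self, resource_id: str) -> str:
        # On fault recovery, look up (and retire) the original home node.
        return self._records.pop(resource_id).home_node
```

A takeover would call `save(...)`, and the later switch-back would call `switch_back_target(...)` to find the original node.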
In summary, the embodiment of the present invention provides a high-availability mechanism for network attached storage clusters, which solves partial-fault problems of current network attached storage such as data loss on a running node, high network load, and multi-end loading of resources.

For a better understanding of the above resource fault handling process, a preferred embodiment is described below; it does not limit the embodiments of the present invention.

First, the terms involved in the preferred embodiment of the present invention are briefly explained as follows:

Service instance: the basic unit of a protected resource (which can be understood as a resource in the above service group); in a network attached storage cluster it corresponds to the set of virtual network port and virtual disk objects. Taking the virtual network port as an example: a virtual network port is an abstraction aggregating the several physical network ports currently providing network connectivity, and it is unique across the whole cluster. The virtual network port is bound to a physical port in the ACTIVE state, and that physical port carries all the services on the externally visible virtual port. When the ACTIVE physical port becomes abnormal, a target object is elected from the set of STANDBY protected resources according to the configuration policy to take over, ensuring that the external services on the virtual network port are not interrupted.

Service unit: an individual with complete functionality, deployed on each node of the cluster, which can accept service instance assignments. Each node of the storage cluster system contains a service unit composed of two service instances, a front-end network port and a back-end virtual disk object. Assuming the network attached storage cluster system currently has N nodes, a service unit can accept only N ACTIVE service instance assignments and N STANDBY service instance assignments.

Service group: a set composed of objects of the same resource type on one or more service units; the concrete objects of multiple service groups make up a service unit. Taking the virtual network port as an example, the set of all physical network ports carrying the virtual port's services constitutes the virtual port's service group. Each service group has its own active/standby policy; service groups are completely independent of one another and do not affect each other. Each service group has a unique identifier, assigned when the group is created and unique across the network attached storage cluster system.

Home node: specified when a front-end or back-end virtual storage resource is created; one virtual resource can belong to only one node, and at power-on the service unit object on the home node is preferred for the ACTIVE service instance assignment.

Configuration policy: specified when a front-end or back-end virtual resource is created; when a resource becomes abnormal, a service unit object is selected to take over according to this policy. By default, the service unit whose IP address has the smaller value takes over first; at the same time, an interface is provided to support manual intervention, configuring different weights on the service unit objects, with the object holding the larger weight taking over the failed resource first.
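The default selection rule of the configuration policy (smallest IP wins, unless a manually configured weight overrides it) could be sketched as follows; `pick_takeover_unit` and its parameters are illustrative names, not from the patent:

```python
import ipaddress

def pick_takeover_unit(candidates, weights=None):
    """candidates: list of (unit_name, ip_string) for the STANDBY service units.
    weights: optional manual overrides; the highest weight wins, and among
    equal weights the unit with the numerically smallest IP address wins."""
    weights = weights or {}
    return min(
        candidates,
        # Sort by descending weight first (hence the negation), then by
        # the integer value of the IP address, ascending.
        key=lambda c: (-weights.get(c[0], 0), int(ipaddress.ip_address(c[1]))),
    )[0]
```

With no weights configured, this reduces to the default "smaller IP value takes over first" rule.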
Takeover node (which can be understood as the target object of the above embodiments): when the ACTIVE service unit of a front-end or back-end resource becomes abnormal, an election is initiated among the STANDBY nodes according to the configuration policy, producing a new ACTIVE service unit object; the node where that service unit object resides is called the takeover node.

Master decision node: the node where the ACTIVE service instance elected at power-on of the fault management module resides. When the fault management module itself becomes abnormal, a new election is initiated, producing a new ACTIVE fault management service instance assignment; the node of the new service instance is the new master decision node.

The technical solution provided by the preferred embodiment of the present invention can be roughly summarized as follows: by defining a protected resource model and a fault management framework, the front-end network and back-end storage resources of the network attached storage are managed, achieving high availability of the resources of the entire storage cluster.

When part of the front-end or back-end resources becomes abnormal, heartbeat monitoring is applied to the partial abnormality of the protected resources. Once the monitoring module senses the abnormality, it raises an alarm to the fault management module. On receiving the alarm, the fault management module decides, according to the takeover priority of the protected resources, which resources need to be taken over, and performs the takeover, ensuring the continuity of external services; at the same time, it records the switchover information of the abnormal resource.

Optionally, after the fault is cleared, the fault state is automatically synchronized into the protected resource group; the monitoring module senses the fault recovery and issues a fault recovery request to the fault management module, and the fault management module performs the corresponding switch-back operation according to the switchover information of the abnormal resource.

In the technical solution provided above by the embodiment of the present invention, the protected resource group model can be roughly described as follows: a monitoring module is resident on each node, responsible for heartbeat monitoring management and, on abnormality, for election within the service group according to the configuration policy. The module is resident on each node as a daemon thread; the node that powers on earliest becomes the master decision node, and if multiple nodes power on at the same time, their IPs are compared and the node with the smaller IP address value is elected master decision node. Nodes communicate via Remote Procedure Call protocol (RPC) messages. Normally, the master decision node initiates the heartbeat check, collecting the state information of the service units on the other nodes by service group identifier; the other nodes decide whether to resend a beacon to start a new election based on at least one of the following events: 1. whether the periodic heartbeat check has exceeded the maximum check time; 2. whether the currently ACTIVE service unit has become abnormal. When either of the above conditions is met, a beacon is sent to all sites in the cluster to initiate election of the ACTIVE service unit.
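A compact sketch of the two rules just described, i.e. electing the master decision node (earliest power-on wins, smaller IP breaks ties) and deciding when a node should resend a beacon for a new election; all function names are hypothetical:

```python
import ipaddress

def elect_master(nodes):
    """nodes: list of (ip_string, power_on_time). The earliest powered-on
    node becomes the master decision node; on a simultaneous power-on,
    the node with the smaller IP-address value wins."""
    return min(nodes, key=lambda n: (n[1], int(ipaddress.ip_address(n[0]))))[0]

def needs_reelection(last_heartbeat, now, max_check, active_ok):
    # A node resends a beacon when the periodic heartbeat check has
    # exceeded the maximum check time, or the currently ACTIVE service
    # unit is known to be abnormal.
    return (now - last_heartbeat > max_check) or not active_ok
```

The RPC transport and per-service-group state collection are omitted here; only the decision logic is shown.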
The fault management module on the master decision node, elected by the fault management service identifier, manages the front-end and back-end resources of the entire storage system. Among the front-end and back-end resources, the ACTIVE service unit performs the work of the service instance and all services are carried on that service instance, while each of the other service units is in the STANDBY state for that service instance. After the ACTIVE service unit is monitored as abnormal, the fault management module is responsible for coordinating the entire takeover; the specific flow is implemented through the following process:

Step 1: configure the virtual network port and virtual disk shared storage service groups on each node; the front-end virtual network service group is used for user storage network access, and the back-end virtual disk storage service group is used to hold the shared storage data resources;

Step 2: assign a home node to every virtual resource and register the configured resources into the resource service units; normally, a virtual resource actually runs in the service unit on its home node, and that service unit is in the ACTIVE state;

Step 3: the monitoring module performs real-time heartbeat monitoring of all the resources in the protected resource groups, and raises an alarm as soon as a running resource in a protected resource group is found to be abnormal;

Step 4: on receiving the abnormality, the fault management module takes offline the service unit resource in the service group that is currently running abnormally;

Step 5: according to the current node and the service group identifier, select a target takeover service unit object according to the configuration policy, migrate to it, record and save the migration, and set the new service unit to the ACTIVE state;

Step 6: after the abnormal front-end or back-end resource returns to normal, the resource service group is updated automatically and the fault management module is notified;

Step 7: the fault management module switches the running resources back according to the migration record made at the time of the abnormality. The fault is recovered, and the states of the two service unit objects are adjusted at the same time.

The preferred embodiment of the present invention achieves the following technical effects. Through the protected resource group model, the cluster nodes are refined into front-end network resources and back-end storage resources; in the scenario where only part of a node's resources is abnormal, only the abnormal part of the node is taken over, while the normally running part of the node is retained. This improves overall performance and achieves effective utilization of network attached storage cluster resources; it meets the high-availability, stability, and scalability requirements of critical services and can be used for the fault detection, takeover decision, fault isolation and switchover, and recovery and extension required by hot-standby multi-machine high-availability storage clusters. By improving the Paxos algorithm to support multi-instance election keyed by node and service group identifier, election flexibility is increased; the fault management module itself joins a protected resource group for hot standby, which simplifies the system implementation and effectively solves the problem of the fault management module on the master decision node itself becoming abnormal. Deploying hot-standby hosts inside the cluster system makes full use of the hosts' own computing power, improves takeover response speed, and reduces cost.
This embodiment also provides a resource fault handling apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware implementing a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 2 is a structural block diagram of a resource fault handling apparatus according to an embodiment of the present invention; as shown in Fig. 2, the apparatus includes:

a monitoring module 20, configured to monitor whether a specified resource of a node in the network storage cluster system fails, where the specified resource is the resource corresponding to a specified resource type among the resource types pre-partitioned in the network storage cluster system; and

a selection module 22, connected to the monitoring module 20 and configured to select, when the specified resource fails, a target object to take over the specified resource according to a preset policy.

Through the combined action of the above modules, the technical solution of classifying the resources on a node and, when a specified resource of one of the classified types fails, transferring only that failed specified resource to another node solves the problem in the related art that, although in many cases a resource fault on a node is only a partial fault, the node is nevertheless isolated and its services are transferred to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node. It simplifies the takeover process, reduces the error rate, and also lightens the load burden on the takeover node.

Fig. 3 is another structural block diagram of a resource fault handling apparatus according to an embodiment of the present invention; as shown in Fig. 3:

To implement the above function of monitoring whether a specified resource of a node in the network storage cluster system fails, in an optional embodiment of the present invention the monitoring module 20 may include the following units: a partition unit 200, configured to partition the resources of all nodes in the network storage cluster system by resource type; a configuration unit 202, connected to the partition unit 200 and configured to configure the resources of the same resource type across all the nodes as one service group; and a determination unit 204, connected to the configuration unit 202 and configured to determine whether the specified resource has failed by detecting the state of the specified resource in the service group, where the determination unit 204 is configured to determine that the specified resource has failed when the state of the physical network port of the specified resource changes from the running state to the standby state.

Optionally, the selection module 22 may further include the following units: a selection unit 220, configured to select, in the service group where the specified resource resides, a service unit to take over the specified resource; and a determination unit 222, connected to the selection unit 220 and configured to take the node where the service unit resides as the target object.

In the embodiment of the present invention, the target object in the selection module 22 can be understood as the takeover node of the above embodiments.
The technical solution of the embodiments of the present invention is further elaborated in combination with the following preferred embodiments:

Fig. 4 is a schematic diagram of the protected resource group model according to a preferred embodiment of the present invention. As shown in Fig. 4, there are two service groups, the virtual network port service group and the virtual disk service group, and two service instances, the virtual network port service instance and the virtual disk service instance. The virtual network port service instance is protected and executed by the virtual network port service group, and the virtual disk service instance is protected and executed by the virtual disk service group. The solid arrows point to the ACTIVE service unit objects, which actually carry the services; the dashed arrows point to the STANDBY service unit objects, from which a new ACTIVE takeover object is assigned on abnormality.

From the schematic provided in Fig. 4 it can be seen that, within the virtual network port service group, service unit 3 is arranged to perform the ACTIVE work of the virtual network port service instance, while service units 1 and 2 perform its STANDBY work. In Fig. 4, the solid lines between the virtual network port service instance or the virtual disk service instance and the service units represent ACTIVE assignments; the dashed connections represent STANDBY assignments.

Within the virtual disk service group, service unit 2 is arranged to perform the ACTIVE work of the virtual disk service instance, while service units 1 and 3 perform its STANDBY work.
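The Fig. 4 arrangement can be mimicked with a small data structure, one service group per resource type with exactly one ACTIVE unit and the rest STANDBY; `ServiceGroup` and its methods are illustrative inventions, not part of the patent:

```python
class ServiceGroup:
    """One service group per resource type; exactly one service unit holds
    the ACTIVE assignment, and all the other units are STANDBY."""
    def __init__(self, group_id, units, active):
        assert active in units
        self.group_id = group_id
        self.units = set(units)
        self.active = active

    def standbys(self):
        return self.units - {self.active}

    def fail_over(self, new_active):
        # Promote a STANDBY unit when the ACTIVE one becomes abnormal.
        assert new_active in self.standbys()
        self.active = new_active

# The assignments shown in Fig. 4:
vport = ServiceGroup("vport", {"unit1", "unit2", "unit3"}, "unit3")
vdisk = ServiceGroup("vdisk", {"unit1", "unit2", "unit3"}, "unit2")
```

Because the two groups are independent objects, a failover in one does not touch the other, mirroring the independence of service groups described earlier.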
Fig. 5 is a flowchart of resource fault handling according to a preferred embodiment of the present invention. As shown in Fig. 5:

In the scenario where part of a node's resources is abnormal, the entire takeover flow triggered by the resource fault is as follows:

Step S502: the state of the protected service resource on the resource's home node changes (triggered by a device fault or by a man-machine command) from ACTIVE to STANDBY, and the monitoring agent module on the node is notified;

Step S504: the monitoring module on the master decision node, communicating with the monitoring agents of the nodes through periodic heartbeats, senses the abnormal state of the protected resource of the corresponding type and sends a switchover request to the fault management module on its own node;

Step S506: the fault management module notifies the agent module on the abnormal home node to take the affected resource offline; the agent performs the resource offline operation and, after cleaning up the resource, replies to the fault management module on the master decision node with a resource offline response;

Step S508: on receiving the resource offline response, the fault management module on the master decision node elects, according to the configuration policy, the takeover node for the abnormal resource and sends a resource online request to the agent module of the takeover node;

Step S510: on receiving the resource online request, the agent module of the target node performs the resource online operation toward the service module and then notifies the fault management module on the master decision node, replying with a resource online response;

Step S512: on receiving the resource online response, the fault management module on the master decision node considers the switchover complete and replies to the monitoring module on its own node with a switchover response; the flow ends.
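The flow of steps S502 to S512 can be condensed into a single coordinator function, sketched below under the assumption that each node exposes an agent with `offline`/`online` operations (the names `handle_fault`, `agents`, and `policy` are illustrative, not from the patent):

```python
def handle_fault(agents, policy, resource, home_node, standby_nodes, log):
    """Condensed sketch of steps S502-S512: take the resource offline on
    its home node, elect a takeover node by the configured policy, bring
    the resource online there, and record the migration for switch-back.
    `agents` maps node name -> object with offline()/online() methods."""
    agents[home_node].offline(resource)   # S506: offline and clean up on the home node
    takeover = policy(standby_nodes)      # S508: elect the takeover node
    agents[takeover].online(resource)     # S510: bring the resource online there
    log[resource] = home_node             # record the migration for later switch-back
    return takeover                       # S512: switchover considered complete
```

The RPC request/response pairs of the actual flow are collapsed into direct method calls for readability.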
Fig. 6 is a flowchart of resource switch-back according to a preferred embodiment of the present invention. As shown in Fig. 6:

In the scenario where the partial resource abnormality of a node is recovered, the entire switch-back flow triggered by the fault recovery is as follows:

Step S602: the state of the protected service resource on the resource's home node changes (triggered by device fault recovery or by a man-machine command) from STANDBY to ACTIVE, and the monitoring agent module on the node is notified;

Step S604: the monitoring module on the master decision node, communicating with the monitoring agents of the nodes through periodic heartbeats, senses the state recovery of the active protected resource of the corresponding type and sends a switchover request to the fault management module on its own node;

Step S606: the fault management module notifies the agent module of the takeover node to take the resource offline; after resource cleanup, the agent replies to the fault management module on the master decision node with a resource offline response;

Step S608: on receiving the resource offline response, the fault management module on the master decision node sends a resource online request to the agent module of the original home node;

Step S610: on receiving the resource online request, the agent module of the resource's home node performs the resource online operation toward the service module and replies to the fault management module on the master decision node with a resource online response;

Step S612: on receiving the resource online response, the fault management module on the master decision node considers the switchover complete and replies to the monitoring module on its own node with a switch-back response; the flow ends.
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.

In another embodiment, software is further provided, and the software is used to execute the technical solutions described in the above embodiments and preferred implementations.

In another embodiment, a storage medium is further provided, and the above software is stored in the storage medium; the storage medium includes but is not limited to: an optical disk, a floppy disk, a hard disk, a rewritable memory, and the like.

It should be noted that the terms "first", "second", and the like in the description and claims of the present invention and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that objects so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

In summary, the embodiments of the present invention achieve the following technical effects: the takeover process is simplified, the error rate is reduced, and the load burden on the takeover node is also lightened. In other words, with the technical solution of the embodiments of the present invention, the takeover node takes over only the problematic part of the resources; since the node where the fault lies is not isolated, multi-end loading of the resources must be avoided so as to ensure consistency of the services and to keep providing services externally without interruption.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from the order here, or they may be made into individual integrated circuit modules, or multiple of the modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above is merely the preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

Industrial Applicability

Based on the above technical solution provided by the embodiments of the present invention, by classifying the resources on a node and, when a specified resource fails, transferring only the failed resource to another node, the technical solution solves the problem in the related art that, although in many cases a resource fault on a node is only a partial fault, the node is nevertheless isolated and its services are transferred to other takeover nodes, which makes the takeover process complex and error-prone and increases the load on the takeover node; it simplifies the takeover process, reduces the error rate, and also lightens the load burden on the takeover node.

Claims (10)

  1. A resource fault handling method, comprising:
    monitoring whether a specified resource of a node in a network storage cluster system fails, wherein the specified resource is the resource corresponding to a specified resource type among resource types pre-partitioned in the network storage cluster system; and
    when the specified resource fails, selecting, according to a preset policy, a target object to take over the specified resource.
  2. The method according to claim 1, wherein monitoring whether a specified resource of a node in the network storage cluster system fails comprises:
    partitioning the resources of all nodes in the network storage cluster system by resource type;
    configuring the resources of the same resource type across all the nodes as one service group; and
    determining whether the specified resource has failed by detecting the state of the specified resource in the service group.
  3. The method according to claim 2, wherein the specified resource is determined to have failed in the following case:
    when the state of the physical network port of the specified resource changes from the active state to the standby state, determining that the specified resource has failed.
  4. The method according to claim 2, wherein selecting, according to the preset policy, the target object to take over the specified resource comprises:
    selecting, in the service group where the specified resource resides, a service unit to take over the specified resource; and
    taking the node where the service unit resides as the target object.
  5. The method according to claim 4, wherein the service unit to take over the specified resource is selected, in the service group where the resource resides, in one of the following ways:
    selecting the service unit from the service group according to a preset priority; or
    selecting the service unit according to the IP-address values of the service units in the service group.
  6. The method according to any one of claims 1 to 5, wherein, after the target object takes over the failed specified resource, the method further comprises:
    saving switchover information of the specified resource, wherein the switchover information comprises at least one of the following: information about the original node where the specified resource resided, and the resource type corresponding to the specified resource; and
    when the original node of the specified resource recovers from its fault, switching the specified resource back to the original node according to the switchover information.
  7. A resource fault handling apparatus, comprising:
    a monitoring module, configured to monitor whether a specified resource of a node in a network storage cluster system fails, wherein the specified resource is the resource corresponding to a specified resource type among resource types pre-partitioned in the network storage cluster system; and
    a selection module, configured to select, when the specified resource fails, a target object to take over the specified resource according to a preset policy.
  8. The apparatus according to claim 7, wherein the monitoring module comprises:
    a partition unit, configured to partition the resources of all nodes in the network storage cluster system by resource type;
    a configuration unit, configured to configure the resources of the same resource type across all the nodes as one service group; and
    a determination unit, configured to determine whether the specified resource has failed by detecting the state of the specified resource in the service group.
  9. The apparatus according to claim 8, wherein the determination unit is configured to determine that the specified resource has failed when the state of the physical network port of the specified resource changes from the active state to the standby state.
  10. The apparatus according to claim 8, wherein the selection module comprises:
    a selection unit, configured to select, in the service group where the specified resource resides, a service unit to take over the specified resource; and
    a determination unit, configured to take the node where the service unit resides as the target object.
PCT/CN2015/072923 2014-10-15 2015-02-12 资源的故障处理方法及装置 WO2016058307A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410545516.4 2014-10-15
CN201410545516.4A CN105515812A (zh) 2014-10-15 2014-10-15 资源的故障处理方法及装置

Publications (1)

Publication Number Publication Date
WO2016058307A1 true WO2016058307A1 (zh) 2016-04-21

Family

ID=55723475

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/072923 WO2016058307A1 (zh) 2014-10-15 2015-02-12 资源的故障处理方法及装置

Country Status (2)

Country Link
CN (1) CN105515812A (zh)
WO (1) WO2016058307A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111176783A (zh) * 2019-11-20 2020-05-19 航天信息股份有限公司 容器治理平台的高可用方法、装置及电子设备
CN111200518A (zh) * 2019-12-25 2020-05-26 曙光信息产业(北京)有限公司 一种基于paxos算法的去中心化HPC计算集群管理方法及系统
CN111552556A (zh) * 2020-03-24 2020-08-18 合肥中科类脑智能技术有限公司 一种gpu集群服务管理系统及方法
CN111628958A (zh) * 2019-07-12 2020-09-04 国铁吉讯科技有限公司 基于线性组网的网络访问方法、装置和系统
CN111865682A (zh) * 2020-07-16 2020-10-30 北京百度网讯科技有限公司 用于处理故障的方法和装置
CN112104727A (zh) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及系统
CN114157585A (zh) * 2021-12-09 2022-03-08 京东科技信息技术有限公司 一种业务资源监测的方法和装置
CN114745557A (zh) * 2022-03-22 2022-07-12 浙江大华技术股份有限公司 容灾操作的执行方法和装置、存储介质及电子装置
CN115134219A (zh) * 2022-06-29 2022-09-30 北京飞讯数码科技有限公司 设备资源管理方法及装置、计算设备和存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107276849A (zh) * 2017-06-15 2017-10-20 北京奇艺世纪科技有限公司 一种集群的性能分析方法及装置
CN108289034B (zh) * 2017-06-21 2019-04-09 新华三大数据技术有限公司 一种故障发现方法和装置
CN107247564B (zh) * 2017-07-17 2021-02-02 苏州浪潮智能科技有限公司 一种数据处理的方法及系统
CN111984463A (zh) * 2020-07-03 2020-11-24 浙江华云信息科技有限公司 一种基于边缘计算系统的微应用管理方法及装置
CN112306813B (zh) * 2020-11-13 2023-03-14 苏州浪潮智能科技有限公司 一种系统告警方法及装置
CN112463535A (zh) * 2020-11-27 2021-03-09 中国工商银行股份有限公司 多集群异常处理方法及装置
CN114039836A (zh) * 2021-11-05 2022-02-11 光大科技有限公司 Exporter采集器的故障处理方法及装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102239665A (zh) * 2010-12-13 2011-11-09 华为技术有限公司 管理业务的方法及装置
CN103167004A (zh) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 云平台主机系统故障修复方法及云平台前端控制服务器
CN103617006A (zh) * 2013-11-28 2014-03-05 曙光信息产业股份有限公司 存储资源的管理方法与装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6654914B1 (en) * 1999-05-28 2003-11-25 Teradyne, Inc. Network fault isolation
US7577090B2 (en) * 2004-02-13 2009-08-18 Alcatel-Lucent Usa Inc. Method and system for providing availability and reliability for a telecommunication network entity
US7428214B2 (en) * 2004-03-04 2008-09-23 Cisco Technology, Inc. Methods and devices for high network availability
CN201039274Y (zh) * 2007-02-09 2008-03-19 宋景明 模块化插板式多功能VoIP网关
CN101369241A (zh) * 2007-09-21 2009-02-18 中国科学院计算技术研究所 一种机群容错系统、装置及方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102239665A (zh) * 2010-12-13 2011-11-09 华为技术有限公司 管理业务的方法及装置
CN103167004A (zh) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 云平台主机系统故障修复方法及云平台前端控制服务器
CN103617006A (zh) * 2013-11-28 2014-03-05 曙光信息产业股份有限公司 存储资源的管理方法与装置

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111628958B (zh) * 2019-07-12 2022-08-05 国铁吉讯科技有限公司 基于线性组网的网络访问方法、装置和系统
CN111628958A (zh) * 2019-07-12 2020-09-04 国铁吉讯科技有限公司 基于线性组网的网络访问方法、装置和系统
CN111176783A (zh) * 2019-11-20 2020-05-19 航天信息股份有限公司 容器治理平台的高可用方法、装置及电子设备
CN111200518A (zh) * 2019-12-25 2020-05-26 曙光信息产业(北京)有限公司 一种基于paxos算法的去中心化HPC计算集群管理方法及系统
CN111200518B (zh) * 2019-12-25 2022-10-18 曙光信息产业(北京)有限公司 一种基于paxos算法的去中心化HPC计算集群管理方法及系统
CN111552556A (zh) * 2020-03-24 2020-08-18 合肥中科类脑智能技术有限公司 一种gpu集群服务管理系统及方法
CN111552556B (zh) * 2020-03-24 2023-06-09 北京中科云脑智能技术有限公司 一种gpu集群服务管理系统及方法
CN111865682A (zh) * 2020-07-16 2020-10-30 北京百度网讯科技有限公司 用于处理故障的方法和装置
CN111865682B (zh) * 2020-07-16 2023-08-08 北京百度网讯科技有限公司 用于处理故障的方法和装置
CN112104727B (zh) * 2020-09-10 2021-11-30 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及系统
CN112104727A (zh) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及系统
CN114157585A (zh) * 2021-12-09 2022-03-08 京东科技信息技术有限公司 一种业务资源监测的方法和装置
CN114745557A (zh) * 2022-03-22 2022-07-12 浙江大华技术股份有限公司 容灾操作的执行方法和装置、存储介质及电子装置
CN115134219A (zh) * 2022-06-29 2022-09-30 北京飞讯数码科技有限公司 设备资源管理方法及装置、计算设备和存储介质

Also Published As

Publication number Publication date
CN105515812A (zh) 2016-04-20

Similar Documents

Publication Publication Date Title
WO2016058307A1 (zh) 资源的故障处理方法及装置
US11307943B2 (en) Disaster recovery deployment method, apparatus, and system
JP6835444B2 (ja) ソフトウェア定義型データセンター、並びにそのためのサービスクラスタスケジューリング方法及びトラフィック監視方法
CN110224871B (zh) 一种Redis集群的高可用方法及装置
US11416359B2 (en) Hot standby method, apparatus, and system
CN100387017C (zh) 构建多机系统高可用的自愈合逻辑环故障检测与容忍方法
CN106664216B (zh) 一种切换vnf的方法和装置
JP2015103092A (ja) 障害回復システム及び障害回復システムの構築方法
CN103346903A (zh) 一种双机备份的方法和装置
US10331472B2 (en) Virtual machine service availability
CN104158707A (zh) 一种检测并处理集群脑裂的方法和装置
WO2021185169A1 (zh) 一种切换方法、装置、设备和存储介质
CN111935244B (zh) 一种业务请求处理系统及超融合一体机
WO2006005251A1 (fr) Procede et systeme de realisation de la fonction de commutation dans un systeme de communication
KR20150124642A (ko) 병렬 연결식 서버시스템의 통신 장애 복구방법
JP7206981B2 (ja) クラスタシステム、その制御方法、サーバ、及びプログラム
CN103297279A (zh) 一种多软件进程系统上软件控制的主备单盘倒换方法
CN105490847A (zh) 一种私有云存储系统中节点故障实时检测及处理方法
US11418382B2 (en) Method of cooperative active-standby failover between logical routers based on health of attached services
JP5285044B2 (ja) クラスタシステム復旧方法及びサーバ及びプログラム
JP2012014674A (ja) 仮想環境における故障復旧方法及びサーバ及びプログラム
CN110677288A (zh) 一种通用于多场景部署的边缘计算系统及方法
CN114124803B (zh) 设备管理方法、装置、电子设备及存储介质
US10516625B2 (en) Network entities on ring networks
CN114268581B (zh) 一种实现网络设备高可用和负载分担的方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15850524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15850524

Country of ref document: EP

Kind code of ref document: A1