WO2017097130A1 - Service node switching method and apparatus for a distributed storage system - Google Patents

Service node switching method and apparatus for a distributed storage system

Info

Publication number
WO2017097130A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
response
service
service node
abnormal
Prior art date
Application number
PCT/CN2016/107422
Other languages
English (en)
French (fr)
Inventor
姚文辉
刘俊峰
黄硕
张海勇
朱家稷
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to US 15/776,938 (published as US10862740B2)
Publication of WO2017097130A1


Classifications

    • H  ELECTRICITY
    • H04  ELECTRIC COMMUNICATION TECHNIQUE
    • H04L  TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00  Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06  Management of faults, events, alarms or notifications
    • H04L 41/0654  Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0668  Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H04L 43/00  Arrangements for monitoring or testing data switching networks
    • H04L 43/08  Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0805  Monitoring or testing based on specific metrics by checking availability
    • H04L 43/0817  Monitoring or testing based on specific metrics by checking functioning
    • H04L 43/10  Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L 67/00  Network arrangements or protocols for supporting network services or applications
    • H04L 67/01  Protocols
    • H04L 67/10  Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001  Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004  Server selection for load balancing
    • H04L 67/1008  Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H04L 67/1034  Reaction to server failures by a load balancer
    • H04L 67/1097  Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L 69/00  Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/40  Network arrangements, protocols or services independent of the application payload for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Definitions

  • The present application relates to the field of Internet technologies, and in particular to a service node switching method for a distributed storage system and a service node switching apparatus for a distributed storage system.
  • In current large-scale distributed storage systems, in order to implement centralized permission authentication and quota control, a centralized metadata management method is mainly adopted; that is, the metadata of all data in the entire system is stored centrally on several metadata service nodes.
  • The availability of the metadata service nodes in such an architecture directly determines the availability of the entire system, so the availability of the metadata service nodes in a distributed storage system is typically improved through redundancy.
  • There are currently two main methods for improving the availability of metadata service nodes: the metadata service (NameNode) can use a high-availability (HA) scheme in which a standby service node (Slave node) switches out the current service node (Primary node) that is in an abnormal state; alternatively, as in the Alibaba Cloud Apsara distributed system and the Pangu file storage system, the Paxos protocol is used to implement service node switching.
  • In both of these methods, switchover of the service node is triggered only when the current service node cannot send heartbeat confirmations to the standby service nodes because of server downtime, service process restart, network disconnection, and the like.
  • Under other abnormal conditions, such as one direction of a duplex network link being broken, some network protocols behaving abnormally, or slow disk response, the lock maintenance mechanism and the heartbeat mechanism cause the standby service nodes to still regard the current service node as working normally, so switchover of the service node is not triggered.
  • In reality, however, a current service node in an abnormal state may time out when responding to users' service requests, fail to provide complete metadata, and be unable to store logs on the shared storage device; the quality of service provided to users has in fact already been affected, yet the current service node switching methods cannot restore a normal and stable metadata service accordingly. The current switching methods therefore recover the metadata service inefficiently, which harms the user experience.
  • In view of the above problems, embodiments of the present application have been made in order to provide a service node switching method for a distributed storage system, and a corresponding service node switching apparatus for a distributed storage system, that overcome the above problems or at least partially solve them.
  • The present application discloses a service node switching method for a distributed storage system, where the service nodes include a current service node and standby service nodes, and the method includes: monitoring the response status of the service nodes to service requests;
  • and, if the response status of the current service node is abnormal, stopping communication between the current service node and the standby service nodes and triggering switchover of the current service node.
  • Optionally, the method further includes: if the response status of a service node is abnormal, adding to the service node an abnormal node identifier marking it as not participating in switchover of the current service node.
  • Optionally, the step of triggering switchover of the current service node includes: triggering selection of at least one service node that does not carry the abnormal node identifier as the new current service node, to replace the current service node whose response status is abnormal.
  • Optionally, the step of monitoring the response status of the service nodes to service requests includes: monitoring, by multiple threads, the response status of the service nodes to service requests.
  • Optionally, the step of monitoring, by multiple threads, the response status of the service nodes includes: acquiring, by a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and taking the interval D1 between T1 and the current time point N1 as the response time; and determining whether the response time is greater than a first preset response time threshold and, if so, determining that the response status of the service node is abnormal.
  • Optionally, the step of monitoring, by multiple threads, the response status of the service nodes includes: determining, by a second check thread, whether the storage unit of the service node carries a storage response timeout identifier; and, if so, taking the interval D2 between the time point T2 at which the identifier was added and the current time point N2 as the response time and, if the response time is greater than a second preset response time threshold, determining that the response status of the service node is abnormal.
  • Optionally, before the step of monitoring, by multiple threads, the response status of the service nodes, the method further includes: acquiring, by a logging thread, the start and end times of a storage unit log write of the service node, and taking the interval between the start and end times as the storage unit response time; and determining whether the storage unit response time is greater than a third preset response time threshold and, if so, adding the storage response timeout identifier to the storage unit and recording the time point T2 at which the identifier is added.
  • Optionally, the method further includes: deleting the storage response timeout identifier if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout identifier.
  • Optionally, before the step of monitoring the response status of the service nodes to service requests, the method further includes: registering the monitoring results for at least one of the service nodes into a monitoring result registration list; the monitoring step is then: polling the registered monitoring results from the start of the monitoring result registration list.
  • Optionally, the method further includes: taking the number of the multiple threads divided by the preset response time threshold as the frequency at which the response status of the service nodes is monitored.
  • The present application further discloses a service node switching apparatus for a distributed storage system, where the service nodes include a current service node and standby service nodes, and the apparatus includes:
  • a service node response status monitoring module configured to monitor a response status of the service node to a service request
  • the current serving node switching triggering module is configured to stop communication between the current serving node and the standby serving node if the response status of the current serving node is abnormal, and trigger a handover process of the current serving node.
  • the device further includes:
  • the abnormal node identifier adding module is configured to add, to the service node, an abnormal node identifier for marking a handover process that does not participate in the current serving node, if the response state of the service node is abnormal.
  • the current serving node handover triggering module includes:
  • the triggering selection sub-module is configured to trigger to select at least one service node that does not carry the abnormal node identifier as a new current serving node, and replace the current serving node with an abnormal response status.
  • the service node response status monitoring module includes:
  • the multi-thread monitoring sub-module is configured to monitor, by the multi-thread, the response status of the service node to the service request.
  • the multi-thread monitoring submodule includes:
  • a first check thread subunit configured to acquire, by using the first check thread, a time point T1 when the service node recently takes out the service request from the service request queue, and use the time interval D1 with the current time point N1 as the response time;
  • the first preset response time threshold determining subunit is configured to determine whether the response time is greater than a first preset response time threshold, and if yes, determine that the response status of the service node is abnormal.
  • the multi-thread monitoring submodule includes:
  • a second check thread subunit configured to determine, by the second check thread, whether the storage unit of the service node carries a storage response timeout identifier; if yes, invoke a second preset response time threshold judgment subunit;
  • a second preset response time threshold judging subunit configured to take the interval D2 between the time point T2 at which the storage response timeout identifier was added and the current time point N2 as the response time and, if the response time is greater than the second preset response time threshold, determine that the response status of the service node is abnormal.
  • the device further includes:
  • a storage unit response time determining module configured to acquire, by the logging thread, a start and end time of the storage unit write log of the service node, and use the time interval of the start and end time as a storage unit response time;
  • a storage response timeout identifier adding module configured to determine whether the storage unit response time is greater than a third preset response time threshold, and if yes, adding the storage response timeout identifier to the storage unit, and correspondingly recording a time point for adding the identifier T2.
  • the device further includes:
  • a storage response timeout identifier deleting module configured to delete the storage response timeout identifier if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout identifier.
  • the device further includes:
  • a monitoring result registration module configured to register monitoring results of at least one of the service nodes to a monitoring result registration list
  • the service node response status monitoring module includes:
  • the monitoring result polling sub-module is configured to poll the registered monitoring result at the beginning of the monitoring result registration list.
  • the device further includes:
  • a monitoring frequency determining module configured to divide a number of the multiple threads by the preset response time threshold as a frequency for monitoring a response state of the serving node.
  • Embodiments of the present application monitor the response status of the service nodes to service requests and, for a current service node whose response status is abnormal, stop the communication between it and the standby service nodes, thereby triggering switchover of the current service node.
  • Through the service node check logic, logical judgments and data statistics are made on the various factors that affect a service node's response status, so that when service timeouts, service unavailability, or service anomalies caused by hardware faults or software defects are encountered, the service nodes switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
  • Embodiments of the present application can monitor one or more kinds of factors affecting the response status of a service node; such multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
  • Embodiments of the present application do not directly reinitialize an abnormal current service node, but instead trigger switchover in the comparatively conservative way of stopping communication with the standby service nodes. If a misdiagnosis occurs, the current service node wrongly judged abnormal still has the chance to serve again as a new current service node, avoiding the negative impact a misdiagnosis would otherwise have on the whole system.
  • Embodiments of the present application add an abnormal node identifier to service nodes whose response status is abnormal, which avoids the problem of an abnormal service node being chosen as the current service node and the switchover thus failing to achieve its purpose. Moreover, with abnormal service nodes excluded, switchover can guarantee the stability of the new current service node, avoid system fluctuations caused by repeated switching, and improve the stability of service recovery.
  • FIG. 1 is a flowchart of the steps of Embodiment 1 of a service node switching method for a distributed storage system according to the present application;
  • FIG. 2 is a flowchart of the steps of Embodiment 2 of a service node switching method for a distributed storage system according to the present application;
  • FIG. 3 is a flowchart of the steps of Embodiment 3 of a service node switching method for a distributed storage system according to the present application;
  • FIG. 4 is a flowchart of the steps of Embodiment 4 of a service node switching method for a distributed storage system according to the present application;
  • FIG. 5 is a structural block diagram of Embodiment 1 of a service node switching apparatus for a distributed storage system according to the present application;
  • FIG. 6 is a structural block diagram of Embodiment 2 of a service node switching apparatus for a distributed storage system according to the present application.
  • Among the more common current ways of improving metadata service node availability, in, for example, the Hadoop distributed file system, the metadata service uses standby service nodes to switch out a current service node in an abnormal state. Specifically, through a distributed lock service, the service node that acquires the distributed lock serves as the current service node, provides the metadata service externally, and stores the generated logs on a shared storage device; the other, standby, service nodes do not provide the metadata service externally but only read the logs from the shared storage device and apply them to memory, keeping their memory in sync with the current service node.
  • The standby service nodes check the state of the lock from time to time; when the lock is released, it indicates that the current service node is in an abnormal state such as server downtime, service process restart, or network disconnection, whereupon a standby service node acquires the distributed lock, is promoted to be the new current service node, and provides the metadata service externally.
  • In the Paxos-based approach, multiple standby service nodes elect a current service node through the Paxos protocol, producing a current service node that provides the metadata service externally; users request the metadata service from the current service node, which stores its logs locally and sends them to all standby service nodes. Each standby service node stores the logs locally and applies them to memory, keeping in sync with the current service node.
  • While the current service node works normally it sends heartbeat confirmation information to the standby service nodes, which confirm its liveness through the heartbeat mechanism. If the current service node suffers server downtime, a service process restart, network disconnection, or a similar abnormal condition, it can no longer send heartbeat confirmations; the standby service nodes then initiate switchover of the current service node and elect a new current service node from among the standby service nodes to provide the metadata service externally.
  • In both approaches, switchover of the service node depends on a serious fault that renders the current service node completely inoperable; other faults, such as those that merely make the current service node respond slowly, do not trigger switchover, even though a current service node in such an abnormal state has already degraded the quality of service provided to users.
  • The current service node switching methods therefore recover the metadata service inefficiently, which harms the user experience. Moreover, with the current switching methods, even when switchover is triggered it is possible to switch to a service node that is itself already in an abnormal state, so the purpose of the switchover is not achieved and the efficiency of metadata service recovery suffers. To solve these problems, several embodiments of a service node switching method are proposed below.
  • Referring to FIG. 1, a flowchart of the steps of Embodiment 1 of a service node switching method for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
  • Step 101: Monitor the response status of the service nodes to service requests.
  • Metadata, also known as mediation data or relay data, is data about data: mainly information describing the properties of data, used to support functions such as indicating storage location, historical data, resource searching, and file recording.
  • In an embodiment of the present application, a plurality of check threads may be set up in the system to check whether the response status of a service node is abnormal, for example a first check thread that focuses on service request queue response time anomalies and/or a second check thread that focuses on storage unit response time anomalies.
  • For example, the response time threshold may be preset to 10 seconds: if the service node takes more than 10 seconds to respond to some service request in the service request queue, or the response time of the service node's storage unit exceeds 10 seconds, the response status of the service node can be considered abnormal, because responses to service requests are keeping users waiting too long or even failing outright, which already harms the user experience.
  • In a preferred embodiment of the present application, the response status of the service nodes to service requests may be monitored by multiple threads. Because the response status of a service node can be affected by many factors in practice, a person skilled in the art may set up multiple check threads according to the actual situation to monitor the response status, for example a check thread that focuses on whether the storage unit is near full capacity.
  • It should be noted that a check thread used to monitor a service node need not have the ability to execute recovery logic; it only performs the logical judgments and data statistics needed to check the response status. Check threads should be kept as lightweight as possible and should avoid computationally heavy or long-running work: for example, they should not perform RPC (Remote Procedure Call) operations or hold locks for long periods. Otherwise, subsequent service node switching could be severely delayed or become impossible to carry out effectively, ultimately causing the whole checking mechanism to lose its intended effect.
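The following minimal Java sketch illustrates the lightweight check-thread contract just described; the names (CheckTask, MonitorRegistry) are illustrative assumptions, not taken from the patent. Each check does only logical judgment and bookkeeping, with no RPC and no long-held locks, and registers itself in the list that the checkpoint executor later polls.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// A check task performs only logical judgment and data statistics; it must
// stay lightweight: no RPC calls and no long-held locks.
interface CheckTask {
    boolean responseStateIsAbnormal();
}

// Registration list that check threads populate at startup and that the
// checkpoint executor polls from the head.
final class MonitorRegistry {
    static final List<CheckTask> RESULTS = new CopyOnWriteArrayList<>();

    static void register(CheckTask task) {
        RESULTS.add(task);
    }
}
```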
  • Step 102: If the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes, and trigger switchover of the current service node.
  • By monitoring the response status of the service nodes, it can be determined whether the response status of the current service node or of a standby service node is abnormal, and corresponding actions can be taken for the different monitoring results of different service nodes. For a current service node whose response status is abnormal, communication with the standby service nodes can be stopped; the standby service nodes then consider the current service node to have failed in an abnormal state, and switchover of the current service node needs to be initiated.
  • For example, when a standby service node can no longer communicate normally with the current service node, it can consider the current service node to be in an abnormal state, thereby triggering a new current service node election; the newly elected current service node replaces the abnormal one, completing the switchover of the current service node. The current service node election can be implemented through the Paxos protocol.
  • The switchover of the current service node may also be triggered in other ways, for example through a distributed lock service: if the standby service nodes consider the current service node to have failed in an abnormal state, the current service node is triggered to release the lock, multiple standby service nodes compete to acquire it, and the standby service node that acquires the lock replaces the abnormal current service node, completing the switchover.
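As a hedged sketch of this lock-based variant, the fragment below assumes a generic DistributedLock interface standing in for a real lock service; its two methods are assumptions for illustration only. The abnormal primary releases the lock, and only standbys without the abnormal-node flag compete to acquire it.

```java
// Stand-in for a real distributed lock service; these method names are
// assumptions for illustration only.
interface DistributedLock {
    boolean tryAcquire(String nodeId);
    void release(String nodeId);
}

final class SwitchoverTrigger {
    private final DistributedLock lock;

    SwitchoverTrigger(DistributedLock lock) {
        this.lock = lock;
    }

    // Called on the current service node once its response status is judged
    // abnormal: releasing the lock is what signals the standbys to take over.
    void demoteSelf(String selfId) {
        lock.release(selfId);
    }

    // Called on each standby; nodes carrying the abnormal-node identifier do
    // not attempt to grab the lock at all.
    boolean tryPromote(String selfId, boolean carriesAbnormalFlag) {
        return !carriesAbnormalFlag && lock.tryAcquire(selfId);
    }
}
```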
  • In addition, an abnormal node identifier may be added to an abnormal service node; a service node carrying this identifier is not switched in as the new current service node during switchover, which prevents an abnormal service node from becoming the new current service node and defeating the purpose of service recovery.
  • In a preferred embodiment of the present application, the monitoring results for at least one service node may be registered into a monitoring result registration list, and the registered monitoring results are polled from the start of that list.
  • Specifically, a checkpoint execution module can be set up to take the corresponding action for each monitoring result. Each check thread can generate monitoring results and register them, at system startup, in the checkpoint execution module's monitoring result registration list. The checkpoint execution module may be a system background thread that polls the monitoring results one by one from the start of the registration list and processes each according to its value.
  • For example, in one monitoring result the first check thread determines, from the service request processing response time, that the response status of the current service node is abnormal; the checkpoint execution module can then stop sending heartbeat confirmation information to the standby service nodes and add an abnormal node identifier to the current service node. In another monitoring result the second check thread determines, from a storage unit read/write log timeout, that the response status of a standby service node is abnormal; the checkpoint execution module then adds the abnormal node identifier to that standby service node.
  • It should be noted that the checkpoint execution module need not care how the logic of each check thread is implemented, that is, how a check thread actually decides whether a service node is abnormal; it only cares whether the monitoring result says the node's response status is abnormal. Specifically, whether the response status is abnormal can be represented as True or False, and a check thread may register just that True or False value as its monitoring result in the checkpoint execution module's registration list.
  • In a preferred embodiment of the present application, the number of the multiple check threads divided by the preset response time threshold may be used as the frequency at which the response status of the service nodes is monitored. In principle, the polling interval must not exceed the preset response time threshold used by any check thread to judge whether the response status is abnormal; for example, with a preset response time threshold of 10 seconds, the polling interval can be set to 1 second. Specifically, the interval can be obtained by dividing the threshold by the number of check threads: with 10 check threads and a 10-second preset response time threshold, the monitoring frequency is one result per second, that is, the checkpoint execution module takes one monitoring result from the registration list every second and processes it accordingly.
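A small sketch of the checkpoint executor's polling loop, reusing the CheckTask and MonitorRegistry names from the earlier sketch; the interval is derived as threshold divided by thread count, matching the 10-thread, 10-second example above. The class name and the reaction placeholder are assumptions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class CheckpointExecutor {
    static void start(long thresholdMillis, int checkThreadCount) {
        // e.g. 10 000 ms / 10 check threads = one result handled per second
        long intervalMillis = thresholdMillis / checkThreadCount;
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
            // The executor only looks at the true/false outcome of each result.
            for (CheckTask task : MonitorRegistry.RESULTS) {
                if (task.responseStateIsAbnormal()) {
                    // react here: stop heartbeats, add the abnormal-node flag, etc.
                }
            }
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }
}
```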
  • The method provided by the embodiments of the present application can be applied, according to the actual situation, to various distributed file systems and computing and storage platforms, for example HDFS (Hadoop Distributed File System), the ODPS computing platform (Open Data Processing Service), the OSS storage platform (Object Storage Service), the OTS storage platform (Open Table Service, a structured data service), the ECS computing platform (Elastic Compute Service), and so on.
  • Embodiments of the present application monitor the response status of the service nodes to service requests and, for a current service node whose response status is abnormal, stop the communication between it and the standby service nodes, thereby triggering switchover of the current service node. Through the service node check logic, logical judgments and data statistics are made on the various factors that affect a service node's response status, so that when service timeouts, service unavailability, or service anomalies caused by hardware faults or software defects are encountered, the service nodes switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
  • Embodiments of the present application can monitor one or more kinds of factors affecting the response status of a service node; such multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
  • Embodiments of the present application do not directly reinitialize an abnormal current service node, but instead trigger switchover in the comparatively conservative way of stopping communication with the standby service nodes. If a misdiagnosis occurs, the current service node wrongly judged abnormal still has the chance to serve again as a new current service node, avoiding the negative impact a misdiagnosis would otherwise have on the whole system.
  • Referring to FIG. 2, a flowchart of the steps of Embodiment 2 of a service node switching method for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
  • Step 201: Acquire, by a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time.
  • The first check thread may be a check thread that focuses on whether the service request queue response time is abnormal. After a user submits a service request, the request is first placed in the service request queue, where it waits to be processed one by one by the current service node. Each time the current service node takes a service request from the queue, it can record the time point T1. The first check thread may periodically check the service request queue: it obtains the most recently recorded time point T1 and takes the interval D1 between the current time point N1 and T1 as the response time of the current service node.
  • Step 202: Determine whether the response time is greater than a first preset response time threshold; if so, determine that the response status of the service node is abnormal.
  • Specifically, the response time can be compared with the first preset response time threshold. If the response time is greater than the first preset response time threshold, the current service node is blocked while processing users' service requests and users are being kept waiting for a long time; regardless of whether the current service node is normal in other respects, its response status may then be considered abnormal.
  • For example, the first preset response time threshold may be set to 10 seconds, that is, if a service request submitted by a user has not been answered within 10 seconds, the response status of the current service node can be considered abnormal. Of course, the first preset response time threshold can be set according to the actual situation, which is not limited by the embodiments of the present application.
  • The first check thread may register its monitoring result (response status abnormal or normal) in the checkpoint execution module's monitoring result registration list, and the checkpoint execution module takes the corresponding action according to the result.
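A runnable sketch of the first check thread's judgment, implementing the CheckTask interface from the earlier sketch: D1 = N1 - T1 is compared against the first preset threshold. Field and class names are illustrative; the serving path is assumed to update the dequeue timestamp.

```java
import java.util.concurrent.atomic.AtomicLong;

final class QueueResponseCheck implements CheckTask {
    // Updated by the serving path each time a request is taken off the queue
    // (this is the recorded time point T1). Assumed wiring, for illustration.
    static final AtomicLong LAST_DEQUEUE_MILLIS =
            new AtomicLong(System.currentTimeMillis());

    private final long firstThresholdMillis; // e.g. 10_000 (10 seconds)

    QueueResponseCheck(long firstThresholdMillis) {
        this.firstThresholdMillis = firstThresholdMillis;
    }

    @Override
    public boolean responseStateIsAbnormal() {
        long t1 = LAST_DEQUEUE_MILLIS.get();       // T1: most recent dequeue
        long d1 = System.currentTimeMillis() - t1; // D1 = N1 - T1
        return d1 > firstThresholdMillis;          // abnormal if D1 exceeds threshold
    }
}
```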
  • Step 203: If the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes, and trigger switchover of the current service node.
  • Step 204: If the response status of a service node is abnormal, add to it an abnormal node identifier marking it as not participating in switchover of the current service node.
  • Specifically, for a current service node whose response status is abnormal, the checkpoint execution module can stop its communication with the standby service nodes; in addition, an abnormal node identifier can be added to any abnormal service node, and a service node carrying the identifier is not switched in as the new current service node during switchover.
  • In a preferred embodiment of the present application, the step of triggering switchover of the current service node may include: triggering selection of at least one service node that does not carry the abnormal node identifier as the new current service node, to replace the current service node whose response status is abnormal.
  • In a concrete implementation, a service node carrying the abnormal node identifier does not participate in the election: when the standby service nodes trigger switchover of the current service node, a node carrying the identifier takes no part in the election and therefore cannot be chosen as the new current service node. If switchover is done through the distributed lock service, a service node carrying the abnormal node identifier does not attempt to grab the lock; only normal service nodes compete for it.
  • Embodiments of the present application add an abnormal node identifier to service nodes whose response status is abnormal, which avoids the problem of an abnormal service node being chosen as the current service node and the switchover thus failing to achieve its purpose. Moreover, with abnormal service nodes excluded, switchover can guarantee the stability of the new current service node, avoid system fluctuations caused by repeated switching, and improve the stability of service recovery.
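The exclusion rule can be expressed compactly; the sketch below (ServiceNode and its flag are illustrative assumptions) simply filters flagged nodes out of the candidate set before a new primary is chosen.

```java
import java.util.List;
import java.util.Optional;

record ServiceNode(String id, boolean abnormalFlag) {}

final class PrimaryElection {
    // Pick any standby that does not carry the abnormal-node identifier;
    // flagged nodes never take part in the election or the lock grab.
    static Optional<ServiceNode> electNewPrimary(List<ServiceNode> standbys) {
        return standbys.stream()
                       .filter(n -> !n.abnormalFlag())
                       .findFirst();
    }
}
```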
  • Referring to FIG. 3, a flowchart of the steps of Embodiment 3 of a service node switching method for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
  • Step 301: Acquire, by a logging thread, the start and end times of a storage unit log write of the service node, and take the interval between the start and end times as the storage unit response time.
  • Specifically, the logging thread records the time point at which the service node starts writing a log and the time point at which the write completes, and takes the interval between the start and end time points as the storage unit response time.
  • Step 302: Determine whether the storage unit response time is greater than a third preset response time threshold; if so, add the storage response timeout identifier to the storage unit and record the time point T2 at which the identifier is added.
  • If the storage unit response time is greater than the third preset response time threshold, the storage unit of the service node is abnormal; the storage response timeout identifier can then be added to the storage unit and the time point T2 of its addition recorded. If the storage unit of the service node already carries the storage response timeout identifier, it does not need to be added again.
  • Step 303: If the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout identifier, delete the storage response timeout identifier.
  • What the check thread needs to catch is the abnormal situation in which the storage unit responds slowly continuously; a single slow response may be a matter of chance and can be ignored for the moment to avoid false detection. Therefore, if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout identifier, the identifier can be deleted.
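The add-and-clear bookkeeping of steps 302 and 303 might look like the following sketch, where the identifier is a volatile field written only by the logging thread; the names and this representation are assumptions.

```java
final class StorageTimeoutFlag {
    volatile boolean set;        // the storage response timeout identifier
    volatile long addedAtMillis; // T2, recorded when the identifier is added

    // Called by the logging thread after every storage-unit log write.
    void onLogWrite(long startMillis, long endMillis, long thirdThresholdMillis) {
        long responseTime = endMillis - startMillis; // storage unit response time
        if (responseTime > thirdThresholdMillis) {
            if (!set) {          // add the identifier once and record T2
                addedAtMillis = System.currentTimeMillis();
                set = true;
            }
        } else if (set) {
            set = false;         // step 303: one fast write clears the identifier
        }
    }
}
```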
  • Step 304 Determine, by the second check thread, whether the storage unit of the serving node carries a storage response timeout identifier.
  • Step 305: If so, take the interval D2 between the time point T2 at which the storage response timeout identifier was added and the current time point N2 as the response time; if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
  • the second check thread may determine whether the storage unit carries the storage response timeout identifier, and perform corresponding processing according to the judgment result.
  • Specifically, the time point T2 at which the identifier was added is obtained from the logging thread and subtracted from the current time point N2, and the resulting interval D2 serves as the response time of the service node. If the response time is greater than the second preset response time threshold, the service node is taking too long to record logs to the storage unit, which affects its response time to service requests; regardless of whether the current service node is normal in other respects, its response status may then be considered abnormal.
  • The second check thread registers its monitoring result (response status abnormal or normal) in the checkpoint execution module's monitoring result registration list, and the checkpoint execution module takes the corresponding action according to the result. For example, if the second check thread observes that the storage unit's response time has not dropped back below 30 milliseconds within 30 seconds, it may determine that the storage unit of the service node is abnormal and therefore that the service node's response status is abnormal.
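Building on the StorageTimeoutFlag and CheckTask names from the sketches above, the second check thread's judgment could look like this: if the identifier is present, D2 = N2 - T2 is treated as the response time and compared against the second preset threshold.

```java
final class StorageResponseCheck implements CheckTask {
    private final StorageTimeoutFlag flag;
    private final long secondThresholdMillis; // e.g. 30_000 (30 seconds)

    StorageResponseCheck(StorageTimeoutFlag flag, long secondThresholdMillis) {
        this.flag = flag;
        this.secondThresholdMillis = secondThresholdMillis;
    }

    @Override
    public boolean responseStateIsAbnormal() {
        if (!flag.set) {
            return false;  // no timeout identifier: nothing to judge
        }
        long d2 = System.currentTimeMillis() - flag.addedAtMillis; // D2 = N2 - T2
        return d2 > secondThresholdMillis; // abnormal if the identifier persisted
    }
}
```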
  • Step 306: If the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes, and trigger switchover of the current service node.
  • Step 307: If the response status of a service node is abnormal, add to it an abnormal node identifier marking it as not participating in switchover of the current service node.
  • Specifically, for a current service node whose response status is abnormal, the checkpoint execution module can stop its communication with the standby service nodes; in addition, an abnormal node identifier can be added to any abnormal service node, and a service node carrying the identifier is not switched in as the new current service node during switchover.
  • It should be noted that steps 301 to 303 can be executed cyclically: the response time of storage unit log writes is repeatedly measured and compared to determine whether the storage unit is continuously responding slowly, and the storage response timeout identifier on the storage unit is updated accordingly, so that the second check thread can act on the identifier.
  • Referring to FIG. 4, a flowchart of the steps of Embodiment 4 of a service node switching method for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
  • Step 401: Monitor, by multiple threads, the response status of the service nodes to service requests.
  • In the embodiment of the present application, whether the response status of a service node to service requests is abnormal can be monitored by several check threads, each focusing on a different aspect of the service node. Because the response status can be affected by many factors in practice, whether a single factor or a combination of factors degrades the response status, it can be monitored in a targeted way; to monitor the service nodes more comprehensively and flexibly, they can therefore be monitored by a combination of multiple threads. The number of threads and the specific combination of threads can, of course, be decided by those skilled in the art according to the actual situation.
  • In a preferred embodiment of the present application, step 401 may specifically include the following sub-steps:
  • Sub-step S11: Acquire, by the first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time.
  • Sub-step S12: Determine whether the response time is greater than a first preset response time threshold; if so, determine that the response status of the service node is abnormal.
  • Sub-step S13: Determine, by the second check thread, whether the storage unit of the service node carries a storage response timeout identifier.
  • Sub-step S14: If so, take the interval D2 between the time point T2 at which the identifier was added and the current time point N2 as the response time; if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
  • The first check thread may be a thread that focuses on whether the service request queue response time is abnormal; monitoring the service node with it catches anomalies caused by slow responses to the service request queue. The second check thread may be a thread that focuses on whether the storage unit response time is abnormal; monitoring the service node with it catches anomalies caused by excessively slow storage unit log writes. It should be noted that these sub-steps are not performed sequentially: the first check thread and the second check thread can monitor simultaneously, as in the combined sketch below.
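Wiring the pieces together (all names reused from the earlier sketches, thresholds chosen only for illustration), both check threads register once and the checkpoint executor polls them concurrently, so either dimension can trigger switchover on its own:

```java
public final class MonitorBootstrap {
    public static void main(String[] args) {
        StorageTimeoutFlag flag = new StorageTimeoutFlag();
        // First check thread: service request queue response time.
        MonitorRegistry.register(new QueueResponseCheck(10_000));
        // Second check thread: storage-unit timeout identifier persistence.
        MonitorRegistry.register(new StorageResponseCheck(flag, 30_000));
        // 10-second threshold and 2 check threads => one poll every 5 seconds.
        CheckpointExecutor.start(10_000, 2);
    }
}
```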
  • Step 402: If the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes, and trigger switchover of the current service node.
  • Step 403: If the response status of a service node is abnormal, add to it an abnormal node identifier marking it as not participating in switchover of the current service node.
  • With the first check thread and the second check thread monitoring the response status of the service nodes simultaneously, both the service request queue response time and the storage unit log-write response time are watched at once; whenever a problem appears in either dimension, switchover of the service node can be triggered and the abnormal node identifier added in a targeted way. Such multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
  • Referring to FIG. 5, a structural block diagram of Embodiment 1 of a service node switching apparatus for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the apparatus may specifically include the following modules:
  • the service node response status monitoring module 501 is configured to monitor the response status of the service node to the service request.
  • the current serving node switching triggering module 502 is configured to stop communication between the current serving node and the standby serving node if the response status of the current serving node is abnormal, and trigger a handover process of the current serving node.
  • Embodiments of the present application monitor the response status of the service nodes to service requests and, for a current service node whose response status is abnormal, stop the communication between it and the standby service nodes, thereby triggering switchover of the current service node. Through the service node check logic, logical judgments and data statistics are made on the various factors that affect a service node's response status, so that when service timeouts, service unavailability, or service anomalies caused by hardware faults or software defects are encountered, the service nodes switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
  • Embodiments of the present application can monitor one or more kinds of factors affecting the response status of a service node; such multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
  • Embodiments of the present application do not directly reinitialize an abnormal current service node, but instead trigger switchover in the comparatively conservative way of stopping communication with the standby service nodes. If a misdiagnosis occurs, the current service node wrongly judged abnormal still has the chance to serve again as a new current service node, avoiding the negative impact a misdiagnosis would otherwise have on the whole system.
  • Referring to FIG. 6, a structural block diagram of Embodiment 2 of a service node switching apparatus for a distributed storage system according to the present application is shown. The service nodes include a current service node and standby service nodes, and the apparatus may specifically include the following modules:
  • the monitoring result registration module 601 is configured to register the monitoring result of the at least one of the service nodes to the monitoring result registration list.
  • the service node response status monitoring module 602 is configured to monitor the response status of the service node to the service request.
  • the current serving node switching triggering module 603 is configured to stop communication between the current serving node and the standby serving node if the response status of the current serving node is abnormal, and trigger a handover process of the current serving node.
  • the abnormal node identifier adding module 604 is configured to add, to the service node, an abnormal node identifier for marking a handover process that does not participate in the current serving node if the response state of the certain service node is abnormal.
  • the monitoring frequency determining module 605 is configured to divide the number of the multi-threads by the preset response time threshold as a frequency for monitoring the response status of the serving node.
  • the apparatus may further include:
  • the storage unit response time determining module is configured to acquire a start and end time of the storage unit write log of the service node by using a logging thread, and use the time interval of the start and end time as a storage unit response time.
  • a storage response timeout identifier adding module configured to determine whether the storage unit response time is greater than a third preset response time threshold, and if yes, adding the storage response timeout identifier to the storage unit, and correspondingly recording a time point for adding the identifier T2.
  • a storage response timeout identifier deleting module configured to delete the storage response timeout identifier if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout identifier.
  • the current serving node switching trigger module 603 may include the following sub-modules:
  • the triggering selection sub-module is configured to trigger to select at least one service node that does not carry the abnormal node identifier as a new current serving node, and replace the current serving node with an abnormal response status.
  • the service node response status monitoring module 602 may include the following sub-modules:
  • the multi-thread monitoring sub-module is configured to monitor, by the multi-thread, the response status of the service node to the service request.
  • the multi-thread monitoring sub-module may include the following sub-units:
  • the first check thread subunit is configured to acquire, by using the first check thread, a time point T1 when the service node recently takes out the service request from the service request queue, and use the time interval D1 with the current time point N1 as the response time.
  • the first preset response time threshold determining subunit is configured to determine whether the response time is greater than a first preset response time threshold, and if yes, determine that the response status of the service node is abnormal.
  • the multi-thread monitoring sub-module may include the following sub-units:
  • a second check thread subunit configured to determine, by the second check thread, whether the storage unit of the service node carries a storage response timeout identifier; if yes, invoke a second preset response time threshold to determine the subunit.
  • a second preset response time threshold judging subunit configured to take the interval D2 between the time point T2 at which the storage response timeout identifier was added and the current time point N2 as the response time and, if the response time is greater than the second preset response time threshold, determine that the response status of the service node is abnormal.
  • the service node response status monitoring module 602 may include the following sub-modules:
  • the monitoring result polling sub-module is configured to poll the registered monitoring result at the beginning of the monitoring result registration list.
  • Embodiments of the present application add an abnormal node identifier to service nodes whose response status is abnormal, which avoids the problem of an abnormal service node being chosen as the current service node and the switchover thus failing to achieve its purpose. Moreover, with abnormal service nodes excluded, switchover can guarantee the stability of the new current service node, avoid system fluctuations caused by repeated switching, and improve the stability of service recovery.
  • Since the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
  • It will be appreciated by those skilled in the art that the embodiments of the present application can be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in computer readable media, such as read-only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media.
  • Information storage can be implemented by any method or technology; the information can be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, so that the instructions stored in the computer readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Hardware Redundancy (AREA)
  • Debugging And Monitoring (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)

Abstract

Embodiments of the present application provide a service node switching method and apparatus for a distributed storage system, where the service nodes include a current service node and standby service nodes. The method includes: monitoring the response status of the service nodes to service requests; and, if the response status of the current service node is abnormal, stopping communication between the current service node and the standby service nodes and triggering switchover of the current service node. Through service node check logic, logical judgments and data statistics are made on the various factors that affect a service node's response status, so that when service timeouts, service unavailability, or service anomalies caused by hardware faults or software defects are encountered, the service nodes switch over and recover autonomously, enhancing service availability.

Description

Service node switching method and apparatus for a distributed storage system
This application claims priority to Chinese Patent Application No. 201510897877.X, filed on December 8, 2015 and entitled "Service node switching method and apparatus for a distributed storage system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular to a service node switching method for a distributed storage system and a service node switching apparatus for a distributed storage system.
Background
In current large-scale distributed storage systems, in order to implement centralized permission authentication and quota control, a centralized metadata management method is mainly adopted; that is, the metadata of all data in the entire system is stored centrally on several metadata service nodes.
In such an architecture the availability of the metadata service nodes directly determines the availability of the entire system, so distributed storage systems usually improve metadata service node availability through redundancy. There are currently two main approaches: for example, the metadata service (NameNode) uses an HA (High Availability) scheme in which a standby service node (Slave node) switches out the current service node (Primary node) that is in an abnormal state; alternatively, as in the Alibaba Cloud Apsara distributed system and the Pangu file storage system, the Paxos protocol is used to implement service node switching.
In both of these switching methods, switchover of the service node is triggered only when the current service node cannot send heartbeat confirmations to the standby service nodes because of server downtime, service process restart, network disconnection, and the like. Under other abnormal conditions, such as one direction of a duplex network link being broken, some network protocols behaving abnormally, or slow disk response, the lock maintenance mechanism and the heartbeat mechanism cause the standby service nodes to still regard the current service node as working normally, so switchover of the service node is not triggered.
In reality, however, a current service node in an abnormal state may time out when responding to users' service requests, fail to provide complete metadata, and be unable to store logs on the shared storage device; the quality of service provided to users has in fact already been affected, yet the current service node switching methods cannot restore a normal and stable metadata service accordingly. The current switching methods therefore recover the metadata service inefficiently, which harms the user experience.
Summary
In view of the above problems, embodiments of the present application are proposed to provide a method for switching service nodes in a distributed storage system, and a corresponding apparatus for switching service nodes in a distributed storage system, that overcome or at least partially solve the above problems.
To solve the above problems, the present application discloses a method for switching service nodes in a distributed storage system, where the service nodes include a current service node and standby service nodes, and the method includes:
monitoring the response status of the service nodes to service requests;
if the response status of the current service node is abnormal, stopping communication between the current service node and the standby service nodes, and triggering a switchover of the current service node.
Optionally, the method further includes:
if the response status of a service node is abnormal, adding to that service node an abnormal node flag marking it as not participating in the switchover of the current service node.
Optionally, the step of triggering a switchover of the current service node includes:
triggering the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
Optionally, the step of monitoring the response status of the service nodes to service requests includes:
monitoring the response status of the service nodes to service requests through multiple threads.
Optionally, the step of monitoring the response status of the service nodes to service requests through multiple threads includes:
obtaining, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and taking the interval D1 between T1 and the current time point N1 as the response time;
determining whether the response time is greater than a first preset response time threshold, and if so, determining that the response status of the service node is abnormal.
Optionally, the step of monitoring the response status of the service nodes to service requests through multiple threads includes:
determining, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag;
if so, taking the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time, and, if the response time is greater than a second preset response time threshold, determining that the response status of the service node is abnormal.
Optionally, before the step of monitoring the response status of the service nodes to service requests through multiple threads, the method further includes:
obtaining, through a logging thread, the start and end times of a log write by the storage unit of the service node, and taking the interval between the start and end times as the storage unit response time;
determining whether the storage unit response time is greater than a third preset response time threshold, and if so, adding the storage response timeout flag to the storage unit and recording the time point T2 at which the flag was added.
Optionally, the method further includes:
if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, deleting the storage response timeout flag.
Optionally, before the step of monitoring the response status of the service nodes to service requests, the method further includes:
registering the monitoring results for at least one of the service nodes in a monitoring result registration list;
the step of monitoring the response status of the service nodes to service requests being:
polling the registered monitoring results from the start of the monitoring result registration list.
Optionally, the method further includes:
taking the number of the multiple threads divided by the preset response time threshold as the frequency of monitoring the response status of the service nodes.
To solve the above problems, the present application further discloses an apparatus for switching service nodes in a distributed storage system, where the service nodes include a current service node and standby service nodes, and the apparatus includes:
a service node response status monitoring module, configured to monitor the response status of the service nodes to service requests;
a current service node switchover triggering module, configured to, if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
Optionally, the apparatus further includes:
an abnormal node flag adding module, configured to, if the response status of a service node is abnormal, add to that service node an abnormal node flag marking it as not participating in the switchover of the current service node.
Optionally, the current service node switchover triggering module includes:
a selection triggering submodule, configured to trigger the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
Optionally, the service node response status monitoring module includes:
a multi-thread monitoring submodule, configured to monitor the response status of the service nodes to service requests through multiple threads.
Optionally, the multi-thread monitoring submodule includes:
a first check thread subunit, configured to obtain, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time;
a first preset response time threshold judging subunit, configured to determine whether the response time is greater than a first preset response time threshold, and if so, determine that the response status of the service node is abnormal.
Optionally, the multi-thread monitoring submodule includes:
a second check thread subunit, configured to determine, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag, and if so, invoke a second preset response time threshold judging subunit;
the second preset response time threshold judging subunit, configured to take the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time, and, if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
Optionally, the apparatus further includes:
a storage unit response time determining module, configured to obtain, through a logging thread, the start and end times of a log write by the storage unit of the service node, and take the interval between the start and end times as the storage unit response time;
a storage response timeout flag adding module, configured to determine whether the storage unit response time is greater than a third preset response time threshold, and if so, add the storage response timeout flag to the storage unit and record the time point T2 at which the flag was added.
Optionally, the apparatus further includes:
a storage response timeout flag deleting module, configured to, if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, delete the storage response timeout flag.
Optionally, the apparatus further includes:
a monitoring result registration module, configured to register the monitoring results for at least one of the service nodes in a monitoring result registration list;
the service node response status monitoring module including:
a monitoring result polling submodule, configured to poll the registered monitoring results from the start of the monitoring result registration list.
Optionally, the apparatus further includes:
a monitoring frequency determining module, configured to take the number of the multiple threads divided by the preset response time threshold as the frequency of monitoring the response status of the service nodes.
Embodiments of the present application include the following advantages:
Embodiments of the present application monitor the response status of service nodes to service requests and, for a current service node whose response status is abnormal, stop its communication with the standby service nodes, thereby triggering a switchover of the current service node. Through service node check logic, logical judgments and statistics are made on the various factors that affect a service node's response status, so that when hardware faults or software defects cause service timeouts, service unavailability, or service exceptions, service nodes can switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
Furthermore, embodiments of the present application can monitor one or more of the factors that affect the response status of service nodes; this multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
Further, embodiments of the present application do not directly re-initialize an abnormal current service node; instead, the relatively conservative approach of stopping communication with the standby service nodes is used to trigger the switchover. In case of a misdiagnosis, a current service node mistakenly judged to have an abnormal response status still has the opportunity to become the new current service node again and continue providing service, which prevents misdiagnoses from negatively affecting the whole system.
Further, embodiments of the present application add an abnormal node flag to service nodes whose response status is abnormal, which prevents a service node with an abnormal response status from being selected as the current service node and thus defeating the purpose of the switchover. Moreover, with abnormal service nodes excluded, the switchover guarantees the stability of the new current service node, avoids the system fluctuation caused by repeated switchovers, and improves the stability of service recovery.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of Embodiment 1 of a method for switching service nodes in a distributed storage system according to the present application;
FIG. 2 is a flowchart of the steps of Embodiment 2 of a method for switching service nodes in a distributed storage system according to the present application;
FIG. 3 is a flowchart of the steps of Embodiment 3 of a method for switching service nodes in a distributed storage system according to the present application;
FIG. 4 is a flowchart of the steps of Embodiment 4 of a method for switching service nodes in a distributed storage system according to the present application;
FIG. 5 is a structural block diagram of Embodiment 1 of an apparatus for switching service nodes in a distributed storage system according to the present application;
FIG. 6 is a structural block diagram of Embodiment 2 of an apparatus for switching service nodes in a distributed storage system according to the present application.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described below in further detail with reference to the accompanying drawings and specific embodiments.
In one of the more common current approaches to improving the availability of metadata service nodes, for example in the Hadoop distributed file system, the metadata service uses a standby service node to switch out a current service node in an abnormal state. Specifically, through a distributed lock service, the service node that acquires the distributed lock serves as the current service node, provides the metadata service externally, and stores the generated logs on a shared storage device; the other, standby, service nodes do not provide the metadata service externally, but only read the logs from the shared storage device and apply them to memory, keeping their memory consistent with the current service node. The standby service nodes check the lock status from time to time; when the lock is released, it indicates that the current service node is in an abnormal state such as a server crash, a service process restart, or a network disconnection, and a standby service node then acquires the distributed lock, is promoted to the new current service node, and provides the metadata service externally.
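The lock-watching loop described above can be sketched roughly as follows. This is a minimal illustration only, not the patent's implementation: the DistributedLock-style client and its is_held/try_acquire methods, the shared_log reader, and the StandbyNode name are all assumed stand-ins for a real lock service such as ZooKeeper or Chubby.

    import time

    class StandbyNode:
        """A standby metadata node that polls a distributed lock and promotes
        itself to current (primary) service node when the lock is released."""

        def __init__(self, lock, shared_log):
            self.lock = lock              # assumed distributed-lock client
            self.shared_log = shared_log  # shared storage holding the write-ahead log
            self.is_current = False

        def apply_to_memory(self, entry):
            pass  # apply the log entry to the in-memory metadata image

        def replay_new_log_entries(self):
            # Keep in-memory metadata in sync with the current service node.
            for entry in self.shared_log.read_new_entries():
                self.apply_to_memory(entry)

        def run(self, poll_interval_s=1.0):
            while not self.is_current:
                self.replay_new_log_entries()
                # Lock released => current node crashed, restarted, or lost network.
                if not self.lock.is_held() and self.lock.try_acquire():
                    self.is_current = True  # promoted: start serving metadata
                time.sleep(poll_interval_s)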
In another common approach to improving availability, multiple standby service nodes elect a current service node through the Paxos protocol, producing one current service node that provides the metadata service externally. A user requests the metadata service from the current service node; after responding, the current service node generates a log, stores it locally, and sends it to all standby service nodes. On receiving the log, a standby service node stores it locally and applies it to memory, staying consistent with the current service node. Meanwhile, while working normally, the current service node sends heartbeat confirmations to the standby service nodes, which confirm through this heartbeat mechanism that the current service node is alive. If the current service node encounters an abnormal condition such as a server crash, a service process restart, or a network disconnection, it can no longer send heartbeat confirmations, whereupon the standby service nodes initiate a switchover and elect a new current service node from among themselves to provide the metadata service externally.
As can be seen from the above, in current methods for improving the availability of metadata service nodes, triggering a switchover depends on a severe fault that renders the current service node completely unable to work; other faults that put the current service node into an abnormal state, such as slow responses, do not trigger a switchover, even though a current service node in such a state has already degraded the quality of service provided to users.
The current switching methods therefore recover the metadata service inefficiently and harm the user experience. Moreover, under the current methods, even if a switchover is triggered, it may switch to a service node that is itself already in an abnormal state, defeating the purpose of the switchover and reducing the efficiency of metadata service recovery. To solve these problems, several embodiments of a service node switching method are proposed below.
Referring to FIG. 1, a flowchart of the steps of Embodiment 1 of a method for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
Step 101: monitor the response status of the service nodes to service requests.
It should be noted that a service node may be a service node that provides a metadata service. Metadata, also called intermediary data or relay data, is data about data, mainly information describing data properties, used to support functions such as indicating storage locations, historical data, resource lookup, and file records.
In a specific implementation, several check threads may be set up in the system for different checkpoints to monitor whether the response status of a service node is abnormal, for example a first check thread focused on abnormal service request queue response times and/or a second check thread focused on abnormal storage unit response times.
It should be noted that whether the response time is abnormal can be determined by comparison with a preset response time threshold. For example, the response time threshold may be preset to 10 seconds: if the service node's response time for some service request in the service request queue exceeds 10 seconds, or the response time of a log read or write by the service node's storage unit exceeds 10 seconds, the response status of the service node can be considered abnormal, since its responses to service requests keep users waiting for a long time, or it may even fail to serve at all, which already affects the user experience.
As a preferred example of the embodiments of the present application, the response status of the service nodes to service requests may be monitored through multiple threads. In practice, a service node's response status may be affected by many factors, and those skilled in the art may combine multiple check threads as the situation requires; for example, a check thread focused on whether the storage unit is nearly full may also be set up.
Preferably, the check threads used to monitor the service nodes may have no ability to execute logic themselves, serving only for the logical judgments and statistics of response status checking. Check threads should stay as lightweight as possible and avoid computation-heavy or long-running work, for example RPC (Remote Procedure Call Protocol) operations or long lock waits; such operations may severely delay, or even effectively block, the subsequent service node switchover and ultimately rob the whole check mechanism of its intended effect.
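A check thread of this kind can be kept deliberately thin: it only timestamps, compares, and publishes a boolean verdict for someone else to act on. The sketch below is one possible shape under those constraints; the CheckThread name, the probe callable, and the registry's publish method are illustrative assumptions, not from the patent.

    import threading
    import time

    class CheckThread(threading.Thread):
        """Lightweight checker: no RPCs, no long lock waits, no side effects.
        It only computes a True/False verdict that an executor acts on later."""

        def __init__(self, name, probe, threshold_s, registry):
            super().__init__(name=name, daemon=True)
            self.probe = probe            # returns the timestamp to measure from
            self.threshold_s = threshold_s
            self.registry = registry      # shared monitoring-result registry

        def run(self):
            while True:
                last_ts = self.probe()    # e.g. last dequeue time of the request queue
                abnormal = (time.time() - last_ts) > self.threshold_s
                self.registry.publish(self.name, abnormal)  # verdict only, no action
                time.sleep(1.0)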
Step 102: if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
In a specific implementation, by monitoring the response status of the service nodes, it can be determined whether the response status of the current service node and the standby service nodes is abnormal. Corresponding actions can be taken for the different monitoring results of different service nodes; for a current service node whose response status is abnormal, its communication with the multiple standby service nodes can be stopped.
Mutual communication can be stopped in various ways, for example by stopping the heartbeat confirmations between the current service node and the standby service nodes: when a standby service node has not received a heartbeat confirmation from the current service node for more than a certain time, the current service node can be considered abnormal and failed, and a switchover of the current service node needs to be initiated.
When the standby service nodes cannot communicate with the current service node normally, the current service node can be regarded as abnormal, which triggers the election of a new current service node; the newly elected current service node replaces the abnormal one, completing the switchover of the current service node. The election of the current service node may be implemented through the Paxos protocol.
In practice, the switchover of the current service node may also be triggered in other ways, for example using a distributed lock service: if the standby service nodes consider the current service node abnormal and failed, the current service node is triggered to release the lock, the multiple standby service nodes contend for the lock, and the standby service node that acquires it replaces the abnormal current service node, completing the switchover.
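One conservative way to realize "stop communicating" is simply to stop emitting heartbeats, rather than killing or re-initializing the node, and let the normal failure-detection path do the rest. A hedged sketch with hypothetical names (HeartbeatSender, peer.send_heartbeat):

    import time

    class HeartbeatSender:
        """Heartbeat loop of the current service node. Suppressing it (rather
        than restarting the node) lets standby nodes time out and elect a
        replacement, while a misdiagnosed node stays intact and can be
        re-elected later."""

        def __init__(self, peers, interval_s=1.0):
            self.peers = peers            # standby service nodes
            self.interval_s = interval_s
            self.suppressed = False       # set True when response status is abnormal

        def suppress(self):
            self.suppressed = True        # triggers switchover via missed heartbeats

        def run(self):
            while True:
                if not self.suppressed:
                    for peer in self.peers:
                        peer.send_heartbeat()  # assumed RPC stub
                time.sleep(self.interval_s)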
For a current service node or standby service node whose response status is abnormal, an abnormal node flag may also be added; a service node carrying this flag will not be switched in as the new current service node during the switchover, lest an abnormal service node become the new current service node and the recovery have no effect.
As a preferred example of the embodiments of the present application, the monitoring results for at least one of the service nodes may be registered in a monitoring result registration list, and the registered monitoring results polled from the start of the monitoring result registration list.
In practice, a checkpoint execution module may be set up to take the corresponding action for each monitoring result. Each check thread may generate its monitoring result during system startup and register it in the monitoring result registration list of the checkpoint execution module. The checkpoint execution module may be a system background thread that polls the monitoring results one by one, in order, from the start of the registration list, and acts on each result accordingly. For example, in one monitoring result, the first check thread determines from the service request processing response time that the response status of the current service node is abnormal, whereupon the checkpoint execution module stops it from sending heartbeat confirmations to the standby service nodes and adds an abnormal node flag; in another monitoring result, the second check thread determines from a storage unit log read/write timeout that the response status of a standby service node is abnormal, whereupon the checkpoint execution module adds an abnormal node flag to that standby service node.
It should be noted that the checkpoint execution module need not care how each check thread implements its logical judgment, that is, how the thread actually monitors whether the service node is abnormal; it only cares whether the monitoring result indicates that the response status of the service node is abnormal. Specifically, whether the response status of a service node is abnormal can be represented by True and False, and a check thread may register just the True or False value as its monitoring result in the registration list of the checkpoint execution module.
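Concretely, the registration list can be modeled as an ordered map from check name to the latest True/False verdict, with the checkpoint execution module polling it from the start and reacting. The sketch below assumes this shape and continues the hypothetical publish interface from the check-thread sketch above; the node's mark_abnormal/stop_heartbeats methods are likewise assumed.

    import threading

    class MonitoringRegistry:
        """Ordered list of (check name -> abnormal?) verdicts. Check threads
        only publish booleans; the executor decides what to do about them."""

        def __init__(self):
            self._lock = threading.Lock()
            self._results = {}            # dicts keep insertion order in Python 3.7+

        def publish(self, check_name, abnormal):
            with self._lock:
                self._results[check_name] = abnormal

        def snapshot(self):
            with self._lock:
                return list(self._results.items())  # polled from the start, in order

    class CheckpointExecutor:
        """Background-thread logic: poll verdicts and act, without caring how
        each check thread reached its conclusion."""

        def __init__(self, registry, node):
            self.registry = registry
            self.node = node

        def poll_once(self):
            for check_name, abnormal in self.registry.snapshot():
                if abnormal:
                    self.node.mark_abnormal()        # add the abnormal node flag
                    if self.node.is_current:
                        self.node.stop_heartbeats()  # trigger the switchover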
In addition, the number of the multiple threads divided by the preset response time threshold may be taken as the frequency of monitoring the response status of the service nodes.
In practice, if the monitoring frequency is too low, monitoring results that reflect an abnormal service node may be missed, and a timely switchover when the current service node becomes abnormal cannot be guaranteed. To improve checking precision, the monitoring interval should therefore not exceed the preset response time threshold that any check thread uses to judge whether the response status is abnormal. For example, if the preset response time threshold is 10 seconds, the monitoring interval may be set to 1 second. For convenience, the monitoring frequency may be determined by dividing the number of check threads performing monitoring by the preset response time threshold. For example, with 10 check threads and a preset response time threshold of 10 seconds, the monitoring frequency is 1 per second; that is, the checkpoint execution module may fetch one monitoring result from the monitoring result registration list every second and act on it.
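The frequency rule in the example above amounts to a one-line computation; a tiny illustrative sketch (function name assumed):

    def monitoring_frequency(num_check_threads, threshold_s):
        """E.g. 10 threads / 10 s threshold => poll 1 result per second, so the
        executor's polling interval never exceeds any check's threshold."""
        return num_check_threads / threshold_s

    interval_s = 1.0 / monitoring_frequency(10, 10)  # -> 1.0 second between polls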
Those skilled in the art may apply the method provided by the embodiments of the present application to various distributed file systems and computing and storage platforms as the situation requires, for example the HDFS system (Hadoop Distributed File System), the ODPS computing platform (Open Data Processing Service), the OSS storage platform (Object Storage Service), the OTS storage platform (Open Table Service, a structured data service), the ECS computing platform (Elastic Compute Service), and so on.
Compared with current service node switching methods, embodiments of the present application monitor the response status of service nodes to service requests and, for a current service node whose response status is abnormal, stop its communication with the standby service nodes, thereby triggering a switchover of the current service node. Through service node check logic, logical judgments and statistics are made on the various factors that affect a service node's response status, so that when hardware faults or software defects cause service timeouts, service unavailability, or service exceptions, service nodes can switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
Furthermore, embodiments of the present application can monitor one or more of the factors that affect the response status of service nodes; this multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
Further, embodiments of the present application do not directly re-initialize an abnormal current service node; instead, the relatively conservative approach of stopping communication with the standby service nodes is used to trigger the switchover. In case of a misdiagnosis, a current service node mistakenly judged to have an abnormal response status still has the opportunity to become the new current service node again and continue providing service, which prevents misdiagnoses from negatively affecting the whole system.
Referring to FIG. 2, a flowchart of the steps of Embodiment 2 of a method for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
Step 201: obtain, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time.
It should be noted that the first check thread may be a check thread focused on whether the service request queue response time is abnormal. When a user submits a service request to the current service node, the request is first placed in the service request queue to wait to be processed one by one by the current service node. When the current service node takes a service request out of the queue, the time point T1 can be recorded.
The first check thread may check the service request queue periodically; when there are service requests waiting to be processed in the queue, the first check thread obtains the previously recorded time point T1 of the most recent dequeue and takes the interval D1 between the current time point N1 and T1 as the response time of the current service node.
Step 202: determine whether the response time is greater than a first preset response time threshold, and if so, determine that the response status of the service node is abnormal.
The response time can be compared with the first preset response time threshold; if the response time is greater, the current service node is blocked while processing users' service requests, leaving users waiting a long time for service. Therefore, regardless of whether the current service node is normal in other respects, its response status can be considered abnormal.
In practice, the first preset response time threshold may be set to 10 seconds; that is, if a user-submitted service request has not been responded to successfully within 10 seconds, the response status of the current service node can be considered abnormal. Of course, those skilled in the art may set the first preset response time threshold as the situation requires, and the embodiments of the present application place no restriction on this.
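Putting steps 201 and 202 together, the first check thread might look like the following sketch. The class and method names are illustrative, and the 10-second threshold is just the example value from the text:

    import time

    FIRST_THRESHOLD_S = 10.0  # example value from the text

    class RequestQueueCheck:
        """First check thread logic: compare 'now' against the time point T1
        of the most recent dequeue from the service request queue."""

        def __init__(self, node):
            self.node = node  # exposes last_dequeue_time (T1) and queue_length

        def check(self):
            if self.node.queue_length() == 0:
                return False               # nothing waiting, nothing to judge
            t1 = self.node.last_dequeue_time()
            d1 = time.time() - t1          # D1 = N1 - T1, the response time
            return d1 > FIRST_THRESHOLD_S  # True => response status abnormal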
The first check thread may register the abnormal-or-normal monitoring result in the monitoring result registration list of the checkpoint execution module, which then takes the corresponding action based on the monitoring result.
Step 203: if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
Step 204: if the response status of a service node is abnormal, add to that service node an abnormal node flag marking it as not participating in the switchover of the current service node.
For a current service node whose response status is abnormal, the checkpoint execution module may stop its communication with the multiple standby service nodes. For a current service node or standby service node whose response status is abnormal, an abnormal node flag may also be added; a service node carrying this flag will not be switched in as the new current service node during the switchover.
As a preferred example of the embodiments of the present application, the step of triggering a switchover of the current service node may include: triggering the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
In practice, if the current service node is elected through the Paxos protocol, service nodes carrying the abnormal node flag do not participate in the election; when the standby service nodes trigger a switchover of the current service node, a flagged service node does not take part in the election and thus will not be chosen as the new current service node. If service nodes are switched through a distributed lock service, flagged service nodes do not contend for the lock, and only normal service nodes participate in the contention.
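Whatever the election mechanism, the flag acts as an eligibility filter before candidacy. A minimal sketch of that filtering step, assuming a generic elect primitive rather than a full Paxos implementation (all names hypothetical):

    def switch_current_node(nodes, elect):
        """Pick a new current service node from nodes that do not carry the
        abnormal node flag. `elect` stands in for Paxos voting or lock grabbing."""
        candidates = [n for n in nodes if not n.abnormal_flag]
        if not candidates:
            return None                   # no healthy standby available
        new_current = elect(candidates)   # flagged nodes never enter the election
        new_current.promote()             # start serving metadata externally
        return new_current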
Embodiments of the present application add an abnormal node flag to service nodes whose response status is abnormal, which prevents a service node with an abnormal response status from being selected as the current service node and thus defeating the purpose of the switchover. Moreover, with abnormal service nodes excluded, the switchover guarantees the stability of the new current service node, avoids the system fluctuation caused by repeated switchovers, and improves the stability of service recovery.
Referring to FIG. 3, a flowchart of the steps of Embodiment 3 of a method for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
Step 301: obtain, through a logging thread, the start and end times of a log write by the storage unit of the service node, and take the interval between the start and end times as the storage unit response time.
It should be noted that a log is generated when a user submits a service request; both the current service node and the standby service nodes need to record the log to the storage unit through the logging thread before returning a notification that the user's service request was processed successfully, so the response time of the storage unit directly affects the response time to service requests.
In a specific implementation, the logging thread records the start time point at which the service node begins writing the log and the end time point at which the write finishes, and takes the interval between the start and end time points as the storage unit response time.
Step 302: determine whether the storage unit response time is greater than a third preset response time threshold, and if so, add the storage response timeout flag to the storage unit and record the time point T2 at which the flag was added.
Whether the storage unit response time is greater than the third preset response time threshold is determined; if so, the storage unit of the service node is abnormal, and a storage response timeout flag can be added to the storage unit, recording the time point T2 at which the flag was added. If the service node's storage unit already carries a storage response timeout flag, no flag-adding processing is needed.
Step 303: if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, delete the storage response timeout flag.
In practice, what the check thread needs to watch for is the abnormal situation of consecutively slow storage unit responses; a single slow storage unit response may be caused by a chance factor and can be ignored for the moment to avoid a false positive. Therefore, if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, the flag can be deleted.
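Steps 301 to 303 together form a small hysteresis loop around the timeout flag: a slow write sets it, a fast write clears it, and only a flag that survives long enough (step 305) counts as abnormal. A sketch under those assumptions, with hypothetical names (StorageUnit, device.append):

    import time

    THIRD_THRESHOLD_S = 10.0  # example storage-unit write threshold

    class StorageUnit:
        """Tracks the storage response timeout flag for log writes (steps 301-303)."""

        def __init__(self, device):
            self.device = device
            self.timeout_flag = False
            self.flag_added_at = None     # T2, read later by the second check thread

        def write_log(self, entry):
            start = time.time()
            self.device.append(entry)     # assumed blocking write to shared storage
            elapsed = time.time() - start # storage unit response time

            if elapsed > THIRD_THRESHOLD_S:
                if not self.timeout_flag:         # don't reset T2 on repeat timeouts
                    self.timeout_flag = True
                    self.flag_added_at = time.time()
            elif self.timeout_flag:
                # One fast write clears the flag: a single slow write may be a fluke.
                self.timeout_flag = False
                self.flag_added_at = None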
Step 304: determine, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag.
Step 305: if so, take the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time; if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
The second check thread may determine whether the storage unit carries the storage response timeout flag and act according to the result.
If the flag is carried, the time point T2 at which it was added is obtained from the logging thread, and subtracting it from the current time point N2 gives the interval D2, taken as the service node's response time. If this response time is greater than the second preset response time threshold, the service node is taking too long to record logs to the storage unit, which affects the response time to service requests. Therefore, regardless of whether the current service node is normal in other respects, its response status can be considered abnormal.
The second check thread registers the abnormal-or-normal monitoring result in the monitoring result registration list of the checkpoint execution module, which takes the corresponding action based on the monitoring result. For example, if the second check thread finds that the storage unit's response time has not dropped below 30 milliseconds within 30 seconds, it can judge that the service node's storage unit is abnormal, making the service node's response status abnormal.
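The second check thread then inspects only the flag and its age, never the storage device itself. A sketch continuing the hypothetical StorageUnit example above; the 30-second value mirrors the example in the text:

    import time

    SECOND_THRESHOLD_S = 30.0  # e.g. flag must persist 30 s to count as abnormal

    class StorageTimeoutCheck:
        """Second check thread logic (steps 304-305): abnormal only if the
        storage response timeout flag has persisted longer than the second
        preset response time threshold."""

        def __init__(self, storage_unit):
            self.storage_unit = storage_unit

        def check(self):
            if not self.storage_unit.timeout_flag:
                return False
            d2 = time.time() - self.storage_unit.flag_added_at  # D2 = N2 - T2
            return d2 > SECOND_THRESHOLD_S  # True => response status abnormal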
Step 306: if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
Step 307: if the response status of a service node is abnormal, add to that service node an abnormal node flag marking it as not participating in the switchover of the current service node.
For a current service node whose response status is abnormal, the checkpoint execution module may stop its communication with the multiple standby service nodes. For a current service node or standby service node whose response status is abnormal, an abnormal node flag may also be added; a service node carrying this flag will not be switched in as the new current service node during the switchover.
It should be noted that Steps 301 to 303 may be performed in a loop, repeatedly collecting and comparing the response times of storage unit log writes, specifically judging whether there is an abnormal situation of consecutively slow storage unit responses, and updating the storage unit's storage response timeout flag accordingly, so that the second check thread can act on the flag.
It should be noted that, for simplicity of description, the method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
Referring to FIG. 4, a flowchart of the steps of Embodiment 4 of a method for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the method may specifically include the following steps:
Step 401: monitor the response status of the service nodes to service requests through multiple threads.
Whether the response status of the service nodes to service requests is abnormal can be monitored through multiple check threads, each focused on a different aspect of the service nodes. In practice, a service node's response status may be affected by many factors; whether a single factor or several factors acting together affect the response status of a service node, targeted monitoring is possible. Therefore, for more comprehensive and flexible monitoring of the service nodes, a combination of multiple threads may be used. Of course, the number of threads and their specific combination may be decided by those skilled in the art according to the actual situation.
As a preferred example of the embodiments of the present application, step 401 may specifically include the following substeps:
Substep S11: obtain, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time.
Substep S12: determine whether the response time is greater than a first preset response time threshold, and if so, determine that the response status of the service node is abnormal.
Substep S13: determine, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag.
Substep S14: if so, take the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time; if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
The first check thread above may be a thread focused on whether the service request queue response time is abnormal; monitoring the service nodes through it targets abnormalities caused by overly slow responses in processing the service request queue. The second check thread above may be a thread focused on whether the storage unit response time is abnormal; monitoring the service nodes through it targets abnormalities caused by overly slow log writes by the storage unit. It should be noted that the substeps above have no fixed order; that is, monitoring may proceed through the first check thread and the second check thread simultaneously.
Step 402: if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
Step 403: if the response status of a service node is abnormal, add to that service node an abnormal node flag marking it as not participating in the switchover of the current service node.
By monitoring the response status of the service nodes to service requests through the first check thread and the second check thread simultaneously, both the service request queue response time and the storage unit log write response time can be watched; when a problem arises in either respect, a switchover of the service node can be triggered and an abnormal node flag added as appropriate, so that multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
Referring to FIG. 5, a structural block diagram of Embodiment 1 of an apparatus for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the apparatus may specifically include the following modules:
a service node response status monitoring module 501, configured to monitor the response status of the service nodes to service requests;
a current service node switchover triggering module 502, configured to, if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
Embodiments of the present application monitor the response status of service nodes to service requests and, for a current service node whose response status is abnormal, stop its communication with the standby service nodes, thereby triggering a switchover of the current service node. Through service node check logic, logical judgments and statistics are made on the various factors that affect a service node's response status, so that when hardware faults or software defects cause service timeouts, service unavailability, or service exceptions, service nodes can switch over and recover autonomously, which enhances service availability, improves service recovery efficiency, and improves the user experience.
Furthermore, embodiments of the present application can monitor one or more of the factors that affect the response status of service nodes; this multi-dimensional monitoring improves the comprehensiveness and extensibility of service recovery.
Further, embodiments of the present application do not directly re-initialize an abnormal current service node; instead, the relatively conservative approach of stopping communication with the standby service nodes is used to trigger the switchover. In case of a misdiagnosis, a current service node mistakenly judged to have an abnormal response status still has the opportunity to become the new current service node again and continue providing service, which prevents misdiagnoses from negatively affecting the whole system.
Referring to FIG. 6, a structural block diagram of Embodiment 2 of an apparatus for switching service nodes in a distributed storage system according to the present application is shown, where the service nodes include a current service node and standby service nodes, and the apparatus may specifically include the following modules:
a monitoring result registration module 601, configured to register the monitoring results for at least one of the service nodes in a monitoring result registration list;
a service node response status monitoring module 602, configured to monitor the response status of the service nodes to service requests;
a current service node switchover triggering module 603, configured to, if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node;
an abnormal node flag adding module 604, configured to, if the response status of a service node is abnormal, add to that service node an abnormal node flag marking it as not participating in the switchover of the current service node;
a monitoring frequency determining module 605, configured to take the number of the multiple threads divided by the preset response time threshold as the frequency of monitoring the response status of the service nodes.
As a preferred example of the embodiments of the present application, the apparatus may further include:
a storage unit response time determining module, configured to obtain, through a logging thread, the start and end times of a log write by the storage unit of the service node, and take the interval between the start and end times as the storage unit response time;
a storage response timeout flag adding module, configured to determine whether the storage unit response time is greater than a third preset response time threshold, and if so, add the storage response timeout flag to the storage unit and record the time point T2 at which the flag was added;
a storage response timeout flag deleting module, configured to, if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, delete the storage response timeout flag.
As a preferred example of the embodiments of the present application, the current service node switchover triggering module 603 may include the following submodule:
a selection triggering submodule, configured to trigger the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
As a preferred example of the embodiments of the present application, the service node response status monitoring module 602 may include the following submodule:
a multi-thread monitoring submodule, configured to monitor the response status of the service nodes to service requests through multiple threads.
As a first preferred example of the embodiments of the present application, the multi-thread monitoring submodule may include the following subunits:
a first check thread subunit, configured to obtain, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time;
a first preset response time threshold judging subunit, configured to determine whether the response time is greater than a first preset response time threshold, and if so, determine that the response status of the service node is abnormal.
As a second preferred example of the embodiments of the present application, the multi-thread monitoring submodule may include the following subunits:
a second check thread subunit, configured to determine, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag, and if so, invoke a second preset response time threshold judging subunit;
the second preset response time threshold judging subunit, configured to take the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time, and, if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
As a preferred example of the embodiments of the present application, the service node response status monitoring module 602 may include the following submodule:
a monitoring result polling submodule, configured to poll the registered monitoring results from the start of the monitoring result registration list.
Embodiments of the present application add an abnormal node flag to service nodes whose response status is abnormal, which prevents a service node with an abnormal response status from being selected as the current service node and thus defeating the purpose of the switchover. Moreover, with abnormal service nodes excluded, the switchover guarantees the stability of the new current service node, avoids the system fluctuation caused by repeated switchovers, and improves the stability of service recovery.
Since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made to one another.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-permanent storage in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the embodiments of the present application have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or also elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The method for switching service nodes in a distributed storage system and the apparatus for switching service nodes in a distributed storage system provided by the present application have been described above in detail. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments serve only to help understand the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and scope of application according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (20)

  1. A method for switching service nodes in a distributed storage system, the service nodes comprising a current service node and standby service nodes, wherein the method comprises:
    monitoring the response status of the service nodes to service requests;
    if the response status of the current service node is abnormal, stopping communication between the current service node and the standby service nodes, and triggering a switchover of the current service node.
  2. The method according to claim 1, further comprising:
    if the response status of a service node is abnormal, adding to the service node an abnormal node flag marking it as not participating in the switchover of the current service node.
  3. The method according to claim 2, wherein the step of triggering a switchover of the current service node comprises:
    triggering the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
  4. The method according to claim 1, wherein the step of monitoring the response status of the service nodes to service requests comprises:
    monitoring the response status of the service nodes to service requests through multiple threads.
  5. The method according to claim 4, wherein the step of monitoring the response status of the service nodes to service requests through multiple threads comprises:
    obtaining, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and taking the interval D1 between T1 and the current time point N1 as the response time;
    determining whether the response time is greater than a first preset response time threshold, and if so, determining that the response status of the service node is abnormal.
  6. The method according to claim 4, wherein the step of monitoring the response status of the service nodes to service requests through multiple threads comprises:
    determining, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag;
    if so, taking the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time, and, if the response time is greater than a second preset response time threshold, determining that the response status of the service node is abnormal.
  7. The method according to claim 6, wherein before the step of monitoring the response status of the service nodes to service requests through multiple threads, the method further comprises:
    obtaining, through a logging thread, the start and end times of a log write by the storage unit of the service node, and taking the interval between the start and end times as the storage unit response time;
    determining whether the storage unit response time is greater than a third preset response time threshold, and if so, adding the storage response timeout flag to the storage unit and recording the time point T2 at which the flag was added.
  8. The method according to claim 7, further comprising:
    if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, deleting the storage response timeout flag.
  9. The method according to claim 1, wherein before the step of monitoring the response status of the service nodes to service requests, the method further comprises:
    registering the monitoring results for at least one of the service nodes in a monitoring result registration list;
    the step of monitoring the response status of the service nodes to service requests being:
    polling the registered monitoring results from the start of the monitoring result registration list.
  10. The method according to claim 4, further comprising:
    taking the number of the multiple threads divided by the preset response time threshold as the frequency of monitoring the response status of the service nodes.
  11. An apparatus for switching service nodes in a distributed storage system, the service nodes comprising a current service node and standby service nodes, wherein the apparatus comprises:
    a service node response status monitoring module, configured to monitor the response status of the service nodes to service requests;
    a current service node switchover triggering module, configured to, if the response status of the current service node is abnormal, stop communication between the current service node and the standby service nodes and trigger a switchover of the current service node.
  12. The apparatus according to claim 11, further comprising:
    an abnormal node flag adding module, configured to, if the response status of a service node is abnormal, add to the service node an abnormal node flag marking it as not participating in the switchover of the current service node.
  13. The apparatus according to claim 12, wherein the current service node switchover triggering module comprises:
    a selection triggering submodule, configured to trigger the selection of at least one service node that does not carry the abnormal node flag as the new current service node, to replace the current service node whose response status is abnormal.
  14. The apparatus according to claim 11, wherein the service node response status monitoring module comprises:
    a multi-thread monitoring submodule, configured to monitor the response status of the service nodes to service requests through multiple threads.
  15. The apparatus according to claim 14, wherein the multi-thread monitoring submodule comprises:
    a first check thread subunit, configured to obtain, through a first check thread, the time point T1 at which the service node most recently took a service request out of the service request queue, and take the interval D1 between T1 and the current time point N1 as the response time;
    a first preset response time threshold judging subunit, configured to determine whether the response time is greater than a first preset response time threshold, and if so, determine that the response status of the service node is abnormal.
  16. The apparatus according to claim 14, wherein the multi-thread monitoring submodule comprises:
    a second check thread subunit, configured to determine, through a second check thread, whether the storage unit of the service node carries a storage response timeout flag, and if so, invoke a second preset response time threshold judging subunit;
    the second preset response time threshold judging subunit, configured to take the interval D2 between the time point T2 at which the storage response timeout flag was added and the current time point N2 as the response time, and, if the response time is greater than a second preset response time threshold, determine that the response status of the service node is abnormal.
  17. The apparatus according to claim 16, further comprising:
    a storage unit response time determining module, configured to obtain, through a logging thread, the start and end times of a log write by the storage unit of the service node, and take the interval between the start and end times as the storage unit response time;
    a storage response timeout flag adding module, configured to determine whether the storage unit response time is greater than a third preset response time threshold, and if so, add the storage response timeout flag to the storage unit and record the time point T2 at which the flag was added.
  18. The apparatus according to claim 17, further comprising:
    a storage response timeout flag deleting module, configured to, if the storage unit response time is less than the third preset response time threshold and the storage unit already carries the storage response timeout flag, delete the storage response timeout flag.
  19. The apparatus according to claim 11, further comprising:
    a monitoring result registration module, configured to register the monitoring results for at least one of the service nodes in a monitoring result registration list;
    the service node response status monitoring module comprising:
    a monitoring result polling submodule, configured to poll the registered monitoring results from the start of the monitoring result registration list.
  20. The apparatus according to claim 14, further comprising:
    a monitoring frequency determining module, configured to take the number of the multiple threads divided by the preset response time threshold as the frequency of monitoring the response status of the service nodes.
PCT/CN2016/107422 2015-12-08 2016-12-08 Method and apparatus for switching service nodes in a distributed storage system WO2017097130A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/776,938 US10862740B2 (en) 2015-12-08 2016-12-08 Method and apparatus for switching service nodes in a distributed storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510897877.XA 2015-12-08 Method and apparatus for switching service nodes in a distributed storage system
CN201510897877.X 2015-12-08

Publications (1)

Publication Number Publication Date
WO2017097130A1 true WO2017097130A1 (zh) 2017-06-15

Family

ID=59013730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/107422 WO2017097130A1 (zh) 2015-12-08 2016-12-08 一种分布式存储系统的服务节点切换方法和装置

Country Status (3)

Country Link
US (1) US10862740B2 (zh)
CN (1) CN106856489B (zh)
WO (1) WO2017097130A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674060A (zh) * 2019-09-06 2020-01-10 平安普惠企业管理有限公司 服务的熔断控制方法及装置
CN111159009A (zh) * 2019-11-29 2020-05-15 深圳智链物联科技有限公司 一种日志服务系统的压力测试方法及装置
CN111400090A (zh) * 2020-03-04 2020-07-10 江苏蓝创智能科技股份有限公司 一种数据监测系统
CN111858171A (zh) * 2020-07-10 2020-10-30 上海达梦数据库有限公司 一种数据备份方法、装置、设备及存储介质
CN112631711A (zh) * 2019-09-24 2021-04-09 北京金山云网络技术有限公司 容器集群中Master节点的调整方法、装置及服务器
CN115277847A (zh) * 2022-07-27 2022-11-01 阿里巴巴(中国)有限公司 服务处理方法、装置、设备、平台、介质及程序产品

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180183695A1 (en) * 2016-12-28 2018-06-28 Intel Corporation Performance monitoring
CN107360025B (zh) * 2017-07-07 2020-11-10 郑州云海信息技术有限公司 一种分布式存储系统集群监控方法及设备
CN107426038A (zh) * 2017-09-12 2017-12-01 郑州云海信息技术有限公司 一种采集数据的分布式集群系统及数据采集方法
US11178014B1 (en) * 2017-09-28 2021-11-16 Amazon Technologies, Inc. Establishment and control of grouped autonomous device networks
CN109697193A (zh) * 2017-10-24 2019-04-30 中兴通讯股份有限公司 一种确定异常节点的方法、节点及计算机可读存储介质
CN108073460B (zh) * 2017-12-29 2020-12-04 北京奇虎科技有限公司 分布式系统中的全局锁抢占方法、装置及计算设备
CN108933824A (zh) * 2018-06-28 2018-12-04 郑州云海信息技术有限公司 一种保持RabbitMQ服务的方法、系统及相关装置
CN108900356A (zh) * 2018-07-25 2018-11-27 郑州云海信息技术有限公司 一种云服务部署方法和系统
CN110764963B (zh) * 2018-07-28 2023-05-09 阿里巴巴集团控股有限公司 一种服务异常处理方法、装置及设备
CN109407981A (zh) * 2018-09-28 2019-03-01 深圳市茁壮网络股份有限公司 一种数据处理方法及装置
CN109462646B (zh) * 2018-11-12 2021-11-19 平安科技(深圳)有限公司 一种异常响应的方法及设备
CN109584105B (zh) * 2018-11-12 2023-10-17 平安科技(深圳)有限公司 一种服务响应的方法及系统
CN109766210B (zh) * 2019-01-17 2022-04-22 多点生活(成都)科技有限公司 服务熔断控制方法、服务熔断控制装置和服务器集群
CN110297648A (zh) * 2019-06-12 2019-10-01 阿里巴巴集团控股有限公司 应用自动降级和恢复方法和系统
CN110569303B (zh) * 2019-08-19 2020-12-08 杭州衣科信息技术有限公司 一种适用于多种云环境的MySQL应用层高可用系统及方法
CN110691133B (zh) * 2019-09-29 2020-11-24 河南信大网御科技有限公司 一种应用于网络通信设备的web服务拟态系统及方法
CN111124731A (zh) * 2019-12-20 2020-05-08 浪潮电子信息产业股份有限公司 一种文件系统异常监测方法、装置、设备、介质
CN111190538A (zh) * 2019-12-20 2020-05-22 北京淇瑀信息科技有限公司 文件存储方法、系统、设备和计算机可读介质
CN111405077B (zh) * 2020-03-05 2023-02-17 深圳前海百递网络有限公司 域名切换方法、装置、计算机可读存储介质和计算机设备
CN111400263A (zh) * 2020-03-16 2020-07-10 上海英方软件股份有限公司 一种基于文件变化的监控回切方法及装置
CN111381969B (zh) * 2020-03-16 2021-10-26 北京康吉森技术有限公司 一种分布式软件的管理方法及其系统
CN111552441B (zh) * 2020-04-29 2023-02-28 重庆紫光华山智安科技有限公司 数据存储方法和装置、主节点及分布式系统
CN111770154B (zh) * 2020-06-24 2023-12-05 百度在线网络技术(北京)有限公司 服务检测方法、装置、设备以及存储介质
JP7462550B2 (ja) 2020-12-24 2024-04-05 株式会社日立製作所 通信監視対処装置、通信監視対処方法、及び通信監視対処システム
CN113904914A (zh) * 2020-12-31 2022-01-07 京东科技控股股份有限公司 一种服务切换方法、装置、系统和存储介质
CN113055246B (zh) * 2021-03-11 2022-11-22 中国工商银行股份有限公司 异常服务节点识别方法、装置、设备及存储介质
CN113824812B (zh) * 2021-08-27 2023-02-28 济南浪潮数据技术有限公司 一种hdfs服务获取服务节点ip的方法、装置及存储介质
CN113806129A (zh) * 2021-09-16 2021-12-17 北京沃东天骏信息技术有限公司 一种服务请求处理方法和装置
CN114666389B (zh) * 2022-03-14 2024-05-17 京东科技信息技术有限公司 分布式系统中节点状态的检测方法、装置及计算机设备
CN116263727A (zh) * 2022-10-21 2023-06-16 中移(苏州)软件技术有限公司 主备数据库集群及选主方法、计算设备及计算机存储介质
CN116881052B (zh) * 2023-09-07 2023-11-24 上海凯翔信息科技有限公司 一种分布式存储的数据修复系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217402A (zh) * 2008-01-15 2008-07-09 杭州华三通信技术有限公司 一种提高集群可靠性的方法和一种高可靠性通信节点
CN101895447A (zh) * 2010-08-31 2010-11-24 迈普通信技术股份有限公司 Sip中继网关故障监控方法以及sip中继网关
CN102231681A (zh) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 一种高可用集群计算机系统及其故障处理方法
CN102427412A (zh) * 2011-12-31 2012-04-25 网宿科技股份有限公司 基于内容分发网络的零延时主备源灾备切换方法和系统
US20130051220A1 (en) * 2011-08-22 2013-02-28 Igor Ryshakov Method and Apparatus for Quick-Switch Fault Tolerant Backup Channel

Family Cites Families (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5159691A (en) 1988-03-07 1992-10-27 Sharp Kabushiki Kaisha Master and slave CPU system where the slave changes status bits of a centrol table when modes are changed
US6194969B1 (en) 1999-05-19 2001-02-27 Sun Microsystems, Inc. System and method for providing master and slave phase-aligned clocks
US7269135B2 (en) 2002-04-04 2007-09-11 Extreme Networks, Inc. Methods and systems for providing redundant connectivity across a network using a tunneling protocol
WO2004081762A2 (en) * 2003-03-12 2004-09-23 Lammina Systems Corporation Method and apparatus for executing applications on a distributed computer system
US7526549B2 (en) * 2003-07-24 2009-04-28 International Business Machines Corporation Cluster data port services for clustered computer system
US7340169B2 (en) 2003-11-13 2008-03-04 Intel Corporation Dynamic route discovery for optical switched networks using peer routing
KR20050087182A (ko) * 2004-02-26 2005-08-31 삼성전자주식회사 이중화 장치 및 그 운용방법
DE102004054571B4 (de) * 2004-11-11 2007-01-25 Sysgo Ag Verfahren zur Verteilung von Rechenzeit in einem Rechnersystem
US7808889B1 (en) 2004-11-24 2010-10-05 Juniper Networks, Inc. Silent failover from a primary control unit to a backup control unit of a network device
CN100391162C (zh) * 2005-04-13 2008-05-28 华为技术有限公司 一种切换服务器的控制方法
CN101166303A (zh) * 2006-10-17 2008-04-23 中兴通讯股份有限公司 多媒体广播组播服务会话开始的异常处理系统
CN100568819C (zh) * 2006-10-17 2009-12-09 中兴通讯股份有限公司 多媒体广播组播服务会话开始的异常处理方法
DE102006060081B3 (de) 2006-12-19 2008-06-19 Infineon Technologies Ag Kommunikationsvorrichtung zur Kommunikation nach einer Master-Slave-Kommunikationsvorschrift
CA2630014C (en) 2007-05-18 2014-05-27 Nec Infrontia Corporation Main device redundancy configuration and main device replacing method
KR20100027162A (ko) * 2007-06-26 2010-03-10 톰슨 라이센싱 실시간 프로토콜 스트림 마이그레이션
GB2455702B (en) 2007-11-29 2012-05-09 Ubidyne Inc A method and apparatus for autonomous port role assignments in master-slave networks
US8570898B1 (en) 2008-10-24 2013-10-29 Marvell International Ltd. Method for discovering devices in a wireless network
US8225021B2 (en) 2009-05-28 2012-07-17 Lexmark International, Inc. Dynamic address change for slave devices on a shared bus
CN101729290A (zh) * 2009-11-04 2010-06-09 中兴通讯股份有限公司 用于实现业务系统保护的方法及装置
CN102149205B (zh) 2010-02-09 2016-06-15 中兴通讯股份有限公司 一种中继节点的状态管理方法及系统
US9344494B2 (en) 2011-08-30 2016-05-17 Oracle International Corporation Failover data replication with colocation of session state data
KR101243434B1 (ko) * 2011-10-20 2013-03-13 성균관대학교산학협력단 게이트웨이를 이용한 필드버스 동기화 방법 및 게이트웨이를 이용한 필드버스 동기화 시스템
CN102521339B (zh) * 2011-12-08 2014-11-19 北京京东世纪贸易有限公司 用于动态访问数据源的系统和方法
CN103383689A (zh) * 2012-05-03 2013-11-06 阿里巴巴集团控股有限公司 一种服务进程故障检测方法、装置及服务节点
CN103546590A (zh) * 2013-10-18 2014-01-29 北京奇虎科技有限公司 一种dns服务器的选择方法与装置
US10783030B2 (en) 2014-03-12 2020-09-22 Sensia Llc Network synchronization for master and slave devices
US9489270B2 (en) 2014-07-31 2016-11-08 International Business Machines Corporation Managing backup operations from a client system to a primary server and secondary server
US9760529B1 (en) * 2014-09-17 2017-09-12 Amazon Technologies, Inc. Distributed state manager bootstrapping

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217402A (zh) * 2008-01-15 2008-07-09 杭州华三通信技术有限公司 一种提高集群可靠性的方法和一种高可靠性通信节点
CN101895447A (zh) * 2010-08-31 2010-11-24 迈普通信技术股份有限公司 Sip中继网关故障监控方法以及sip中继网关
CN102231681A (zh) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 一种高可用集群计算机系统及其故障处理方法
US20130051220A1 (en) * 2011-08-22 2013-02-28 Igor Ryshakov Method and Apparatus for Quick-Switch Fault Tolerant Backup Channel
CN102427412A (zh) * 2011-12-31 2012-04-25 网宿科技股份有限公司 基于内容分发网络的零延时主备源灾备切换方法和系统

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674060A (zh) * 2019-09-06 2020-01-10 平安普惠企业管理有限公司 服务的熔断控制方法及装置
CN112631711A (zh) * 2019-09-24 2021-04-09 北京金山云网络技术有限公司 容器集群中Master节点的调整方法、装置及服务器
CN111159009A (zh) * 2019-11-29 2020-05-15 深圳智链物联科技有限公司 一种日志服务系统的压力测试方法及装置
CN111159009B (zh) * 2019-11-29 2024-03-12 深圳智链物联科技有限公司 一种日志服务系统的压力测试方法及装置
CN111400090A (zh) * 2020-03-04 2020-07-10 江苏蓝创智能科技股份有限公司 一种数据监测系统
CN111858171A (zh) * 2020-07-10 2020-10-30 上海达梦数据库有限公司 一种数据备份方法、装置、设备及存储介质
CN111858171B (zh) * 2020-07-10 2024-03-12 上海达梦数据库有限公司 一种数据备份方法、装置、设备及存储介质
CN115277847A (zh) * 2022-07-27 2022-11-01 阿里巴巴(中国)有限公司 服务处理方法、装置、设备、平台、介质及程序产品

Also Published As

Publication number Publication date
US20180331888A1 (en) 2018-11-15
US10862740B2 (en) 2020-12-08
CN106856489B (zh) 2020-09-08
CN106856489A (zh) 2017-06-16

Similar Documents

Publication Publication Date Title
WO2017097130A1 (zh) 一种分布式存储系统的服务节点切换方法和装置
US10838777B2 (en) Distributed resource allocation method, allocation node, and access node
TWI685226B (zh) 分散式環境下的服務定址方法及裝置
US10884623B2 (en) Method and apparatus for upgrading a distributed storage system
JP5714571B2 (ja) キャッシュクラスタを構成可能モードで用いるキャッシュデータ処理
CA2533737C (en) Fast application notification in a clustered computing system
US8930527B2 (en) High availability enabler
US11330071B2 (en) Inter-process communication fault detection and recovery system
JP2002091938A (ja) フェールオーバを処理するシステムおよび方法
CN106034137A (zh) 用于分布式系统的智能调度方法及分布式服务系统
US9652307B1 (en) Event system for a distributed fabric
CN113641511A (zh) 一种消息通信方法和装置
CN107872517B (zh) 一种数据处理方法及装置
CN111865632B (zh) 分布式数据存储集群的切换方法及切换指令发送方法和装置
US11397632B2 (en) Safely recovering workloads within a finite timeframe from unhealthy cluster nodes
WO2018072561A1 (zh) 一种视频切换方法、装置及视频巡逻系统
CN109582459A (zh) 应用的托管进程进行迁移的方法及装置
CN115001956B (zh) 服务器集群的运行方法、装置、设备及存储介质
US10348814B1 (en) Efficient storage reclamation for system components managing storage
TWI740885B (zh) 分布式儲存系統的服務節點切換方法及裝置
US20050234919A1 (en) Cluster system and an error recovery method thereof
CN113867915A (zh) 任务调度方法、电子设备及存储介质
US10701167B1 (en) Adaptive quorum for a message broker service
US10110502B1 (en) Autonomous host deployment in managed deployment systems
US20180309702A1 (en) Method and device for processing data after restart of node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16834228

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15776938

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16834228

Country of ref document: EP

Kind code of ref document: A1