CN114443438A - Node state detection method, node abnormity processing method and device - Google Patents

Info

Publication number
CN114443438A
Authority
CN
China
Prior art keywords
target
node
detection
operation data
exception handling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210111973.7A
Other languages
Chinese (zh)
Inventor
董善义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210111973.7A
Publication of CN114443438A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805 Real-time

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application discloses a node state detection method, a node abnormality processing method, and corresponding devices. The method includes: receiving a target detection indicator, together with the operation data corresponding to that indicator, sent by an agent terminal deployed on a target node in a target cluster, where the target node is a node whose target detection indicator is in an abnormal state; determining, from the operation data, the target exception type corresponding to the target detection indicator; querying for the target exception handling script corresponding to the target exception type; and, if the target exception handling script exists, executing it to restore the target detection indicator of the target node to a normal state. When a node in the target cluster becomes abnormal, the method automatically obtains and executes an exception handling script according to the node's exception type, so that the node self-heals promptly after the abnormality occurs and the target cluster remains highly available.

Figure 202210111973

Description

Node state detection method, node abnormality processing method, and device

Technical field

The present application relates to the field of computer technology, and in particular to a node state detection method, a node abnormality processing method, and corresponding devices.

Background

Kubernetes (k8s) currently provides an official fault detection function for nodes, but that function still runs into many problems in real production and offers no node self-healing capability. Tencent has added node self-healing on top of the official k8s node fault detection function, but its self-healing only performs a restart in response to the surface symptom. Production experience shows that such a simple restart does not solve the problem, because a component that has failed to start will usually fail again when it is restarted.

Summary of the invention

To solve the above technical problems, or at least partially solve them, the present application provides a node state detection method, a node abnormality processing method, and corresponding devices.

According to one aspect of the embodiments of the present application, a node abnormality processing method is provided. The method is applied to a controller deployed on a master node in a target cluster, and includes:

receiving a target detection indicator, and the operation data corresponding to the target detection indicator, sent by an agent terminal deployed on a target node in the target cluster, where the target node is a node whose target detection indicator is in an abnormal state;

determining, according to the operation data, the target exception type corresponding to the target detection indicator;

querying for the target exception handling script corresponding to the target exception type;

if the target exception handling script exists, executing the target exception handling script to restore the target detection indicator of the target node to a normal state.

Further, querying for the exception handling script corresponding to the target exception type includes:

reading the mapping between exception types and exception handling scripts from the cache of the controller;

obtaining, based on the mapping, the target exception handling script corresponding to the target exception type.

Further, if the exception handling script does not exist, the method also includes:

sending the target detection indicator, and the operation data corresponding to the target detection indicator, to a target client;

receiving the target exception handling script that the target client returns based on the target detection indicator and the operation data;

establishing a mapping between the target exception handling script and the target exception type, and storing the mapping in the cache of the controller.
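The lookup and fallback steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `resolve_script`, the dictionary used as the controller cache, and the `ask_client` callback standing in for the target client are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical controller cache: exception type -> exception handling script text.
ScriptCache = Dict[str, str]


def resolve_script(
    exception_type: str,
    cache: ScriptCache,
    ask_client: Callable[[str, dict], Optional[str]],
    indicator: str,
    operation_data: dict,
) -> Optional[str]:
    """Return the handling script for exception_type, consulting the cache first."""
    script = cache.get(exception_type)
    if script is not None:
        return script
    # Cache miss: forward the indicator and its operation data to the client
    # (assumed here to be a callable that returns a script or None).
    script = ask_client(indicator, operation_data)
    if script is not None:
        # Store the new mapping so the same exception type is handled
        # automatically the next time it occurs.
        cache[exception_type] = script
    return script
```

With this shape, the client is consulted only the first time a given exception type is seen; later occurrences are served from the cache.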

According to another aspect of the embodiments of the present application, a node state detection method is provided. The method is applied to an agent terminal, where an agent terminal is deployed on each node in a target cluster, and includes:

periodically detecting the node according to the detection strategy corresponding to each detection indicator, to obtain the operation data corresponding to each detection indicator of the node;

determining, based on the operation data, the state information corresponding to the detection indicator;

determining a detection indicator whose state information is an abnormal state as a target detection indicator;

sending the operation data corresponding to the target detection indicator to a controller, so that the controller performs an exception handling operation on the target node according to the operation data.

Further, when the detection indicator is a network indicator, periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, includes:

determining the network detection strategy corresponding to the network indicator;

using the network detection strategy to detect the network parameters of the management network, the service network, and the storage network to which the node belongs;

determining the network parameters of the management network, the service network, and the storage network as the operation data corresponding to the network indicator.

Further, when the detection indicator is a component indicator, periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, includes:

determining the component detection strategy corresponding to the component indicator;

using the component detection strategy to query the log file of the node, and collecting the operation data of the components on the node from the log file.

According to another aspect of the embodiments of the present application, a node abnormality processing device is also provided, including:

a receiving module, configured to receive a target detection indicator, and the operation data corresponding to the target detection indicator, sent by an agent terminal deployed on a target node in a target cluster, where the target node is a node whose target detection indicator is in an abnormal state;

a determining module, configured to determine, according to the operation data, the target exception type corresponding to the target detection indicator;

a query module, configured to query for the target exception handling script corresponding to the target exception type;

an execution module, configured to execute the target exception handling script, if it exists, to restore the target detection indicator of the target node to a normal state.

According to another aspect of the embodiments of the present application, a node state detection device is also provided, including:

a detection module, configured to periodically detect the node according to the detection strategy corresponding to each detection indicator, to obtain the operation data corresponding to each detection indicator of the node;

a determining module, configured to determine, based on the operation data, the state information corresponding to the detection indicator;

a processing module, configured to determine a detection indicator whose state information is an abnormal state as a target detection indicator;

a sending module, configured to send the operation data corresponding to the target detection indicator to a controller, so that the controller performs an exception handling operation on the target node according to the operation data.

According to another aspect of the embodiments of the present application, a storage medium is also provided. The storage medium includes a stored program, and the above steps are executed when the program runs.

According to another aspect of the embodiments of the present application, an electronic device is also provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus. The memory is used to store a computer program; the processor is used to execute the steps of the above methods by running the program stored in the memory.

The embodiments of the present application also provide a computer program product containing instructions which, when run on a computer, cause the computer to execute the steps of the above methods.

Compared with the prior art, the technical solutions provided by the embodiments of the present application have the following advantage: when a node in the target cluster becomes abnormal, the method automatically obtains and executes an exception handling script according to the node's exception type, so that the node self-heals promptly after the abnormality occurs and the target cluster remains highly available.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain its principles.

To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of a node state detection method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a target cluster provided by an embodiment of the present application;

FIG. 3 is a flowchart of a node abnormality processing method provided by an embodiment of the present application;

FIG. 4 is a block diagram of a node state detection device provided by an embodiment of the present application;

FIG. 5 is a block diagram of a node abnormality processing device provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. The exemplary embodiments and their descriptions are used to explain the present application and do not unduly limit it. All other embodiments obtained by those of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another similar entity or operation, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The embodiments of the present application provide a node state detection method, a node abnormality processing method, and corresponding devices. The methods provided by the embodiments can be applied to any electronic device as needed, for example a server or a terminal. No specific limitation is made here, and for convenience such a device is referred to below simply as the electronic device.

According to one aspect of the embodiments of the present application, a method embodiment of a node state detection method is provided. FIG. 1 is a flowchart of a node state detection method provided by an embodiment of the present application. As shown in FIG. 1, the method includes:

Step S11: periodically detect the node according to the detection strategy corresponding to each detection indicator, to obtain the operation data corresponding to each detection indicator of the node.

The method provided by the embodiment of the present application is applied to an agent terminal, and an agent terminal is deployed on each node in the target cluster. As shown in FIG. 2, one agent terminal (agent) is deployed on each node, and the agent terminal monitors the operation data of the node on which it runs. The operation data can be, for example, network parameters, component parameters, memory parameters, and so on.

In the embodiment of the present application, when the detection indicator is a network indicator, periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, includes the following steps A1 to A3:

Step A1: determine the network detection strategy corresponding to the network indicator.

In the embodiment of the present application, the network detection strategy corresponding to the network indicator is to use the Gossip protocol to detect the management network, the service network, and the storage network to which the node belongs. It should be noted that, with the Gossip protocol, nodes can confirm the reachability of each other's networks.

Step A2: use the network detection strategy to detect the network parameters of the management network, the service network, and the storage network to which the node belongs.

Step A3: determine the network parameters of the management network, the service network, and the storage network as the operation data corresponding to the network indicator.

In the embodiment of the present application, the agent terminal first queries the first node corresponding to the management network and then sends a message to the first node based on the Gossip protocol, to determine the first network parameters of the management network to which the agent terminal's node belongs. The agent terminal likewise queries the second node corresponding to the service network and sends a message to the second node based on the Gossip protocol, to determine the second network parameters of the service network to which the agent terminal's node belongs. Finally, the agent terminal queries the third node corresponding to the storage network and sends a message to the third node based on the Gossip protocol, to determine the third network parameters of the storage network to which the agent terminal's node belongs. The network parameters of the management network, the service network, and the storage network are then used as the operation data corresponding to the network indicator. Network parameters may include transmission rate, transmission delay, packet loss rate, and so on.

As an example, the agent terminals communicate with one another using the Gossip protocol. At regular intervals, the agent terminal on each node randomly selects several nodes and sends them a Gossip message, and those nodes in turn randomly select several other nodes and forward the message. After a period of time, every node in the cluster has received the Gossip message. Each node may know all other nodes or only a few neighbor nodes; as long as the nodes are connected through the network, the Consul cluster state learned by all nodes is eventually consistent.
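The propagation described in this example can be illustrated with a small simulation. This is a sketch under stated assumptions, not the agent's actual protocol code: the function `spread`, the `peers` adjacency map, and the `fanout` parameter are hypothetical names, and the simulation models only message dissemination, not the measurement of network parameters.

```python
import random
from typing import Dict, List, Set, Tuple


def spread(
    peers: Dict[str, List[str]],
    origin: str,
    fanout: int,
    max_rounds: int = 100,
    seed: int = 0,
) -> Tuple[Set[str], int]:
    """Simulate push-style gossip starting at origin.

    peers maps every node to the nodes it knows about. In each round, every
    informed node sends the message to up to `fanout` randomly chosen peers.
    Returns the set of informed nodes and the number of rounds that ran.
    """
    rng = random.Random(seed)
    informed: Set[str] = {origin}
    rounds = 0
    while len(informed) < len(peers) and rounds < max_rounds:
        newly: Set[str] = set()
        for node in sorted(informed):
            known = peers[node]
            newly.update(rng.sample(known, min(fanout, len(known))))
        informed |= newly
        rounds += 1
    return informed, rounds


# Six nodes in a ring: each node knows only its two neighbors.
ring = {f"n{i}": [f"n{(i - 1) % 6}", f"n{(i + 1) % 6}"] for i in range(6)}
informed, rounds = spread(ring, origin="n0", fanout=2)
```

In the ring above no node knows the whole cluster, yet the message still reaches every node, which matches the statement that knowing a few neighbors is sufficient as long as the nodes are connected.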

In the embodiment of the present application, when the detection indicator is a component indicator, periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, includes the following steps B1 and B2:

Step B1: determine the component detection strategy corresponding to the component indicator.

Step B2: use the component detection strategy to query the log file of the node, and collect the operation data of the components on the node from the log file.

In the embodiment of the present application, the agent terminal can monitor the operation data of key components in real time by inspecting the system log. It can also receive heartbeat data from the kubelet component at a preset interval; if no heartbeat data from the kubelet component arrives within the preset time, the agent terminal performs a health probe on the kubelet component and obtains its operation data. The operation data of key components includes the number of restarts of the kubelet component, the health information of the kubelet component, and so on.
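The two mechanisms in this paragraph, counting restarts from log lines and probing after a missed heartbeat, can be sketched as follows. The log marker `"kubelet restarted"`, the function `count_restarts`, and the class `HeartbeatWatch` are hypothetical and chosen only for illustration; real kubelet or systemd log lines have a different format.

```python
import time
from typing import Iterable, Optional


def count_restarts(log_lines: Iterable[str], marker: str = "kubelet restarted") -> int:
    """Count log lines containing the (hypothetical) restart marker."""
    return sum(1 for line in log_lines if marker in line)


class HeartbeatWatch:
    """Tracks the last heartbeat and reports when a health probe is due."""

    def __init__(self, timeout_s: float, now: Optional[float] = None) -> None:
        self.timeout_s = timeout_s
        self.last_seen = time.monotonic() if now is None else now

    def beat(self, now: Optional[float] = None) -> None:
        """Record that a heartbeat arrived."""
        self.last_seen = time.monotonic() if now is None else now

    def needs_probe(self, now: Optional[float] = None) -> bool:
        """True when no heartbeat has arrived within the preset time."""
        current = time.monotonic() if now is None else now
        return current - self.last_seen > self.timeout_s
```

The optional `now` argument exists only so the timing logic can be exercised with fixed timestamps; in normal use the monotonic clock is read automatically.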

In the embodiment of the present application, when the detection indicator is a disk indicator, the usage of the system disk is monitored by running the df -h and iostat commands at regular intervals, and values such as the disk capacity of the system disk are determined from that usage. In addition, when the detection indicator is a memory indicator, memory usage can be monitored to determine the memory occupancy rate.
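A disk check of this kind can be sketched in Python. Note the substitution: instead of parsing the output of df -h, the sketch reads the same information through the standard-library call `shutil.disk_usage`, and it does not cover the I/O statistics reported by iostat. The function names and the 0.85 threshold are hypothetical.

```python
import shutil


def used_ratio(path: str) -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total


def is_disk_abnormal(used: int, total: int, max_ratio: float = 0.85) -> bool:
    """True when the used share of the disk exceeds the preset ratio.

    max_ratio is an illustrative threshold, not a value from the source.
    """
    return used / total > max_ratio
```

An agent could call `used_ratio` on the system disk's mount point at each detection interval and report the result as the operation data of the disk indicator.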

Step S12: determine, based on the operation data, the state information corresponding to the detection indicator.

In the embodiment of the present application, the operation data can be compared with preset operation data. The preset operation data is the value range of each detection indicator of the node in the normal state, or an upper or lower limit set in advance by staff. If the operation data matches the preset operation data, the state information is determined to be a normal state. If the operation data does not match the preset operation data, the state information is determined to be an abnormal state.

As an example, if the capacity in use on the system disk is greater than the preset capacity, the state information of the system disk is determined to be an abnormal state. If the transmission rate of the management network is lower than the preset transmission rate, the management network to which the node belongs is determined to be unreachable, and the state information of the management network is an abnormal state. If the number of restarts of a component is greater than the preset number, or its health information does not match the preset health information, the state information of the component is determined to be an abnormal state.
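The comparison against preset limits can be sketched as follows. The `Limits` type, the `state_of` function, and the numeric limits in the usage lines are hypothetical; the source only states that each indicator has a normal range or an upper or lower limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Preset bounds for one detection indicator; None means unbounded."""
    lower: Optional[float] = None
    upper: Optional[float] = None


def state_of(value: float, limits: Limits) -> str:
    """Return "normal" if value lies within the preset bounds, else "abnormal"."""
    if limits.lower is not None and value < limits.lower:
        return "abnormal"
    if limits.upper is not None and value > limits.upper:
        return "abnormal"
    return "normal"


# Illustrative limits only: restart count has an upper limit,
# transmission rate has a lower limit.
restart_state = state_of(5, Limits(upper=3))
rate_state = state_of(120.0, Limits(lower=100.0))
```

Using one small bounds type for every indicator keeps the three cases in the example (disk capacity, transmission rate, restart count) on the same code path.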

Step S13: determine a detection indicator whose state information is an abnormal state as a target detection indicator.

In the embodiment of the present application, after determining the state information of each detection indicator from the operation data, the agent terminal collects the state information of all detection indicators and determines each detection indicator whose state information is an abnormal state as a target detection indicator.

Step S14: send the operation data corresponding to the target detection indicator to the controller, so that the controller performs an exception handling operation on the target node according to the operation data.

In the embodiment of the present application, after determining the target detection indicator, the agent terminal sends the target detection indicator and the operation data of its node to the controller. The controller determines the cause of the abnormality of the target detection indicator from the operation data and handles the node according to that cause, so that the target detection indicator of the node returns to a normal state.
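Steps S13 and S14 can be sketched as a function that selects the abnormal indicators and builds the report for the controller. The JSON layout, the field names, and the function `build_report` are assumptions made for illustration; the source does not specify a message format or transport.

```python
import json
from typing import Dict, Optional


def build_report(
    node: str,
    states: Dict[str, str],
    operation_data: Dict[str, dict],
) -> Optional[str]:
    """Build the agent's report, or return None when every indicator is normal.

    states maps indicator name -> "normal" or "abnormal";
    operation_data maps indicator name -> the data collected for it.
    """
    targets = sorted(name for name, state in states.items() if state == "abnormal")
    if not targets:
        return None
    payload = {
        "node": node,
        "targets": [
            {"indicator": name, "operation_data": operation_data[name]}
            for name in targets
        ],
    }
    return json.dumps(payload, sort_keys=True)
```

Returning None when nothing is abnormal reflects the description above, in which only indicators in an abnormal state become target detection indicators and are reported.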

According to another aspect of the embodiments of the present application, a method for processing a node exception is also provided. FIG. 3 is a flowchart of a method for processing a node exception provided by an embodiment of the present application. As shown in FIG. 3, the method may include the following steps:

Step S21: receive the target detection indicator sent by the agent terminal deployed on a target node in the target cluster, together with the operation data corresponding to the target detection indicator, where the target node is a node whose target detection indicator is in an abnormal state.

The method provided by this embodiment of the present application is applied to a controller deployed on the master node of the target cluster. The controller receives the target detection indicators sent by the agent terminals deployed in the target cluster, together with the operation data corresponding to those indicators, where each agent terminal corresponds to one target node.

It should be noted that the target cluster contains multiple nodes, and one agent terminal is deployed on each node to monitor the node where it is located. If one of the node's detection indicators becomes abnormal, the agent terminal determines that indicator as the target detection indicator. Therefore, after receiving a target detection indicator from an agent terminal, the controller determines the node where that agent terminal is located as the target node.

Step S22: determine the target exception type corresponding to the target detection indicator according to the operation data.

In this embodiment of the present application, the controller stores the exception classification conditions corresponding to each detection indicator. For example, when the operation data is the number of restarts of a component, the target exception type is frequent component restart. When the operation data is the health information of a component, the target exception type is a Kubelet exception or a Docker exception. When the operation data is the network parameters of any of the management network, the service network, and the storage network, the network parameters can be used to determine that the target exception type of the corresponding network or networks is network unreachable.
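The classification conditions above could be expressed as a small rule function. The sketch below is illustrative only: the indicator names, the shape of the operation data, the returned type strings, and the restart threshold are assumptions, since the source does not specify them.

```python
from typing import Any, Optional

RESTART_THRESHOLD = 3  # hypothetical cutoff for "frequent" restarts


def classify(indicator: str, data: Any) -> Optional[str]:
    """Map an indicator and its operation data to an exception type, or None."""
    if indicator == "component_restarts":
        # data: number of restarts observed for a component
        return "frequent_component_restart" if data > RESTART_THRESHOLD else None
    if indicator == "component_health":
        # data: dict of component name -> healthy flag
        if not data.get("kubelet", True):
            return "kubelet_exception"
        if not data.get("docker", True):
            return "docker_exception"
        return None
    if indicator == "network":
        # data: dict of network name -> reachable flag
        down = sorted(name for name, ok in data.items() if not ok)
        return "network_unreachable:" + ",".join(down) if down else None
    return None
```

For example, `classify("network", {"management": True, "service": False})` yields a type string naming only the service network.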

Step S23: query the target exception handling script corresponding to the target exception type.

In this embodiment of the present application, step S23, querying the exception handling script corresponding to the target exception type, includes the following steps C1-C2:

Step C1: read the mapping relationship between exception types and exception handling scripts from the cache of the controller.

In this embodiment of the present application, the cache of the controller stores a mapping relationship between exception types and exception handling scripts, where one exception type may correspond to at least one exception handling script. Specifically, the controller may send the exception type to the target client, the target client returns an exception handling script to the controller according to the exception type, and the controller then establishes the mapping relationship between the exception handling script and the exception type.

It should be noted that an exception handling script may be produced as follows: after the exception is captured through the target client, the user resolves it, then writes the resolution process as a script through the target client and imports the script, in the required format, into the entry file to generate the final exception handling script.

Step C2: based on the mapping relationship, obtain the target exception handling script corresponding to the target exception type.

In this embodiment of the present application, during the query the controller obtains the target exception handling script corresponding to the target exception type from the established mapping relationship. By establishing the mapping relationship between exception types and exception handling scripts in advance, the present application can, when a node in the target cluster becomes abnormal, automatically obtain the corresponding exception handling script according to the exception type of that node. Manual resolution is no longer required, which helps keep the target cluster continuously in a highly available state.
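Steps C1-C2 amount to a keyed lookup in which one exception type may hold several scripts. The following is a minimal in-memory sketch under that assumption; the class name `ScriptCache`, its method names, and the use of script paths as values are hypothetical and not taken from the source.

```python
from typing import Dict, List


class ScriptCache:
    """In-memory mapping from exception type to its handling scripts."""

    def __init__(self) -> None:
        self._mapping: Dict[str, List[str]] = {}

    def register(self, exception_type: str, script_path: str) -> None:
        # One exception type may correspond to more than one script.
        self._mapping.setdefault(exception_type, []).append(script_path)

    def lookup(self, exception_type: str) -> List[str]:
        # Return a copy so callers cannot mutate the cached list.
        return list(self._mapping.get(exception_type, []))
```

An empty list from `lookup` corresponds to the case in which no target exception handling script exists.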

Step S24: if the target exception handling script exists, execute the target exception handling script to restore the target detection indicator of the target node to a normal state.

In this embodiment of the present application, when the target exception handling script exists, the controller executes it so that the target detection indicator of the target node returns to normal, thereby achieving node self-healing.

In this embodiment of the present application, when no exception handling script exists in step S24, the method further includes the following steps D1-D3:

Step D1: send the target detection indicator and the operation data corresponding to the target detection indicator to the target client.

Step D2: receive the target exception handling script returned by the target client based on the target detection indicator and the operation data.

Step D3: establish a mapping relationship between the target exception handling script and the target exception type, and store the mapping relationship in the cache of the controller.

In this embodiment of the present application, when no target exception handling script exists, the controller sends the target detection indicator and its corresponding operation data to the target client. The target client displays the target detection indicator and the operation data so that the user can write a corresponding target exception handling script based on them. After writing the script, the user can send it through the target client to the controller in the target cluster.

In this embodiment of the present application, after receiving the target exception handling script, the controller executes it, thereby restoring the node's target detection indicator to a normal state. The controller also stores the target exception handling script, establishes the mapping relationship between the target exception handling script and the target exception type, and stores the mapping relationship in its cache.

With the method provided by this embodiment of the present application, when no exception handling script exists in the controller, the controller automatically sends the target detection indicator and the operation data to the target client. The user writes the target exception handling script based on them and returns it to the controller, which executes and stores it. This ensures that when the same exception later occurs on a node, the exception handling operation is performed automatically.
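The combined flow of step S24 and steps D1-D3 can be sketched as a single controller routine. This is an illustration under stated assumptions: the cache is modeled as a plain dictionary, and `run_script` and `request_script` are hypothetical callables standing in for script execution and the round trip to the target client, neither of which the source specifies.

```python
from typing import Callable, Dict, List


def handle_exception(
    exception_type: str,
    indicator: str,
    operation_data: object,
    cache: Dict[str, List[str]],
    run_script: Callable[[str], None],
    request_script: Callable[[str, object], str],
) -> str:
    """Run cached scripts if present; otherwise obtain one from the client."""
    scripts = cache.get(exception_type)
    if scripts:
        # S24: a mapping exists, so execute every script registered for the type.
        for script in scripts:
            run_script(script)
        return "healed_from_cache"
    # D1 + D2: send the indicator and data to the client, receive a script back.
    script = request_script(indicator, operation_data)
    run_script(script)
    # D3: remember the script so the same exception is handled automatically later.
    cache.setdefault(exception_type, []).append(script)
    return "healed_after_client_feedback"
```

On the second occurrence of the same exception type, the routine no longer contacts the client, which mirrors the automatic handling the paragraph above describes.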

FIG. 4 is a block diagram of an apparatus for detecting a node state provided by an embodiment of the present application. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of the two. As shown in FIG. 4, the apparatus includes:

a detection module 31, configured to periodically detect the node according to the detection strategy corresponding to each detection indicator, to obtain the operation data corresponding to each detection indicator of the node;

a determination module 32, configured to determine the state information corresponding to each detection indicator based on the operation data;

a processing module 33, configured to determine a detection indicator whose state information indicates an abnormal state as the target detection indicator;

a sending module 34, configured to send the operation data corresponding to the target detection indicator to the controller, so that the controller performs an exception handling operation on the target node according to the operation data.

In this embodiment of the present application, when the detection indicator is a network indicator, the detection module 31 is configured to: determine the network detection strategy corresponding to the network indicator; use the network detection strategy to detect the network parameters of the management network, the service network, and the storage network where the node is located; and determine those network parameters as the operation data corresponding to the network indicator.

In this embodiment of the present application, when the detection indicator is a component indicator, the detection module 31 is configured to: determine the component detection strategy corresponding to the component indicator; and use the component detection strategy to query the log file of the node and compile, from the log file, the operation data of the components on the node.
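As an illustration of compiling component operation data from a log file, the sketch below counts restart events per component. The log line format (`component=<name> event=restart`) is purely hypothetical; the source does not describe the actual log layout, so the regular expression would need to be adapted to real logs.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical log format: "<timestamp> component=<name> event=<event>"
RESTART_LINE = re.compile(r"component=(?P<name>\S+)\s+event=restart\b")


def count_restarts(log_lines: Iterable[str]) -> Dict[str, int]:
    """Count restart events per component across the given log lines."""
    counts: Counter = Counter()
    for line in log_lines:
        match = RESTART_LINE.search(line)
        if match:
            counts[match.group("name")] += 1
    return dict(counts)
```

The resulting per-component counts are one possible form of the "number of restarts" operation data mentioned for the frequent-restart exception type.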

FIG. 5 is a block diagram of an apparatus for processing a node exception provided by an embodiment of the present application. The apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of the two. As shown in FIG. 5, the apparatus includes:

a receiving module 41, configured to receive the target detection indicator sent by the agent terminal deployed on a target node in the target cluster, together with the operation data corresponding to the target detection indicator, where the target node is a node whose target detection indicator is in an abnormal state;

a determining module 42, configured to determine the target exception type corresponding to the target detection indicator according to the operation data;

a query module 43, configured to query the target exception handling script corresponding to the target exception type;

an execution module 44, configured to execute the target exception handling script, when it exists, to restore the target detection indicator of the target node to a normal state.

In this embodiment of the present application, the query module 43 is configured to: read the mapping relationship between exception types and exception handling scripts from the cache of the controller; and, based on the mapping relationship, obtain the target exception handling script corresponding to the target exception type.

In this embodiment of the present application, for the case in which no exception handling script exists, the apparatus for processing a node exception further includes an establishing module, configured to: send the target detection indicator and its corresponding operation data to the target client; receive the target exception handling script returned by the target client based on the target detection indicator and the operation data; and establish a mapping relationship between the target exception handling script and the target exception type and store the mapping relationship in the cache of the controller.

An embodiment of the present application also provides an electronic device. As shown in FIG. 5, the electronic device may include a processor 1501, a communication interface 1502, a memory 1503, and a communication bus 1504, where the processor 1501, the communication interface 1502, and the memory 1503 communicate with one another through the communication bus 1504.

The memory 1503 is configured to store a computer program.

The processor 1501 is configured to implement the steps of the foregoing embodiments when executing the computer program stored in the memory 1503.

The communication bus of the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the above terminal and other devices.

The memory may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment provided by the present application, a computer-readable storage medium is also provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method for detecting a node state described in any of the foregoing embodiments.

In yet another embodiment provided by the present application, a computer program product containing instructions is also provided. When run on a computer, it causes the computer to execute the method for detecting a node state described in any of the foregoing embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a Solid State Disk).

The above descriptions are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application falls within its scope of protection.

The above descriptions are only specific embodiments of the present application, provided so that those skilled in the art can understand or implement it. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features claimed herein.

Claims (10)

1. A method for processing a node exception, characterized in that the method is applied to a controller deployed on a master node in a target cluster, the method comprising:
receiving a target detection indicator sent by an agent terminal deployed on a target node in the target cluster, and operation data corresponding to the target detection indicator, wherein the target node is a node whose target detection indicator is in an abnormal state;
determining a target exception type corresponding to the target detection indicator according to the operation data;
querying a target exception handling script corresponding to the target exception type;
in a case where the target exception handling script exists, executing the target exception handling script to restore the target detection indicator of the target node to a normal state.

2. The method according to claim 1, characterized in that the querying the exception handling script corresponding to the target exception type comprises:
reading a mapping relationship between exception types and exception handling scripts from a cache of the controller;
obtaining, based on the mapping relationship, the target exception handling script corresponding to the target exception type.

3. The method according to claim 2, characterized in that, in a case where the exception handling script does not exist, the method further comprises:
sending the target detection indicator and the operation data corresponding to the target detection indicator to a target client;
receiving a target exception handling script returned by the target client based on the target detection indicator and the operation data;
establishing a mapping relationship between the target exception handling script and the target exception type, and storing the mapping relationship in the cache of the controller.

4. A method for detecting a node state, characterized in that the method is applied to an agent terminal, the agent terminal being deployed on each node in a target cluster, the method comprising:
periodically detecting a node according to a detection strategy corresponding to a detection indicator, to obtain operation data corresponding to each detection indicator of the node;
determining state information corresponding to the detection indicator based on the operation data;
determining a detection indicator whose state information indicates an abnormal state as a target detection indicator;
sending the operation data corresponding to the target detection indicator to a controller, so that the controller performs an exception handling operation on the target node according to the operation data.

5. The method according to claim 4, characterized in that, in a case where the detection indicator is a network indicator, the periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, comprises:
determining a network detection strategy corresponding to the network indicator;
using the network detection strategy to detect network parameters respectively corresponding to a management network, a service network, and a storage network where the node is located;
determining the network parameters respectively corresponding to the management network, the service network, and the storage network as the operation data corresponding to the network indicator.

6. The method according to claim 4, characterized in that, in a case where the detection indicator is a component indicator, the periodically detecting the node according to the detection strategy corresponding to the detection indicator, to obtain the operation data corresponding to each detection indicator of the node, comprises:
determining a component detection strategy corresponding to the component indicator;
using the component detection strategy to query a log file of the node, and compiling, from the log file, operation data of components on the node.

7. An apparatus for processing a node exception, characterized by comprising:
a receiving module, configured to receive a target detection indicator sent by an agent terminal deployed on a target node in a target cluster, and operation data corresponding to the target detection indicator, wherein the target node is a node whose target detection indicator is in an abnormal state;
a determining module, configured to determine a target exception type corresponding to the target detection indicator according to the operation data;
a query module, configured to query a target exception handling script corresponding to the target exception type;
an execution module, configured to execute the target exception handling script, in a case where the target exception handling script exists, to restore the target detection indicator of the target node to a normal state.

8. An apparatus for detecting a node state, characterized by comprising:
a detection module, configured to periodically detect the node according to a detection strategy corresponding to a detection indicator, to obtain operation data corresponding to each detection indicator of the node;
a determination module, configured to determine state information corresponding to the detection indicator based on the operation data;
a processing module, configured to determine a detection indicator whose state information indicates an abnormal state as a target detection indicator;
a sending module, configured to send the operation data corresponding to the target detection indicator to a controller, so that the controller performs an exception handling operation on the target node according to the operation data.

9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when run, executes the method steps of any one of claims 1 to 6.

10. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; wherein:
the memory is configured to store a computer program;
the processor is configured to execute the method steps of any one of claims 1 to 6 by running the program stored in the memory.
CN202210111973.7A 2022-01-29 2022-01-29 Node state detection method, node abnormity processing method and device Pending CN114443438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210111973.7A CN114443438A (en) 2022-01-29 2022-01-29 Node state detection method, node abnormity processing method and device

Publications (1)

Publication Number Publication Date
CN114443438A true CN114443438A (en) 2022-05-06

Family

ID=81371202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210111973.7A Pending CN114443438A (en) 2022-01-29 2022-01-29 Node state detection method, node abnormity processing method and device

Country Status (1)

Country Link
CN (1) CN114443438A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108521339A (en) * 2018-03-13 2018-09-11 广州西麦科技股份有限公司 A kind of reaction type node failure processing method and system based on cluster daily record
CN108965049A (en) * 2018-06-28 2018-12-07 深信服科技股份有限公司 Method, equipment, system and the storage medium of cluster exception solution are provided
CN111949551A (en) * 2020-09-01 2020-11-17 网易(杭州)网络有限公司 Application program testing method, device, equipment and storage medium
CN112052111A (en) * 2020-09-08 2020-12-08 中国平安人寿保险股份有限公司 Processing method, device and equipment for server abnormity early warning and storage medium
CN113886122A (en) * 2021-09-30 2022-01-04 济南浪潮数据技术有限公司 System operation exception handling method, device, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349061A (en) * 2023-09-22 2024-01-05 天宇正清科技有限公司 Intelligent interface management method, system, terminal and storage medium
CN117768165A (en) * 2023-12-12 2024-03-26 暗物质(北京)智能科技有限公司 Network anomaly detection method, device, computer equipment and storage medium
CN117768165B (en) * 2023-12-12 2024-09-06 暗物质(北京)智能科技有限公司 Network anomaly detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination