CN106713014B - Monitored host in monitoring system, monitoring system and monitoring method - Google Patents
Monitored host in monitoring system, monitoring system and monitoring method
- Publication number
- CN106713014B (application CN201611088934.0A)
- Authority
- CN
- China
- Prior art keywords
- monitoring
- service
- module
- key
- tool module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0695—Management of faults, events, alarms or notifications the faulty arrangement being the maintenance, administration or management system
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
A monitored host, a monitoring system, and a monitoring method provide the ability to monitor the interaction between a host and an external business system. The monitored host includes a monitoring client, an agent tool module, and a business module. The agent tool module provides a service interface to the business module. The business module records one key associated with each business failure category, together with a fault description parameter set in one-to-one correspondence with the key. When the business module fails to interact with a business system outside the host, it sends the key for that business failure and the values of the fault description parameters for this failure to the agent tool module through the service interface. The agent tool module writes the values of the fault description parameters into the template file corresponding to the key, then generates monitoring information and reports it to the monitoring server. In this way, business failures that are not caused by a fault of the monitored host itself can be monitored.
Description
Technical field
The present application relates to the field of network technologies, and in particular to a monitored host in a monitoring system, a monitoring system, and a monitoring method.
Background
Zabbix is an open-source distributed monitoring system that can monitor data from network devices. As shown in Figure 1, a Zabbix monitoring system includes a service host and several monitored hosts; Figure 1 shows only one monitored host. The service host includes the Zabbix web graphical user interface (GUI), the Zabbix database, and the Zabbix server. In one device monitoring scheme implemented with Zabbix, a Zabbix client and monitoring scripts are installed on the monitored host. Through the Zabbix web GUI, the user adds configuration information such as monitoring items to the Zabbix server, and configures the key of each monitoring item and its corresponding monitoring script in the configuration file of the monitoring client. The Zabbix client synchronizes the monitoring items and other configuration information from the Zabbix server, schedules the corresponding monitoring scripts to collect monitoring data according to that configuration, and reports the collected data to the Zabbix server. The Zabbix server stores the received monitoring data in the Zabbix database, and the user can view the results through the Zabbix web GUI.
A service process runs on the monitored host, and the user interacts through it with a business system outside the host; the business system may be a communication system, a database system, a web service system, and so on. Because the Zabbix system monitors only the monitored host itself, it cannot promptly detect a failure of the link between the monitored host and the business system, or a failure of the business system. For example, the monitored host may be connected to a communication system, such as an operator's short message gateway, through which users of the monitored host send short messages. If the communication link between the monitored host and the short message gateway fails, or the gateway itself fails, users cannot send short messages through the monitored host and their short message service fails. Because the monitored host itself has not failed, the Zabbix system cannot learn of the short message service failure in time and therefore cannot promptly report it to the administrator or the user.
Summary of the invention
Embodiments of the present application provide a monitored host in a monitoring system, a monitoring system, and a monitoring method, to solve the problem that the monitoring system cannot promptly detect a failure of the link between the monitored host and a business system, or a failure of the business system.
The specific technical solutions provided by the embodiments of the present application are as follows. In a first aspect, a monitored host in a monitoring system is provided. The monitored host includes a monitoring client, an agent tool module, and a business module. The agent tool module provides a service interface to the business module. The business module records one key associated with each business failure category and a fault description parameter set in one-to-one correspondence with the key. The agent tool module records the correspondence between keys and template files, where each template file includes the fault description parameter set corresponding to its key. When the business module fails to interact with a business system outside the host, it sends the key for that business failure and the values of the fault description parameters for this failure to the agent tool module through the service interface. The agent tool module writes the values of the fault description parameters into the template file corresponding to the key, then generates monitoring information and reports it to the monitoring server.
In the embodiments of the present invention, an agent tool module is added to the monitored host and provides a service interface to the business module. This defines the fault reporting procedure that the business module follows after a business failure, and allows the monitoring system to monitor business failures that are not caused by a fault of the monitored host. The business module does not need to be coupled to the monitoring system; it only needs to define, for each of its own abnormal business scenarios, a key and all the fault description parameters, in JSON format, that the scenario requires.
In one possible design, after writing the values of the fault description parameters into the template file, the agent tool module generates a value, which is the character string corresponding to the values of the fault description parameters. Accordingly, the monitoring information includes the key corresponding to the category of this business failure and the value.
In another possible design, the agent tool module may report the monitoring information to the monitoring server by invoking a command-line tool of the monitoring client; alternatively, the agent tool module sends the monitoring information to the monitoring client, so that the monitoring client sends it to the monitoring server.
In another possible design, the agent tool module provides a local loopback address to the business module through the service interface, and receives the business failure information that the business module transmits over HTTP.
The fault description parameters may be expressed as a JSON object.
In one possible scenario, by varying the combination of template files and JSON objects, the embodiments of the present invention can customize the content and format of the monitoring information as needed and report the customized monitoring information to the monitoring server, which makes it easier for the system administrator to view detailed information about business exceptions. In addition, because the business module and the agent tool module communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module and therefore does not affect the user's business, which keeps the business safe.
The agent tool module may also execute a flow control policy to limit the reporting frequency of business failures of the same category. The flow control policy includes limiting the reporting frequency of monitoring information corresponding to the same key to no more than a preset value.
The agent tool module is co-located with the monitoring client.
In a second aspect, a monitoring system is provided, including a monitoring client, an agent tool module, and a monitoring server. The monitoring client and the agent tool module run on a monitored host, and the agent tool module provides a service interface to a business module. The agent tool module has the functions of the agent tool module described in the first aspect.
In a third aspect, a monitoring method is provided. Corresponding to the first aspect, the business module, the agent tool module, and the monitoring server perform the functions of the corresponding modules in the first aspect.
In a fourth aspect, another monitored host in a monitoring system is provided. This monitored host has the function of implementing the behavior of the monitored host in the first aspect and any of its possible designs. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In one possible design, the monitored host in the monitoring system includes a transceiver and a processor, where the processor is configured to invoke a set of program code to perform the method described in the second aspect and any of its possible designs.
In a fifth aspect, a computer storage medium is provided for storing the computer software instructions used by the monitored host described in the above aspects, including a program designed to carry out those aspects.
Description of the drawings
Figure 1 is an architecture diagram of a Zabbix monitoring system in the prior art;
Figure 2 is an architecture diagram of the monitoring system in an embodiment of the present application;
Figure 3 is a schematic flowchart of the monitoring method in an embodiment of the present application;
Figure 4 is a schematic diagram of the hardware structure of the monitored host in the monitoring system in an embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the application is further described below with reference to the accompanying drawings.
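Figure 2 is a schematic structural diagram of a monitoring system provided by an embodiment of the present invention. The monitoring system includes a monitored host 11 and a monitoring server 12, and is connected to a business system 14 through a network 13. Specifically, the monitored host includes a monitoring client 111, an agent tool module 112, and a business module 113. The agent tool module 112 provides a service interface to the business module 113, and the business module 113 interacts with the agent tool module 112 through that service interface. The business module 113 interacts with the business system 14 by calling the interfaces of the business system 14.
In one possible scenario, a user logs in to the monitored host 11 and runs an application on it. The application connects to the external business system 14 through the business module 113 and accesses the services that the business system 14 provides. For example, the business system 14 may be a short message system: the business module 113 connects to the short message center of the business system 14 through the network 13 and sends short messages through the short message center.
When the host 11 itself is running normally but a failure of the link between the host and the business system 14, or a failure of the business system 14, prevents the user from using the short message service, the monitoring server needs to learn of the failure promptly. To this end, in the embodiments of the present invention, an agent tool module 112 is added to the host 11 and provides a service interface to the business module 113. When the business module 113 detects a business failure, it reports the business failure information to the agent tool module 112 through the service interface, and the agent tool module 112 in turn reports the monitoring information about the business failure to the monitoring server.
In one possible scenario, the host 11 may be any physical server in a physical server cluster, which may be a cloud computing physical server cluster that provides cloud services to users. In another possible scenario, the host 11 may be a standalone physical server. The monitoring server 12 may run on a standalone physical server.
In one possible design, the business module 113 may be a service process that handles business. For example, the business module 113 may be an NS (notification service) process that communicates with the short message center.
The business module 113 connects to the external business system 14 and accesses the services it provides. When access to a service fails, the business module 113 determines the category of the business failure and the fault description parameter set of the business failure.
Note that the category of a business failure indicates the factor that caused the failure. For example, business failure categories may include link failure, invalid account, business system failure, and so on. A fault description parameter set is a collection of parameters that can accurately describe the cause of the business failure, such as a process identifier, a business system address, and a fault indication. Different business failure categories may correspond to different fault description parameter sets. Those skilled in the art will understand that business failure categories and fault description parameter sets can be defined flexibly for different scenarios; the embodiments of the present invention do not limit the business failure categories to the above examples.
Further, the monitoring system may describe monitoring information in key-value format. In this case, before the monitoring system starts running, the business module 113 may record one key associated with each business failure category, together with the fault description parameter set in one-to-one correspondence with that key, and send the keys and their parameter sets to the agent tool module 112. In one possible design, one key is assigned to each business failure category and uniquely identifies that category. For example, the business failure categories may be connection timeout, link not responding, link port failure, and so on; the corresponding keys may be set freely or according to the definition rules of the task.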
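The one-key-per-category bookkeeping described above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the names `FAILURE_CATEGORIES` and `missing_parameters` are hypothetical, the key `smn-001-001` and its four parameters are taken from the example given for step 301, and the second entry is invented purely to show that different categories may declare different parameter sets.

```python
# Hypothetical registry kept by the business module: one key per business
# failure category, each mapped to its own fault description parameter set.
FAILURE_CATEGORIES = {
    # Key and parameter names taken from the step 301 example in this document.
    "smn-001-001": ("Subject", "ServiceName", "ServiceAddress", "Error"),
    # Invented entry, for illustration only.
    "smn-001-002": ("Subject", "ServiceName", "Account", "Error"),
}


def missing_parameters(key, values):
    """Return the parameters declared for `key` that `values` does not supply."""
    declared = FAILURE_CATEGORIES[key]  # raises KeyError for an unregistered key
    return [name for name in declared if name not in values]


report = {
    "Subject": "Channel Checking",
    "ServiceName": "SMN-NS",
    "ServiceAddress": "127.0.0.1",
}
print(missing_parameters("smn-001-001", report))  # the "Error" value is absent
```

A check of this kind could run before a failure is reported, so that a report never reaches the template step with a parameter missing.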
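The agent tool module 112 records the correspondence between keys and template files, where each template file includes the fault description parameter set corresponding to its key.
In one possible design, the agent tool module 112 provides a RESTful interface to the business module 113, and the request body of the interface may be any data in JSON (JavaScript Object Notation) format. The agent tool module 112 generates the template files, and each fault description parameter in a fault description parameter set is metadata in the template file. One business failure type may correspond to one template file. JSON is a lightweight data-interchange syntax; for details about JSON, see https://www.w3.org/TR/json-ld/.
The business module 113 interacts with the business system 14 through the network 13. When a business operation fails, it sends business failure information to the agent tool module 112 through the service interface. The business failure information includes the key corresponding to the category of this business failure and the values of the fault description parameters in the fault description parameter set. These values can accurately convey information about this business failure, including the business name and the failure reason; specifically, the business name may be represented by a process identifier.
The agent tool module 112 receives the business failure information sent by the business module 113 through the service interface, looks up the template file corresponding to the key according to the recorded correspondence, writes the values of the fault description parameters into the template file, and generates the monitoring information.
In one possible design, the agent tool module 112 reads the JSON data that the business module 113 (the service process) supplies when it calls the RESTful interface, writes the values of the fault description parameters into the template file, and generates the final monitoring information. Specifically, the agent tool module 112 generates a value from the filled-in template file; the value is the character string corresponding to the values of the fault description parameters. Accordingly, the monitoring information includes the key corresponding to the category of this business failure and the value.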
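The template step can be sketched with Python's standard `string.Template`. The patent's own template file is not reproduced in this text, so the template wording below is an assumption modeled on the monitoring information example given for step 307; `TEMPLATES` and `build_monitoring_info` are hypothetical names.

```python
from string import Template

# Hypothetical key -> template mapping held by the agent tool module.
# Each $Name placeholder is one fault description parameter.
TEMPLATES = {
    "smn-001-001": Template(
        "Senior Alert: $Subject\n"
        "Components $ServiceName $ServiceAddress\n"
        "Error: $Error"
    ),
}


def build_monitoring_info(key, params):
    """Fill the template registered for `key` and return the (key, value) pair."""
    template = TEMPLATES[key]
    # substitute() raises KeyError if a placeholder has no value in `params`.
    value = template.substitute(params)
    return key, value


key, value = build_monitoring_info(
    "smn-001-001",
    {
        "Subject": "Channel Checking",
        "ServiceName": "SMN-NS",
        "ServiceAddress": "127.0.0.1",
        "Error": "Can not connect channel",
    },
)
print(value)
```

Changing only the template text changes the layout of the reported value without touching the business module, which is the customization the description attributes to combining template files with JSON objects.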
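The agent tool module 112 reports the generated monitoring information to the monitoring server 12 through the monitoring client 111.
The agent tool module 112 may invoke the command-line tool of the monitoring client 111, synchronously or asynchronously, to report the monitoring information to the monitoring server 12. Alternatively, the agent tool module 112 sends the monitoring information to the monitoring client 111, so that the monitoring client 111 sends it to the monitoring server 12.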
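To improve the monitoring efficiency of the monitoring system and avoid repeated, high-frequency reporting of the same fault, the agent tool module 112 may also provide flow control, limiting how many times the same monitoring information is sent repeatedly. For example, the agent tool module 112 executes a flow control policy that limits the reporting frequency of monitoring information corresponding to the same key to no more than a preset value. Those skilled in the art will understand that the preset value can be set flexibly by the system administrator as required; preferably, the reporting frequency is set according to the importance of the business.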
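One way to realize such a per-key limit is a sliding-window counter. The patent does not prescribe an algorithm, so the following is a minimal sketch under that assumption; the class name `KeyRateLimiter` and its parameters are hypothetical, and the "preset value" is modeled as a maximum number of reports per time window.

```python
import time


class KeyRateLimiter:
    """Allow at most `max_reports` reports per key within `window_seconds`."""

    def __init__(self, max_reports, window_seconds, clock=time.monotonic):
        self._max = max_reports
        self._window = window_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._sent = {}      # key -> timestamps of reports still inside the window

    def allow(self, key):
        now = self._clock()
        # Keep only the timestamps that are still inside the window.
        recent = [t for t in self._sent.get(key, []) if now - t < self._window]
        allowed = len(recent) < self._max
        if allowed:
            recent.append(now)
        self._sent[key] = recent
        return allowed


limiter = KeyRateLimiter(max_reports=2, window_seconds=60)
for attempt in range(3):
    print(attempt, limiter.allow("smn-001-001"))
```

Because the counters are kept per key, a burst of reports for one failure category does not suppress reports for another.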
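In one possible design, the agent tool module 112 may be deployed separately or deployed together with the monitoring client 111.
The agent tool module 112 provides a local loopback address (for example, 127.0.0.1) to the business module 113 through the RESTful service interface, and receives the business failure information that the business module 113 transmits over the HyperText Transfer Protocol (HTTP). For a description of the Representational State Transfer (REST) architecture and RESTful interfaces, see https://zh.wikipedia.org/wiki/REST and https://en.wikipedia.org/wiki/RESTful.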
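A loopback-only HTTP endpoint of this kind can be sketched with Python's standard `http.server` module. The request shape (a JSON object with `key` and `params` fields), the status codes, and the names `FailureReportHandler`, `RECEIVED`, and `make_server` are assumptions made for illustration; the patent only states that JSON is sent over HTTP to a loopback address.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # stand-in for the agent tool module's real processing


class FailureReportHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length))
            RECEIVED.append((body["key"], body["params"]))
            status = 202  # accepted for later reporting
        except (ValueError, KeyError, TypeError):
            status = 400  # body was not the expected JSON object
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    # Binding to 127.0.0.1 makes the interface reachable only from the same host.
    return HTTPServer(("127.0.0.1", port), FailureReportHandler)


if __name__ == "__main__":
    server = make_server(8765)  # port chosen arbitrarily for the sketch
    server.serve_forever()
```

Each request is handled on its own and no session is kept between requests, which matches the stateless HTTP interaction the description relies on.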
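An embodiment of the present application provides a monitored host 11 in a monitoring system, to which an agent tool module 112 (AgentTool) is added. This solves the problem of reporting faults after a business failure in the business module 113. First, the business module 113 does not need to be coupled to the monitoring system; it only needs to define, for each of its own abnormal business scenarios, a key and all the fault description parameters, in JSON format, that the scenario requires. Second, by varying the combination of template files and JSON objects, the content and format of the monitoring information can be customized as needed and the customized monitoring information reported to the monitoring server, which makes it easier for the system administrator to view detailed information about business exceptions. Third, because the business module 113 and the agent tool module 112 communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module and therefore does not affect the user's business, which keeps the business safe.
In the embodiments of the present invention, the monitoring system may be a Zabbix system, and the aforementioned command-line tool may be Zabbix Sender, which can pass key/value parameters.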
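Assuming the Zabbix case, the synchronous and asynchronous invocations can be sketched as follows. The `zabbix_sender` options used here (`-z` server, `-s` host name as registered in Zabbix, `-k` item key, `-o` value) are the tool's standard ones; the function names and the sample addresses are hypothetical, and the sketch only builds the command unless one of the two report functions is called on a machine where `zabbix_sender` is installed.

```python
import subprocess


def sender_command(server, host, key, value, sender="zabbix_sender"):
    """Build the argument list that passes one key/value pair to Zabbix Sender."""
    return [sender, "-z", server, "-s", host, "-k", key, "-o", value]


def report_sync(cmd, timeout=10):
    """Synchronous call: wait for the sender to exit and return its exit code."""
    return subprocess.run(cmd, timeout=timeout).returncode


def report_async(cmd):
    """Asynchronous call: start the sender and return without waiting."""
    return subprocess.Popen(cmd)


cmd = sender_command(
    server="192.0.2.10",   # hypothetical monitoring server address
    host="monitored-host-11",  # hypothetical host name
    key="smn-001-001",
    value="Senior Alert: Channel Checking",
)
print(cmd)
```

Passing the arguments as a list rather than a shell string keeps a multi-line value intact. On the server side, data pushed this way is normally received by an item of type "Zabbix trapper" with the same key.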
Based on the architecture of the monitoring system shown in Figure 2, the monitoring method provided by the embodiments of the present application is described below.
Figure 3 shows a monitoring method provided by an embodiment of the present application.
Step 301: The business module accesses the business system through the network and, according to the abnormal conditions that the business may encounter, determines the key corresponding to each business failure category and the fault description parameter set of the business failure.
The category of a business failure indicates the factor that caused the failure. For example, business failure categories may include link failure, invalid account, business system failure, and so on. A fault description parameter set is a collection of parameters that can accurately describe the cause of the business failure, such as a process identifier, a business system address, and a fault indication. Different business failure categories may correspond to different fault description parameter sets. Each fault description parameter in a fault description parameter set may be in JSON format.
An example is as follows:
Key: smn-001-001
JSON body (fault description parameter set):
{
"Subject": "Channel Checking",
"ServiceName": "SMN-NS",
"ServiceAddress": "127.0.0.1",
"Error": "Error"
}
Step 302: The business module sends the key corresponding to each business failure category, and the fault description parameter set in one-to-one correspondence with each key, to the agent tool module.
In one possible design, the agent tool module provides a RESTful interface to the business module.
Step 303: The agent tool module receives the keys and their corresponding fault description parameter sets from the business module, and records the correspondence between keys and template files, where each template file includes the fault description parameter set corresponding to its key.
The agent tool module generates the template files. Each fault description parameter in a fault description parameter set is a dynamic variable in the template file and represents metadata. One business failure type may correspond to one template file.
Template file example:
Step 304: The system administrator logs in to the monitoring server through the graphical user interface of the monitoring system and creates the keys and monitoring indicators.
In one possible design, the agent tool module may send the keys and fault description parameter sets to the monitoring server, which creates the keys and monitoring indicators. The monitoring indicators may be in text format and are used to present the received monitoring information.
Step 305: The business module accesses the business system through the network. When it detects a business failure, it calls the service interface provided by the agent tool module and sends business failure information to the agent tool module. The business failure information includes the key corresponding to the category of this business failure and the values of the fault description parameters in the fault description parameter set. These values can accurately convey information about this business failure, including the business name and the failure reason.
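Step 305 can be sketched from the business module's side as a best-effort HTTP POST. The URL, the port, the request shape, and the name `report_failure` are assumptions for illustration. The sketch deliberately swallows every error and uses a short timeout, so that an unavailable agent tool module cannot disturb the business flow, in line with the decoupling the description emphasizes.

```python
import json
import urllib.request

# Loopback traffic should never go through a proxy, so proxies are disabled.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def report_failure(key, params, url="http://127.0.0.1:8765/failures", timeout=1.0):
    """Send one failure report; return True on an HTTP 2xx reply, else False."""
    payload = json.dumps({"key": key, "params": params}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with _OPENER.open(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except Exception:
        # Monitoring is secondary: a failed report must not break the business call.
        return False


if __name__ == "__main__":
    ok = report_failure(
        "smn-001-001",
        {
            "Subject": "Channel Checking",
            "ServiceName": "SMN-NS",
            "ServiceAddress": "127.0.0.1",
            "Error": "Can not connect channel",
        },
    )
    print("reported" if ok else "agent tool module unavailable")
```

A business module would call `report_failure` from the error path of its own request to the business system and then continue with its normal error handling, whatever the return value.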
Step 306: The agent tool module receives the business failure information sent by the business module through the service interface, looks up the template file corresponding to the key according to the recorded correspondence, writes the values of the fault description parameters into the template file, and generates the monitoring information.
Step 307: The agent tool module sends the monitoring information to the monitoring server.
Specifically, the agent tool module reads the JSON data in the fault description parameter set and writes the values of the fault description parameters into the template file to generate a value, which is the character string corresponding to the values of the fault description parameters. The agent tool module passes the key corresponding to the business failure type and the generated value as input parameters to the command-line tool of the monitoring client, and invokes that tool to send the monitoring information (that is, the key and the generated value) to the monitoring server.
Specifically, the agent tool module may invoke the command-line tool of the monitoring client synchronously or asynchronously to report the monitoring information to the monitoring server. Alternatively, the agent tool module sends the monitoring information to the monitoring client, so that the monitoring client sends it to the monitoring server.
A JSON-based remote procedure call protocol (JSON-RPC) may be used between the agent tool module and the monitoring server.
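For the JSON-RPC option, the patent names the protocol but gives neither its version nor any method names. The sketch below therefore assumes JSON-RPC 2.0 and uses a hypothetical method name, `monitoring.report`, only to show what a request envelope carrying the key and value could look like.

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests carry a caller-chosen id


def jsonrpc_request(method, params):
    """Serialize one JSON-RPC 2.0 request object."""
    return json.dumps(
        {
            "jsonrpc": "2.0",  # version marker required by JSON-RPC 2.0
            "method": method,
            "params": params,
            "id": next(_request_ids),
        }
    )


message = jsonrpc_request(
    "monitoring.report",  # hypothetical method name
    {"key": "smn-001-001", "value": "Senior Alert: Channel Checking"},
)
print(message)
```

The `id` lets the caller match the server's response to its request; how the server names its methods and authenticates callers depends on the monitoring system actually deployed.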
An example of the monitoring information is as follows:
key: smn-001-001
value:
Senior Alert: Channel Checking
Components SMN-NS 127.0.0.1
Error: Can not connect channel
Step 308: The monitoring server receives the monitoring information and, when it determines that a business failure has occurred, triggers an alarm and notifies the system administrator.
In the embodiments of the present invention, an agent tool module is added to the monitored host and provides a service interface to the business module. This defines the fault reporting procedure that the business module 113 follows after a business failure, and allows the monitoring system to monitor business failures that are not caused by a fault of the monitored host. First, the business module does not need to be coupled to the monitoring system; it only needs to define, for each of its own abnormal business scenarios, a key and all the fault description parameters, in JSON format, that the scenario requires. Second, by varying the combination of template files and JSON objects, the content and format of the monitoring information can be customized as needed and the customized monitoring information reported to the monitoring server, which makes it easier for the system administrator to view detailed information about business exceptions. Third, because the business module and the agent tool module communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module and therefore does not affect the user's business, which keeps the business safe.
Corresponding to the foregoing monitoring system and monitoring method, an embodiment of the present invention provides a monitored host that includes the foregoing monitoring client, agent tool module, and business module. The modules in the monitored host perform the functions described for the foregoing monitoring system and monitoring method, which are not described again here.
Corresponding to the foregoing monitoring system and monitoring method, an embodiment of the present invention provides another monitored host that includes the foregoing monitoring client and agent tool module. The modules in the monitored host perform the functions described for the foregoing monitoring system and monitoring method, which are not described again here. In this case, the business module may be located on another host that has a network connection to the monitored host.
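Based on the same inventive concept, and referring to Figure 4, an embodiment of the present application further provides another monitored host 400 in a monitoring system, including a transceiver 401, a processor 402, and a memory 403. The transceiver 401 and the memory 403 are both connected to the processor 402. Note that the connections between the parts shown in Figure 4 are only one possible example: the transceiver 401 and the memory 403 may both be connected to the processor 402 with no connection between the transceiver 401 and the memory 403, or other connection arrangements are possible.
The memory 403 stores a set of programs, and the processor 402 is configured to invoke the programs stored in the memory 403 to perform the functions of the modules of the monitored host in the monitoring system and monitoring method shown in Figures 2 and 3.
In Figure 4, the processor 402 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 402 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.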
The memory 403 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also include a combination of the above types of memory.
The physical server on which the monitoring server runs may also use the hardware structure shown in Figure 4, which is not described again in the embodiments of the present invention.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This computer program code may be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner.
Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from their spirit and scope. If these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.
Claims (17)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611088934.0A CN106713014B (en) | 2016-11-30 | 2016-11-30 | Monitored host in monitoring system, monitoring system and monitoring method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611088934.0A CN106713014B (en) | 2016-11-30 | 2016-11-30 | Monitored host in monitoring system, monitoring system and monitoring method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106713014A (en) | 2017-05-24 |
CN106713014B (en) | 2020-01-10 |
Family
ID=58934415
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611088934.0A Active CN106713014B (en) | 2016-11-30 | 2016-11-30 | Monitored host in monitoring system, monitoring system and monitoring method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106713014B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110752939B (en) * | 2018-07-24 | 2022-09-16 | 成都华为技术有限公司 | Service process fault processing method, notification method and device |
CN112769622A (en) * | 2021-01-18 | 2021-05-07 | 孙冬英 | Cluster service fault early warning system based on RPC service monitoring |
CN114968732A (en) * | 2022-04-06 | 2022-08-30 | 亿玛创新网络(天津)有限公司 | Automatic generation method and device of monitoring graph, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1848764A (en) * | 2005-12-22 | 2006-10-18 | 华为技术有限公司 | A remote management and maintenance system for server and network equipment and its implementation method |
CN103003802A (en) * | 2010-07-15 | 2013-03-27 | 思科技术公司 | Monitoring of systems on the path |
CN103176892A (en) * | 2011-12-20 | 2013-06-26 | 阿里巴巴集团控股有限公司 | Page monitoring method and system |
CN105915405A (en) * | 2016-03-29 | 2016-08-31 | 深圳市中博科创信息技术有限公司 | Large-scale cluster node performance monitoring system |
Also Published As
Publication number | Publication date |
---|---|
CN106713014A (en) | 2017-05-24 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110036600B (en) | Network health data aggregation service | |
CN110740072B (en) | Fault detection method, device and related equipment | |
US10333816B2 (en) | Key network entity detection | |
US10097433B2 (en) | Dynamic configuration of entity polling using network topology and entity status | |
US10356041B2 (en) | Systems and methods for centralized domain name system administration | |
CN103605722B (en) | Database monitoring method and device, equipment | |
US10542071B1 (en) | Event driven health checks for non-HTTP applications | |
US11799892B2 (en) | Methods for public cloud database activity monitoring and devices thereof | |
WO2021213171A1 (en) | Server switching method and apparatus, management node and storage medium | |
CN107241211A (en) | Improve the method and system of relevance between data center's overlay network and bottom-layer network | |
US10931513B2 (en) | Event-triggered distributed data collection in a distributed transaction monitoring system | |
US11522765B2 (en) | Auto discovery of network proxies | |
US10462031B1 (en) | Network visibility for cotenant processes | |
US11635972B2 (en) | Multi-tenant java agent instrumentation system | |
CN106713014B (en) | Monitored host in monitoring system, monitoring system and monitoring method | |
US10805144B1 (en) | Monitoring interactions between entities in a network by an agent for particular types of interactions and indexing and establishing relationships of the components of each interaction | |
CN111314443A (en) | Node processing method, device and device and medium based on distributed storage system | |
US11675647B2 (en) | Determining root-cause of failures based on machine-generated textual data | |
WO2017190339A1 (en) | Fault processing method and device | |
US9935867B2 (en) | Diagnostic service for devices that employ a device agent | |
CN106209456B (en) | A kind of kernel state lower network fault detection method and device | |
CN116016266A (en) | Health check implementation method and device based on API gateway | |
US9172596B2 (en) | Cross-network listening agent for network entity monitoring | |
CN107682185A (en) | MANO management methods and device | |
US12314119B2 (en) | Distributed hardware and software component monitoring | |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2022-02-22. Patentee after: Huawei Cloud Computing Technologies Co.,Ltd. (address: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province). Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. (address: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen). |
TR01 | Transfer of patent right | Effective date of registration: 2022-12-06. Patentee after: Shenzhen Huawei Cloud Computing Technology Co.,Ltd. (address: 518129 Huawei Headquarters Office Building 101, Wankecheng Community, Bantian Street, Longgang District, Shenzhen, Guangdong). Patentee before: Huawei Cloud Computing Technologies Co.,Ltd. (address: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province). |