CN106713014B - Monitored host in monitoring system, monitoring system and monitoring method

Info

Publication number: CN106713014B
Application number: CN201611088934.0A
Authority: CN
Inventors: 唐德平
Original assignee: Huawei Technologies Co Ltd
Current assignee: Shenzhen Huawei Cloud Computing Technology Co ltd
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2020-01-10
Anticipated expiration: 2036-11-30
Also published as: CN106713014A

Abstract

A monitored host, a monitoring system, and a monitoring method are provided, giving the monitoring system the ability to monitor interactions between the monitored host and an external business system. The monitored host includes a monitoring client, an agent tool module, and a business module. The agent tool module provides a service interface to the business module. The business module records a key associated with each business failure category and a fault description parameter set in one-to-one correspondence with the key. When the business module fails to interact with a business system outside the host, the business module sends the key corresponding to the business failure, together with the values of the fault description parameters for this failure, to the agent tool module through the service interface. The agent tool module writes the value of each fault description parameter into the template file corresponding to the key, then generates monitoring information and reports it to the monitoring server. In this way, business failures caused by faults outside the monitored host can be monitored.

Description

A monitored host in a monitoring system, a monitoring system and a monitoring method

Technical Field

The present application relates to the field of network technologies, and in particular, to a monitored host in a monitoring system, a monitoring system, and a monitoring method.

Background

Zabbix is an open-source distributed monitoring system that can monitor data from network devices. As shown in FIG. 1, a Zabbix monitoring system includes a server host and several monitored hosts; FIG. 1 shows only one monitored host. The server host includes the Zabbix web graphical user interface (GUI), the Zabbix database, and the Zabbix server. In one device monitoring solution implemented with Zabbix, a Zabbix client and monitoring scripts are installed on the monitored host. Through the Zabbix web GUI, the user adds configuration information such as monitoring items to the Zabbix server, and configures the key of each monitoring item and its corresponding monitoring script in the configuration file of the monitoring client. The Zabbix client synchronizes the configuration information, such as monitoring items, from the Zabbix server, schedules the corresponding monitoring scripts according to that configuration to collect monitoring data, and reports the collected data to the Zabbix server. The Zabbix server stores the received monitoring data in the Zabbix database, and the user can view the results through the Zabbix web GUI.

A service process runs on the monitored host, and the user interacts through it with a business system outside the host. The business system may be a communication system, a database system, a web service system, and so on. Because the Zabbix system monitors only the monitored host itself, it cannot detect a fault in time when the link between the monitored host and the business system fails or when the business system itself fails. For example, the monitored host may be connected to a communication system, such as an operator's short message gateway, through which users of the monitored host send short messages. If the communication link between the monitored host and the short message gateway fails, or the gateway itself fails, users cannot send short messages through the monitored host, and their short message service fails. Because the monitored host itself has not failed, the Zabbix system cannot learn of the short message service failure in time, and therefore cannot report it to the administrator or the user in time.

Summary of the Invention

Embodiments of the present application provide a monitored host in a monitoring system, a monitoring system, and a monitoring method, to solve the problem that the monitoring system cannot detect a fault in time when the link between the monitored host and a business system fails or when the business system fails.

The specific technical solutions provided by the embodiments of the present application are as follows. In a first aspect, a monitored host in a monitoring system is provided. The monitored host includes a monitoring client, an agent tool module, and a business module. The agent tool module provides a service interface to the business module. The business module records a key associated with each business failure category, and a fault description parameter set in one-to-one correspondence with the key. The agent tool module records the correspondence between each key and a template file, where the template file includes the fault description parameter set corresponding to the key. When the business module fails to interact with a business system outside the host, the business module sends the key corresponding to the business failure, together with the values of the fault description parameters for this failure, to the agent tool module through the service interface. The agent tool module writes the values of the fault description parameters into the template file corresponding to the key, then generates monitoring information and reports it to the monitoring server.

By adding an agent tool module to the monitored host, with the agent tool module providing a service interface to the business module, the embodiments of the present invention define the fault reporting procedure that the business module follows after a business failure, and enable the monitoring system to monitor business failures caused by faults outside the monitored host. The business module does not need to be coupled with the monitoring system. It only needs to define, according to its own business exception scenarios, a key and all the fault description parameters, in JSON format, that each scenario requires.

In a possible design, after writing the values of the fault description parameters into the template file, the agent tool module generates a value, where the value is a character string corresponding to the values of the fault description parameters. Correspondingly, the monitoring information includes the key corresponding to the category of this business failure and the value.

In another possible design, the agent tool module may report the monitoring information to the monitoring server by invoking a command line tool of the monitoring client. Alternatively, the agent tool module sends the monitoring information to the monitoring client, so that the monitoring client sends it to the monitoring server.

In another possible design, the agent tool module provides a local loopback address to the business module through the service interface, and receives the business failure information that the business module transmits over HTTP.

The fault description parameters may be expressed as a JSON object.

In a possible scenario, by varying the combination of template files and JSON objects, the embodiments of the present invention allow the content and format of the monitoring information to be customized as needed and reported to the monitoring server, which makes it easy for the system administrator to view detailed business exceptions. In addition, because the business module and the agent tool module communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module, and therefore does not affect the user's business, which keeps the business safe.

The agent tool module may also execute a flow control policy to limit how often business failures of the same category are reported. The flow control policy includes limiting the reporting frequency of monitoring information corresponding to the same key to no more than a preset value.

The agent tool module is co-located with the monitoring client.

In a second aspect, a monitoring system is provided, including a monitoring client, an agent tool module, and a monitoring server. The monitoring client and the agent tool module run on a monitored host, and the agent tool module provides a service interface to a business module. The agent tool module has the functions of the agent tool module described in the first aspect.

In a third aspect, a monitoring method is provided. Corresponding to the first aspect, the business module, the agent tool module, and the monitoring server perform the functions of the corresponding modules in the first aspect.

In a fourth aspect, another monitored host in a monitoring system is provided. The monitored host has the function of implementing the behavior of the monitored host in the first aspect and any of its possible designs. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function.

In a possible design, the monitored host in the monitoring system includes a transceiver and a processor, where the processor is configured to invoke a set of program code to perform the method described in the second aspect and any of its possible designs.

In a fifth aspect, a computer storage medium is provided for storing computer software instructions used by the monitored host described in the foregoing aspects, including a program designed to carry out the foregoing aspects.

Brief Description of the Drawings

FIG. 1 is an architecture diagram of a Zabbix monitoring system in the prior art;

FIG. 2 is an architecture diagram of a monitoring system in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a monitoring method in an embodiment of the present application;

FIG. 4 is a schematic diagram of the hardware structure of a monitored host in a monitoring system in an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described below with reference to the accompanying drawings.

FIG. 2 is a schematic structural diagram of a monitoring system provided by an embodiment of the present invention. The monitoring system includes a monitored host 11 and a monitoring server 12, and is connected to a business system 14 through a network 13. Specifically, the monitored host includes a monitoring client 111, an agent tool module 112, and a business module 113. The agent tool module 112 provides a service interface to the business module 113, the business module 113 interacts with the agent tool module 112 through the service interface, and the business module 113 interacts with the business system 14 by calling an interface of the business system 14.

In a possible scenario, a user logs in to the monitored host 11 and runs an application on the host 11. The application connects to the external business system 14 through the business module 113 and accesses the services that the business system 14 provides. For example, the business system 14 may be a short message system: the business module 113 connects to the short message center of the business system 14 through the network 13 and sends short messages through the short message center.

When the host 11 itself is running normally but the link between the host and the business system 14 fails, or the business system 14 fails, so that the user cannot use the short message service normally, the monitoring server needs to learn of the fault in time. To achieve this, in the embodiments of the present invention an agent tool module 112 is added to the host 11. The agent tool module 112 provides a service interface to the business module 113. When the business module 113 detects a business failure, it reports the business failure information to the agent tool module 112 through the service interface, and the agent tool module 112 in turn reports the monitoring information for the business failure to the monitoring server.

In a possible scenario, the host 11 may be any physical server in a physical server cluster, and the cluster may be a cloud computing physical server cluster that provides cloud services to users. In another possible scenario, the host 11 may be a standalone physical server. The monitoring server 12 may run on a standalone physical server.

In a possible design, the business module 113 may be a service process that handles the business. For example, the business module 113 may be an NS (notification service) process that communicates with the short message center.

The business module 113 connects to the external business system 14 and accesses the services that the business system 14 provides. When a service access fails, the business module 113 determines the category of the business failure and the fault description parameter set for the failure.

Note that the category of a business failure indicates the factor that caused the failure. For example, business failure categories may include link failure, invalid account, business system failure, and so on. A fault description parameter set is a collection of parameters that can accurately describe the cause of the business failure, such as a process identifier, a business system address, and a fault indication. Different business failure categories may correspond to different fault description parameter sets. Those skilled in the art will understand that the business failure categories and fault description parameter sets can be defined flexibly for different scenarios, and the embodiments of the present invention do not limit the business failure categories to the above examples.

Further, the monitoring system may describe monitoring information in a key-value format. In this case, before the monitoring system starts running, the business module 113 may record a key associated with each business failure category, and a fault description parameter set in one-to-one correspondence with the key, and send each key and its fault description parameter set to the agent tool module 112. In a possible design, one key is assigned to each business failure category, and the key uniquely identifies that category. For example, the business failure categories may be connection timeout, link not responding, link port failure, and so on. The corresponding keys may be set freely, or set according to the definition rules of the task.
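The one-to-one relationship between failure categories, keys, and parameter sets can be sketched as a small registry. This is only an illustration under stated assumptions: the class name `FailureCategory`, the helper `build_registry`, the key `smn-001-002`, and the parameter name `TimeoutSeconds` are hypothetical, while the key `smn-001-001` and the other four parameter names follow the example given later in this description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FailureCategory:
    """One business failure category and its fault description parameter set."""
    key: str
    description: str
    parameters: Tuple[str, ...]


def build_registry(categories: Iterable[FailureCategory]) -> Dict[str, FailureCategory]:
    """Index categories by key, rejecting duplicates so each key stays unique."""
    registry: Dict[str, FailureCategory] = {}
    for category in categories:
        if category.key in registry:
            raise ValueError(f"duplicate key: {category.key}")
        registry[category.key] = category
    return registry


# The pairing of keys with categories below is an illustrative assumption.
REGISTRY = build_registry([
    FailureCategory("smn-001-001", "link failure",
                    ("Subject", "ServiceName", "ServiceAddress", "Error")),
    FailureCategory("smn-001-002", "connection timeout",
                    ("Subject", "ServiceName", "ServiceAddress", "TimeoutSeconds")),
])
```

Rejecting duplicate keys at registration time is one simple way to keep the "key uniquely identifies the category" property checkable before the monitoring system starts running.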

The agent tool module 112 records the correspondence between each key and a template file, where the template file includes the fault description parameter set corresponding to the key.

In a possible design, the agent tool module 112 provides a RESTful interface to the business module 113, and the request content of the interface may be any data in JSON (JavaScript Object Notation) format. The agent tool module 112 generates the template file, and each fault description parameter in the fault description parameter set is metadata in the template file. One business failure type may correspond to one template file. JSON is a lightweight syntax for data exchange; for details on JSON, see https://www.w3.org/TR/json-ld/.

The business module 113 interacts with the business system 14 through the network 13. When the business fails, the business module 113 sends business failure information to the agent tool module 112 through the service interface. The business failure information includes the key corresponding to the category of this business failure and the value of each fault description parameter in the fault description parameter set. The values of the fault description parameters accurately describe this business failure, including the service name, the failure reason, and so on. Specifically, the service name may be represented by a process identifier.

The agent tool module 112 receives, through the service interface, the business failure information sent by the business module 113, looks up the template file corresponding to the key according to the recorded correspondence, and writes the values of the fault description parameters into the template file to generate monitoring information.

In a possible design, the agent tool module 112 reads the JSON-format data that the business module 113 (the service process) provides by calling the RESTful interface, writes the value of each fault description parameter into the template file, and generates the final monitoring information. Specifically, the agent tool module 112 generates a value from the filled-in template file, where the value is a character string corresponding to the values of the fault description parameters. Correspondingly, the monitoring information includes the key corresponding to the category of this business failure and the value.
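The template-filling step can be sketched as follows. This is a minimal illustration, not the template format used by the embodiments: the template text, the `$name` placeholder syntax (Python's `string.Template`), and the names `TEMPLATES` and `render_monitoring_info` are all assumptions. The parameter names and the key are taken from the example given later in this description.

```python
from string import Template

# Hypothetical template registered for one key. Each placeholder names one
# fault description parameter from that key's parameter set.
TEMPLATES = {
    "smn-001-001": Template(
        "Senior Alert: $Subject\n"
        "Components $ServiceName $ServiceAddress\n"
        "Error: $Error"
    ),
}


def render_monitoring_info(key, params):
    """Fill the template registered for `key` and return the (key, value) pair.

    Raises KeyError if the key has no template or a parameter is missing.
    """
    template = TEMPLATES[key]
    value = template.substitute(params)
    return key, value


key, value = render_monitoring_info("smn-001-001", {
    "Subject": "Channel Checking",
    "ServiceName": "SMN-NS",
    "ServiceAddress": "127.0.0.1",
    "Error": "Can not connect channel",
})
```

Because the template is looked up by key, changing what the administrator sees for a failure category only requires editing that category's template, not the business module.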

The agent tool module 112 reports the generated monitoring information to the monitoring server 12 through the monitoring client 111.

The agent tool module 112 may invoke the command line tool of the monitoring client 111, synchronously or asynchronously, to report the monitoring information to the monitoring server 12. Alternatively, the agent tool module 112 sends the monitoring information to the monitoring client 111, so that the monitoring client 111 sends it to the monitoring server 12.

To improve the monitoring efficiency of the monitoring system and avoid repeated, high-frequency reporting of the same fault, the agent tool module 112 may also provide a flow control function that limits how many times the same monitoring information is sent. For example, the agent tool module 112 executes a flow control policy that limits the reporting frequency of monitoring information corresponding to the same key to no more than a preset value. Those skilled in the art will understand that the preset value can be set flexibly by the system administrator as required. Preferably, the reporting frequency is set according to the importance of the business.
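One simple way to realize such a per-key limit is a minimum interval between reports for the same key. The sketch below is an assumption about how the policy might be implemented, not the embodiment's algorithm; the class name `PerKeyRateLimiter` and the `min_interval` parameter (the inverse of the preset maximum frequency) are hypothetical.

```python
import time


class PerKeyRateLimiter:
    """Allow at most one report per key within `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the limiter can be tested
        self._last_sent = {}         # key -> time of the last allowed report

    def allow(self, key):
        """Return True if a report for `key` may be sent now."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False             # too soon: suppress the repeated report
        self._last_sent[key] = now
        return True
```

Each key is tracked independently, so a burst of reports for one failure category does not delay the first report for a different category.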

In a possible design, the agent tool module 112 may be deployed separately, or deployed together with the monitoring client 111.

The agent tool module 112 provides a local loopback address (for example, 127.0.0.1) to the business module 113 through the RESTful service interface, and receives the business failure information that the business module 113 transmits over the HyperText Transfer Protocol (HTTP). For a description of the Representational State Transfer (REST) architecture and RESTful interfaces, see https://zh.wikipedia.org/wiki/REST and https://en.wikipedia.org/wiki/RESTful.
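A loopback-only HTTP endpoint of this kind can be sketched with Python's standard library. The request shape (`{"key": ..., "params": {...}}`), the default port 8080, and the names `parse_fault`, `make_handler`, and `make_server` are assumptions made for illustration; the description only specifies a RESTful interface on the loopback address that accepts JSON.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_fault(body):
    """Decode a posted JSON body into (key, params); raise ValueError if malformed."""
    data = json.loads(body)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or "key" not in data or "params" not in data:
        raise ValueError("expected a JSON object with 'key' and 'params'")
    return data["key"], data["params"]


def make_handler(on_fault):
    """Build a request handler that passes each posted fault to `on_fault`."""

    class FaultHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                key, params = parse_fault(self.rfile.read(length))
            except ValueError:
                self.send_response(400)   # malformed request body
                self.end_headers()
                return
            on_fault(key, params)
            self.send_response(204)       # accepted, nothing to return
            self.end_headers()

        def log_message(self, format, *args):
            pass                          # keep the sketch quiet

    return FaultHandler


def make_server(on_fault, port=8080):
    """Bind to the loopback address only, so the interface is not exposed off-host."""
    return HTTPServer(("127.0.0.1", port), make_handler(on_fault))
```

A caller would run it with something like `make_server(handle_fault).serve_forever()`, where `handle_fault` is whatever function fills the template and forwards the result.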

The embodiments of the present application provide a monitored host 11 in a monitoring system, in which an agent tool module 112 (AgentTool) is added. This solution solves the problem of how the business module 113 reports faults after a business failure. First, the business module 113 does not need to be coupled with the monitoring system. It only needs to define, according to its own business exception scenarios, a key and all the fault description parameters, in JSON format, that each scenario requires. Second, by varying the combination of template files and JSON objects, the content and format of the monitoring information can be customized as needed and reported to the monitoring server, which makes it easy for the system administrator to view detailed business exceptions. Third, because the business module 113 and the agent tool module 112 communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module, and therefore does not affect the user's business, which keeps the business safe.

In the embodiments of the present invention, the monitoring system may be a Zabbix system, and the aforementioned command line tool may be Zabbix Sender, which can pass key/value parameters.
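For the Zabbix case, invoking the command line tool might look like the sketch below. It uses the `zabbix_sender` options `-z` (server address), `-s` (host name as registered on the server), `-k` (item key), and `-o` (value). The function names, the default executable path, and the timeout are assumptions, and the server address and host name passed in are placeholders chosen by the caller.

```python
import subprocess


def build_sender_command(server, host, key, value, sender_path="zabbix_sender"):
    """Build the argument list for one zabbix_sender invocation.

    Passing a list (not a shell string) means a multi-line value needs no quoting.
    """
    return [sender_path, "-z", server, "-s", host, "-k", key, "-o", value]


def report(server, host, key, value, timeout=10):
    """Run the sender synchronously; return True if it exits with status 0."""
    result = subprocess.run(
        build_sender_command(server, host, key, value),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

This is the synchronous variant. An asynchronous variant could hand `build_sender_command(...)` to a worker thread or queue so that reporting never blocks the HTTP handler.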

Based on the architecture of the monitoring system shown in FIG. 2, the monitoring method provided by the embodiments of the present application is described below.

FIG. 3 shows a monitoring method provided by an embodiment of the present application.

Step 301: The business module accesses the business system through the network and, according to the exceptions that may occur in the business, determines the key corresponding to each business failure category and the fault description parameter set for the business failure.

The category of a business failure indicates the factor that caused the failure. For example, business failure categories may include link failure, invalid account, business system failure, and so on. A fault description parameter set is a collection of parameters that can accurately describe the cause of the business failure, such as a process identifier, a business system address, and a fault indication. Different business failure categories may correspond to different fault description parameter sets. The fault description parameters in a fault description parameter set may be expressed in JSON format.

An example is as follows:

Key: smn-001-001

JSON body (fault description parameter set):

{
    "Subject": "Channel Checking",
    "ServiceName": "SMN-NS",
    "ServiceAddress": "127.0.0.1",
    "Error": "Error"
}

Step 302: The business module sends the key corresponding to each business failure category, and the fault description parameter set in one-to-one correspondence with the key, to the agent tool module.

In a possible design, the agent tool module provides a RESTful interface to the business module.

Step 303: The agent tool module receives the keys and the corresponding fault description parameter sets sent by the business module, and records the correspondence between each key and a template file, where the template file includes the fault description parameter set corresponding to the key.

The agent tool module generates the template file. Each fault description parameter in the fault description parameter set is a dynamic variable in the template file and represents metadata. One business failure type may correspond to one template file.

Template file example:

Step 304: The system administrator logs in to the monitoring server through the graphical user interface of the monitoring system, and creates the keys and monitoring indicators.

In a possible design, the agent tool module may send the keys and fault description parameter sets to the monitoring server, and the monitoring server creates the keys and monitoring indicators. The monitoring indicators may be in text format and are used to present the received monitoring information.

Step 305: The business module accesses the business system through the network. When it detects a business failure, it calls the service interface provided by the agent tool module and sends business failure information to the agent tool module. The business failure information includes the key corresponding to the category of this business failure and the value of each fault description parameter in the fault description parameter set. The values of the fault description parameters accurately describe this business failure, including the service name, the failure reason, and so on.
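From the business module's side, step 305 amounts to one HTTP POST to the loopback address. The sketch below is an illustration under stated assumptions: the URL (port and path), the request body shape, and the function names are hypothetical. It also shows one way to keep the business flow unaffected when the agent tool module is down, by treating the report as best effort.

```python
import json
import urllib.request

AGENT_TOOL_URL = "http://127.0.0.1:8080/faults"  # hypothetical port and path


def build_fault_request(key, params, url=AGENT_TOOL_URL):
    """Build the POST request that carries one business failure report."""
    body = json.dumps({"key": key, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report_fault(key, params, timeout=2.0):
    """Best-effort report: return False instead of raising if the agent is unreachable."""
    try:
        with urllib.request.urlopen(build_fault_request(key, params), timeout=timeout):
            return True
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False
```

Swallowing connection errors here is a design choice that matches the stated goal that a monitoring failure should not affect the business module; a real service might additionally log the dropped report.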

步骤306：代理工具模块通过所述服务接口接收所述业务模块发送的业务故障信息，根据所述对应关系查找所述key对应的模板文件，将所述各个故障描述参数的取值写入模板文件，生成监控信息。Step 306: The agent tool module receives the service failure information sent by the service module through the service interface, searches for the template file corresponding to the key according to the corresponding relationship, and writes the values of the respective failure description parameters into the template file to generate monitoring information.

步骤307：代理工具模块将监控信息发送给监控服务端。Step 307: The agent tool module sends the monitoring information to the monitoring server.

具体的，代理工具模块读取故障描述参数集中JSON格式的数据，将各个故障描述参数的取值写入到模板文件中生成value，所述value为所述各个故障描述参数的取值对应的字符串。代理工具模块将业务失败类型对应的key以及生成的所述value作为命令行工具的入参，调用监控客户端的命令行工具将监控信息(即所述key以及生成的所述value)发送给监控服务端。Specifically, the proxy tool module reads the data in JSON format in the fault description parameter set, and writes the value of each fault description parameter into the template file to generate a value, where the value is the character corresponding to the value of each fault description parameter string. The proxy tool module uses the key corresponding to the business failure type and the generated value as the input parameters of the command line tool, and calls the command line tool of the monitoring client to send the monitoring information (that is, the key and the generated value) to the monitoring service end.

Specifically, the agent tool module may call the command line tool of the monitoring client synchronously or asynchronously to report the monitoring information to the monitoring server. Alternatively, the agent tool module may send the monitoring information to the monitoring client, so that the monitoring client sends it to the monitoring server.

A JSON-based remote procedure call protocol (JSON remote procedure call, JSON-RPC) may be used between the agent tool module and the monitoring server.
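As an illustration of what a JSON-RPC exchange looks like, the following Python sketch builds a JSON-RPC 2.0 request envelope and unpacks a response. The envelope fields (`jsonrpc`, `method`, `params`, `id`, `result`, `error`) come from the JSON-RPC 2.0 specification. The method name `monitor.report` and the parameter names are hypothetical, since the text does not specify which protocol version or methods are used.

```python
import itertools
import json
from typing import Any, Dict

# Monotonically increasing request ids, so responses can be matched to requests.
_request_ids = itertools.count(1)


def make_jsonrpc_request(method: str, params: Dict[str, Any]) -> str:
    """Serialize a JSON-RPC 2.0 request object."""
    request = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    }
    return json.dumps(request, ensure_ascii=False)


def parse_jsonrpc_response(raw: str) -> Any:
    """Return the result of a JSON-RPC 2.0 response, or raise on an error object."""
    response = json.loads(raw)
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    return response["result"]


# "monitor.report" is a hypothetical method name used only for illustration.
payload = make_jsonrpc_request(
    "monitor.report",
    {"key": "smn-001-001", "value": "Error: Can not connect channel"},
)
print(payload)
```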

An example of the monitoring information is as follows:

key: smn-001-001

value:

Senior Alert: Channel Checking

Components SMN-NS 127.0.0.1

Error: Can not connect channel
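The following Python sketch shows one way a value like the one in the example above could be produced from a template and a JSON object of fault description parameters. It is an illustration under stated assumptions: the template syntax (`string.Template` placeholders), the parameter names `alert`, `component`, `address` and `reason`, and the in-memory `TEMPLATES` table standing in for template files are all hypothetical.

```python
import json
from string import Template

# Stand-in for the key-to-template-file correspondence recorded by the agent
# tool module. A real deployment would read template files from disk.
TEMPLATES = {
    "smn-001-001": (
        "Senior Alert: ${alert}\n"
        "Components ${component} ${address}\n"
        "Error: ${reason}"
    ),
}


def render_value(key: str, params_json: str) -> str:
    """Write the fault description parameter values into the template for `key`.

    Raises KeyError if the key has no template or a placeholder has no value.
    """
    params = json.loads(params_json)
    template = Template(TEMPLATES[key])
    return template.substitute(params)


# Hypothetical fault description parameters sent by the business module.
fault_params = json.dumps({
    "alert": "Channel Checking",
    "component": "SMN-NS",
    "address": "127.0.0.1",
    "reason": "Can not connect channel",
})
value = render_value("smn-001-001", fault_params)
print(value)
```

With these inputs the rendered value matches the three lines of the example. Using `substitute` rather than `safe_substitute` makes a missing parameter fail loudly instead of leaving a raw placeholder in the reported text.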

Step 308: The monitoring server receives the monitoring information and, on determining that a business failure has occurred, triggers an alarm to notify the system administrator.

By adding an agent tool module to the monitored host and having it provide a service interface to the business module, this embodiment of the present invention defines the fault reporting procedure followed by the business module 113 after a business failure, so that the monitoring system can monitor business failures that are not caused by a fault of the monitored host itself. First, the business module does not need to be coupled to the monitoring system; it only needs to define, for each of its own business exception scenarios, a key and all the fault description parameters, in JSON format, that the scenario requires. Second, by varying the combination of template file and JSON object, the content and format of the monitoring information can be customized as needed and the customized monitoring information reported to the monitoring server, which makes it easy for the system administrator to view the details of a business exception. Third, because the business module and the agent tool module communicate over stateless HTTP, a failure of a monitoring system process does not affect the business module and therefore does not affect the user's business, which keeps the business safe.
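The decoupling described above can be sketched from the business module's side as follows. This Python sketch assumes the agent tool module listens for HTTP POST requests on a local loopback address; the port, the URL path, the JSON body layout, and the function names are hypothetical. The point it illustrates is that a reporting failure is swallowed and returned as `False`, so a broken monitoring process cannot break the business flow.

```python
import http.client
import json
import urllib.error
import urllib.request
from typing import Any, Dict

# Hypothetical loopback endpoint of the agent tool module's service interface.
AGENT_URL = "http://127.0.0.1:8650/report"


def build_fault_request(key: str, params: Dict[str, Any],
                        url: str = AGENT_URL) -> urllib.request.Request:
    """Build a stateless HTTP POST carrying the key and the parameter values."""
    body = json.dumps({"key": key, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def report_fault(key: str, params: Dict[str, Any],
                 url: str = AGENT_URL, timeout_s: float = 1.0) -> bool:
    """Send business fault information; never raise into the business flow."""
    request = build_fault_request(key, params, url)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Agent down, refused, or timed out: the business module carries on.
        return False


# Example call; returns False when no agent is listening on the assumed port.
delivered = report_fault("smn-001-001", {"reason": "Can not connect channel"})
```

The short timeout is a design choice for this sketch: it bounds how long a business request can be delayed by an unresponsive agent.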

Corresponding to the foregoing monitoring system and monitoring method, an embodiment of the present invention provides a monitored host that includes the foregoing monitoring client, agent tool module, and business module. The modules in the monitored host perform the functions described for the foregoing monitoring system and monitoring method, which are not repeated here.

Corresponding to the foregoing monitoring system and monitoring method, an embodiment of the present invention provides another monitored host that includes the foregoing monitoring client and agent tool module. The modules in the monitored host perform the functions described for the foregoing monitoring system and monitoring method, which are not repeated here. In this case, the business module may be located on another host that has a network connection to the monitored host.

Based on the same inventive concept, and referring to FIG. 4, an embodiment of the present application further provides another monitored host 400 in a monitoring system, including a transceiver 401, a processor 402, and a memory 403. The transceiver 401 and the memory 403 are both connected to the processor 402. It should be noted that the connection arrangement shown in FIG. 4 is only one possible example. Alternatively, the transceiver 401 and the memory 403 may both be connected to the processor 402 with no connection between the transceiver 401 and the memory 403, or other connection arrangements may be used.

The memory 403 stores a set of programs, and the processor 402 is configured to call the programs stored in the memory 403 to perform the functions of the modules of the monitored host in the monitoring system and monitoring method shown in FIG. 2 and FIG. 3.

In FIG. 4, the processor 402 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor 402 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 403 may include a volatile memory, such as a random-access memory (RAM). The memory 403 may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD). The memory 403 may also include a combination of the above types of memory.

The physical server on which the monitoring server runs may also adopt the hardware structure shown in FIG. 4, which is not described again in this embodiment of the present invention.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The computer program code may be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner.

Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from their spirit and scope. If such changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.

Claims

1. A monitored host, comprising a monitoring client, an agent tool module and a business module, wherein the agent tool module provides a service interface for the business module;

the business module is configured to record a key associated with each business failure category and a fault description parameter set in one-to-one correspondence with each key;

the agent tool module is configured to record a correspondence between each key and a template file, wherein the template file comprises the fault description parameter set corresponding to the key;

the business module is further configured to interact with a business system through a network and, when a business failure occurs, to send business fault information to the agent tool module through the service interface, wherein the business fault information comprises the key corresponding to the category of the business failure and the value of each fault description parameter in the fault description parameter set;

the agent tool module is further configured to receive, through the service interface, the business fault information sent by the business module, look up the template file corresponding to the key according to the correspondence, write the value of each fault description parameter into the template file, and generate monitoring information; and

the agent tool module is further configured to report the generated monitoring information to a monitoring server through the monitoring client.

2. The monitored host according to claim 1, wherein the agent tool module is further configured to generate a value by writing the value of each fault description parameter into the template file, the value being a character string corresponding to the values of the fault description parameters;

correspondingly, the monitoring information comprises the key corresponding to the category of the business failure and the value.

3. The monitored host of claim 1,

the agent tool module is specifically used for calling a command line tool of the monitoring client and reporting the monitoring information to the monitoring server; or,

the agent tool module is specifically configured to send the monitoring information to the monitoring client, so that the monitoring client sends the monitoring information to the monitoring server.

4. The monitored host according to any one of claims 1-3, wherein the agent tool module is specifically configured to provide a local loopback address to the business module through the service interface, and to receive the business fault information transmitted by the business module over the Hypertext Transfer Protocol (HTTP).

5. A monitored host according to any one of claims 1-3,

the agent tool module is further configured to execute a flow control strategy, where the flow control strategy includes that the reporting frequency of monitoring information corresponding to the same key value is limited to be not greater than a preset value.

6. The monitored host according to any one of claims 1-3, wherein the service interface uses HTTP, and the fault description parameters are JavaScript Object Notation (JSON) objects.

7. The monitored host according to any one of claims 1-3, wherein the agent tool module is co-located with the monitoring client.

8. A monitoring system, comprising a monitoring client, an agent tool module and a monitoring server, wherein the monitoring client and the agent tool module run on a monitored host, and the agent tool module provides a service interface for a business module;

the agent tool module is configured to record a correspondence between keys and template files, wherein each template file comprises the fault description parameter set corresponding to its key, and each key corresponds to a business failure category;

the agent tool module is further configured to receive, through the service interface, business fault information sent by the business module, wherein the business fault information comprises the key corresponding to the category of the business failure and the value of each fault description parameter in the fault description parameter set;

the agent tool module is further configured to look up the template file corresponding to the key according to the correspondence, write the fault description parameters into the template file, generate monitoring information, and report the generated monitoring information to the monitoring server through the monitoring client; and

the monitoring server is configured to receive the monitoring information.

9. The monitoring system according to claim 8, wherein the agent tool module is further configured to generate a value by writing the value of each fault description parameter into the template file, the value being a character string corresponding to the values of the fault description parameters;

10. A monitoring system in accordance with claim 8,

11. A monitoring system according to any one of claims 8-10,

the agent tool module is specifically configured to provide a local loopback address to the business module through the service interface, and to receive the business fault information transmitted by the business module over HTTP.

12. A monitoring system according to any one of claims 8-10,

13. A monitoring method, comprising:

a business module records a key associated with each business failure category and a fault description parameter set in one-to-one correspondence with each key, and sends the key corresponding to each business failure category and the corresponding fault description parameter set to an agent tool module;

the agent tool module records a correspondence between each key and a template file, wherein the template file comprises the fault description parameter set corresponding to the key;

the business module interacts with a business system through a network and, when a business failure occurs, sends business fault information to the agent tool module through a service interface, wherein the business fault information comprises the key corresponding to the category of the business failure and the value of each fault description parameter in the fault description parameter set;

the agent tool module receives, through the service interface, the business fault information sent by the business module, looks up the template file corresponding to the key according to the correspondence, writes the value of each fault description parameter into the template file, and generates monitoring information; and

the agent tool module reports the generated monitoring information to a monitoring server through a monitoring client.

14. The monitoring method according to claim 13, wherein the value is a character string corresponding to the values of the fault description parameters, and accordingly, the monitoring information comprises the key corresponding to the category of the current business failure and the value.

15. The monitoring method according to claim 13, wherein the agent tool module reporting the generated monitoring information to the monitoring server through the monitoring client comprises:

the agent tool module calls a command line tool of the monitoring client to report the monitoring information to the monitoring server; or

the agent tool module sends the monitoring information to the monitoring client, so that the monitoring client sends the monitoring information to the monitoring server.

16. The monitoring method according to any one of claims 13-15, wherein the agent tool module receiving, through the service interface, the business fault information sent by the business module comprises:

the agent tool module provides a local loopback address to the business module through the service interface and receives the business fault information transmitted by the business module over HTTP.

17. A method of monitoring as claimed in any of claims 13 to 15, the method further comprising:

and the agent tool module executes a flow control strategy, wherein the flow control strategy comprises that the reporting frequency of the monitoring information corresponding to the same key value is limited to be not more than a preset value.