Disclosure of Invention
The embodiment of the application provides a monitored host, a monitoring system and a monitoring method in a monitoring system, which are used for solving the problem that the monitoring system cannot find a fault in time when a link between the monitored host and a service system fails or the service system fails.
The embodiment of the application provides the following specific technical scheme: in a first aspect, a monitored host in a monitoring system is provided, where the monitored host includes a monitoring client, an agent tool module and a service module; the agent tool module provides a service interface to the service module, and the service module records a key associated with each service failure category and a fault description parameter set corresponding one to one to the key; the agent tool module records the correspondence between the key and a template file, and the template file comprises the fault description parameter set corresponding to the key; when the interaction between the service module and a service system outside the host fails, the service module sends the key corresponding to the service failure and the values of the fault description parameters corresponding to the failure to the agent tool module through the service interface; and the agent tool module writes the values of the fault description parameters into the template file corresponding to the key, generates monitoring information, and reports it to the monitoring server. By adding the agent tool module to the monitored host and having the agent tool module provide a service interface to the service module, the embodiment of the invention defines the fault reporting flow followed by the service module after a service failure, so that the monitoring system can monitor service failures caused by faults outside the monitored host. The service module does not need to be coupled with the monitoring system; it only needs to define, for each of its abnormal scenarios, the key and the fault description parameters in JSON format that the scenario requires.
In a possible design, after writing the values of the fault description parameters into the template file, the agent tool module generates a value, where the value is a character string corresponding to the values of the fault description parameters; correspondingly, the monitoring information includes the key corresponding to the service failure category and the value.
In another possible design, the agent tool module may report the monitoring information to the monitoring server by calling a command line tool of the monitoring client; or, the agent tool module sends the monitoring information to the monitoring client, so that the monitoring client sends the monitoring information to the monitoring server.
In another possible design, the agent tool module provides a local loopback address to the service module through the service interface, and receives the service failure information transmitted by the service module over HTTP.
The fault description parameters can adopt JSON objects.
In a possible scenario, by combining and changing the template file and the JSON object, the embodiment of the invention can customize the content and format of the monitoring information as needed and report the customized monitoring information to the monitoring server, which makes it easier for the system administrator to check the detailed abnormal conditions of the service. In addition, because stateless HTTP communication is adopted between the service module and the agent tool module, a failure of a monitoring system process does not affect the service module, so the user's service is unaffected and the safety of the service is ensured.
The agent tool module can also execute a flow control strategy to limit the reporting frequency of service failures of the same type. The flow control strategy comprises limiting the reporting frequency of the monitoring information corresponding to the same key to no more than a preset value.
The agent tool module may be combined with the monitoring client.
In a second aspect, there is provided a monitoring system comprising: the monitoring client and the agent tool module run on a monitored host, and the agent tool module provides a service interface for the business module; wherein the agent module has a function of implementing the agent module described in the above first aspect.
In a third aspect, a monitoring method is provided, where corresponding to the first aspect, the service module, the agent module, and the monitoring server execute functions of corresponding modules in the first aspect.
In a fourth aspect, there is provided another monitored host in a monitoring system, wherein the monitored host has a function of implementing the behavior of the monitored host in the first aspect and any one of the possible designs. The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the monitored host in the monitoring system includes a transceiver and a processor, wherein the processor is configured to invoke a set of program code to perform the method as described in the second aspect and any one of the possible designs.
In a fifth aspect, there is provided a computer storage medium for storing computer software instructions for a monitored host of the above aspects, comprising a program designed for executing the above aspects.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, the present application will be further described with reference to the accompanying drawings.
Fig. 2 is a schematic structural diagram of a monitoring system provided in an embodiment of the present invention. The monitoring system includes a monitored host 11 and a monitoring server 12 and is connected to a service system 14 through a network 13. Specifically, the monitored host 11 includes a monitoring client 111, an agent tool module 112, and a service module 113. The agent tool module 112 provides a service interface to the service module 113, the service module 113 interacts with the agent tool module 112 through the service interface, and the service module 113 interacts with the service system 14 by calling an interface of the service system 14.
In one possible scenario, a user logs in to the monitored host 11, runs an application on the host 11, accesses to the external service system 14 through the service module 113, and accesses a service provided by the service system 14. For example, the service system 14 may be a short message system, and the service module 113 is connected to a short message center of the service system 14 through the network 13, and sends a short message to the outside through the short message center.
In order to achieve the above object, in the embodiment of the present invention, an agent module 112 is newly added to the host 11, the agent module 112 provides a service interface to the service module 113, and when the service module 113 detects a service failure, service failure information is reported to the agent module 112 through the service interface, so that the monitoring information of the service failure is reported to the monitoring server through the agent module 112.
In one possible scenario, the host 11 may be any physical server in a physical server cluster, which may be a cloud computing physical server cluster providing cloud services to users; in another possible scenario, the host 11 may be a stand-alone physical server. The monitoring server 12 may run on a separate physical server.
In one possible design, the service module 113 may be a service process for processing services. Illustratively, the service module 113 may be an NS (Notification Service) process that communicates with a short message center.
The service module 113 is connected to the external service system 14, and accesses services provided by the service system 14. When the service access fails, the service module 113 determines the category of the service failure and the set of fault description parameters of the service failure.
It should be noted that the category of the service failure represents the factor causing the service failure. For example, the category of the service failure may include a link failure, an account failure, a service system failure, and the like, and the set of fault description parameters may include parameters that accurately describe the cause of the service failure, such as a process identifier, a service system address, and a failure indication. Different service failure categories may correspond to different sets of fault description parameters. It can be understood by those skilled in the art that the category of the service failure and the set of fault description parameters may be flexibly defined according to different scenarios, and the embodiment of the present invention does not limit the category of the service failure to the above examples.
Further, the monitoring system may describe the monitoring information using a key-value format. In this case, before the monitoring system starts to operate, the service module 113 may record a key associated with each service failure category and a set of fault description parameters corresponding one to one to the key, and the service module 113 sends the key corresponding to each service failure category and the set of fault description parameters corresponding one to one to the key to the agent tool module 112. In one possible design, each service failure category is assigned a key that uniquely identifies it. For example, the service failure categories may be connection timeout, no link response, link port failure, and the like, and the corresponding keys may be set freely or according to the definition rule of the task.
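The one-to-one relation between a key and its set of fault description parameters can be pictured as a small registry. The following Python sketch is illustrative only: the class name FailureCategoryRegistry and its methods are hypothetical, and the key and parameter names are borrowed from the example given later in this description.

```python
from dataclasses import dataclass, field


@dataclass
class FailureCategoryRegistry:
    """Maps each key to the names of its fault description parameters."""

    # key -> tuple of parameter names (exactly one parameter set per key)
    _params_by_key: dict = field(default_factory=dict)

    def register(self, key: str, param_names: list) -> None:
        # A key identifies one failure category, so registering it twice
        # would break the one-to-one relation and is treated as an error.
        if key in self._params_by_key:
            raise ValueError(f"key already registered: {key}")
        self._params_by_key[key] = tuple(param_names)

    def missing_params(self, key: str, values: dict) -> list:
        # Return the declared parameters that a failure report left out.
        declared = self._params_by_key[key]
        return [name for name in declared if name not in values]


registry = FailureCategoryRegistry()
registry.register(
    "smn-001-001",
    ["Subject", "ServiceName", "ServiceAddress", "Error"],
)
```

With such a registry, a report that omits a declared parameter could be detected before any monitoring information is generated.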
The agent tool module 112 records a correspondence between a key and a template file, where the template file includes a fault description parameter set corresponding to the key.
In one possible design, the agent tool module 112 provides a RESTful interface to the service module 113, and the request content of the interface may be arbitrary data in JSON (JavaScript Object Notation) format. The agent tool module 112 generates a template file, and each fault description parameter in the fault description parameter set is metadata in the template file. One service failure category may correspond to one template file. JSON is a lightweight syntax format for data exchange; for a detailed description of JSON, see https://www.w3.org/TR/json-ld/.
The service module 113 interacts with the service system 14 through the network 13, and when a service fails, sends service failure information to the agent module 112 through the service interface, where the service failure information includes keys corresponding to the type of the service failure and values of each failure description parameter in the failure description parameter set, where the values of each failure description parameter can accurately represent information of the service failure, including a service name, a failure reason, and the like, and specifically, the service name can be represented by a process identifier.
The agent tool module 112 receives the service fault information sent by the service module 113 through the service interface, searches the template file corresponding to the key according to the correspondence, writes the value of each fault description parameter into the template file, and generates monitoring information.
In a possible design, the agent tool module 112 reads data in JSON format provided by the service module 113 (service process) calling the RESTful interface, writes values of each fault description parameter into a template file, and generates final monitoring information. Specifically, the agent tool module 112 generates a value according to the written template file, where the value is a character string corresponding to the value of each fault description parameter, and correspondingly, the monitoring information includes a key corresponding to the type of the service failure and the value.
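As a minimal sketch of this step, the template lookup and the generation of the value can be written in Python with string.Template. The template text, the TEMPLATES table and the function name build_monitoring_info are assumptions made for illustration; the description does not specify the template syntax.

```python
from string import Template

# Hypothetical template text: each $placeholder is one fault description
# parameter. The layout is an assumption, not the template of the embodiment.
TEMPLATES = {
    "smn-001-001": Template(
        "Alert:$Subject\n"
        "Components:$ServiceName $ServiceAddress\n"
        "Error:$Error"
    ),
}


def build_monitoring_info(key: str, params: dict) -> tuple:
    """Look up the template for `key`, fill it, and return (key, value)."""
    template = TEMPLATES[key]            # KeyError if the key is unknown
    value = template.substitute(params)  # KeyError if a parameter is missing
    return key, value


info = build_monitoring_info(
    "smn-001-001",
    {
        "Subject": "Channel Checking",
        "ServiceName": "SMN-NS",
        "ServiceAddress": "127.0.0.1",
        "Error": "Can not connect channel",
    },
)
```

Because the value is an ordinary character string, changing the template changes the format of the monitoring information without any change to the service module.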
The agent module 112 reports the generated monitoring information to the monitoring server 12 through the monitoring client 111.
The agent tool module 112 may call a command line tool of the monitoring client 111 in a synchronous or asynchronous manner, and report the monitoring information to the monitoring server 12; alternatively, the agent module 112 sends the monitoring information to the monitoring client 111, so that the monitoring client 111 sends the monitoring information to the monitoring server 12.
In order to improve the monitoring efficiency of the monitoring system and avoid repeated and high-frequency reporting of the same fault, the agent module 112 may further have a flow control function to limit the number of times of repeated sending of the same type of monitoring information. For example, the agent module 112 executes a flow control policy, where the flow control policy includes that the reporting frequency of the monitoring information corresponding to the same key value is not greater than a preset value. Those skilled in the art understand that the preset value can be flexibly set by a system administrator according to requirements, and preferably, the reporting frequency can be set according to the service importance.
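One way such a flow control policy might be realized is a fixed-window counter per key, sketched below in Python. The class name KeyRateLimiter, the window length and the limit are assumptions; the description only requires that the reporting frequency for the same key not exceed a preset value.

```python
import time
from typing import Callable


class KeyRateLimiter:
    """Allows at most `max_reports` reports per key in each time window."""

    def __init__(self, max_reports: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_reports = max_reports
        self.window_seconds = window_seconds
        self._clock = clock
        # key -> (window start time, reports counted in that window)
        self._windows = {}

    def allow(self, key: str) -> bool:
        now = self._clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a new one.
            start, count = now, 0
        if count >= self.max_reports:
            return False  # over the preset value: drop this report
        self._windows[key] = (start, count + 1)
        return True
```

The agent tool module would call allow(key) before reporting and skip the report when it returns False. A more important service could simply be given a larger max_reports value.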
In one possible design, the agent module 112 may be deployed alone or in combination with the monitoring client 111.
The agent tool module 112 provides a local loopback address (e.g., 127.0.0.1) to the service module 113 through a RESTful service interface, and receives service failure information transmitted by the service module 113 over the Hypertext Transfer Protocol (HTTP). For a description of the Representational State Transfer (REST) architecture and RESTful interfaces, see https:// zh.
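A receiver of this kind can be sketched with the Python standard library. The port number 8650, the JSON field names "key" and "params", and the names parse_fault_report, FaultReportHandler and make_server are all hypothetical; only the loopback address and the use of HTTP come from the description.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_fault_report(body: bytes) -> tuple:
    """Split a JSON request body into (key, fault description parameters)."""
    report = json.loads(body)
    # The field names "key" and "params" are assumptions for this sketch.
    return report["key"], report["params"]


class FaultReportHandler(BaseHTTPRequestHandler):
    received = []  # stand-in for template filling and reporting

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            self.received.append(parse_fault_report(self.rfile.read(length)))
            self.send_response(204)  # accepted, no response body
        except (ValueError, KeyError, TypeError):
            self.send_response(400)  # malformed report
        self.end_headers()


def make_server(port: int = 8650) -> HTTPServer:
    # Binding to 127.0.0.1 keeps the interface reachable only from this host.
    return HTTPServer(("127.0.0.1", port), FaultReportHandler)
```

Calling make_server().serve_forever() would start the listener. Each request is handled independently, which matches the stateless interaction described above.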
The embodiment of the present application provides a monitored host 11 in a monitoring system, in which an agent tool module 112 (AgentTool) is added to the host 11; the above scheme solves the problem of reporting a fault after a service failure of the service module 113. First, the service module 113 does not need to be coupled with the monitoring system; it only needs to define, for each of its abnormal scenarios, the key and the fault description parameters in JSON format that the scenario requires. Second, by combining and changing the template file and the JSON object, the content and the format of the monitoring information can be customized as needed and the customized monitoring information can be reported to the monitoring server, so that a system administrator can conveniently check the detailed abnormal conditions of the service. Third, because stateless HTTP communication is used between the service module 113 and the agent tool module 112, a failure of a monitoring system process does not affect the service module, so the user's service is unaffected and the safety of the service is ensured.
In the embodiment of the present invention, the monitoring system may be a Zabbix system, and the command line tool may be a Zabbix Sender, which may transmit a Key/Value parameter.
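For the Zabbix case, a synchronous call to the command line tool might look like the following Python sketch. It assumes that the zabbix_sender executable is on the PATH and uses its -z (server), -s (host), -k (key) and -o (value) options; the function names and the example addresses are hypothetical.

```python
import subprocess


def build_sender_command(server: str, host: str, key: str, value: str) -> list:
    """Build the zabbix_sender argument list for one key/value item."""
    return [
        "zabbix_sender",
        "-z", server,  # address of the Zabbix server
        "-s", host,    # host name as registered on the server
        "-k", key,     # item key
        "-o", value,   # item value
    ]


def report(server: str, host: str, key: str, value: str) -> bool:
    """Run zabbix_sender synchronously; True if it exited with status 0."""
    cmd = build_sender_command(server, host, key, value)
    # Passing a list (no shell) keeps the value from being shell-interpreted.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    return result.returncode == 0
```

An asynchronous variant could start the same command with subprocess.Popen and check the exit status later.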
Based on the architecture of the monitoring system shown in fig. 2, the monitoring method provided in the embodiment of the present application will be described below.
Referring to fig. 3, a monitoring method according to an embodiment of the present application is shown.
Step 301: the service module accesses the service system through the network, and determines the key corresponding to the service failure category and the fault description parameter set of the service failure according to the possible abnormal condition of the service.
The category of the service failure represents the factor causing the service failure. For example, the category of the service failure may include a link failure, an account failure, a service system failure, and the like, and the set of fault description parameters may include parameters that accurately describe the cause of the service failure, such as a process identifier, a service system address, and a failure indication. Different service failure categories may correspond to different sets of fault description parameters. Each fault description parameter in the set of fault description parameters may be in JSON format.
Examples are as follows:
Key:smn-001-001
JSON body (set of fault description parameters):
{
"Subject":"Channel Checking",
"ServiceName":"SMN-NS",
"ServiceAddress":"127.0.0.1",
"Error":"Error"
}
Step 302: The service module sends the key corresponding to each service failure category and the fault description parameter set corresponding one to one to the key to the agent tool module.
In one possible design, the agent tool module provides a RESTful interface to the service module.
Step 303: and the agent tool module receives the keys sent by the service module and the fault description parameter sets corresponding to the keys one by one, and records the corresponding relation between the keys and the template file, wherein the template file comprises the fault description parameter sets corresponding to the keys.
The agent tool module generates a template file, and each fault description parameter in the fault description parameter set is a dynamic variable in the template file and used for representing metadata. One service failure type may correspond to one template file.
Template file example:
Step 304: A system administrator logs in to the monitoring server through a graphical user interface of the monitoring system and creates a key and a monitoring index.
In one possible design, the agent module may send the key and the set of fault description parameters to the monitoring server, and the monitoring server creates the key and a monitoring index, where the monitoring index may be in a text format and used to present the received monitoring information.
Step 305: the service module accesses the service system through the network, when a service failure is found, a service interface provided by the agent tool module is called, and service failure information is sent to the agent tool module, wherein the service failure information comprises keys corresponding to the type of the service failure and values of all failure description parameters in a failure description parameter set, and the values of all the failure description parameters can accurately represent the information of the service failure, including a service name, failure reasons and the like.
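On the service module side, the call in step 305 can be sketched as a single HTTP POST to the loopback interface. In this Python sketch the URL (port 8650 and path /faults), the JSON field names "key" and "params", and the function names are assumptions; the description does not fix them.

```python
import json
import urllib.request

# Hypothetical endpoint: the port and path are not given in the description.
AGENT_URL = "http://127.0.0.1:8650/faults"


def build_fault_request(key: str, params: dict) -> urllib.request.Request:
    """Wrap the key and the parameter values in a JSON POST request."""
    body = json.dumps({"key": key, "params": params}).encode("utf-8")
    return urllib.request.Request(
        AGENT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_fault(key: str, params: dict, timeout: float = 2.0) -> bool:
    """Send one report; return False instead of raising if delivery fails."""
    request = build_fault_request(key, params)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # A failed report must not disturb the service itself.
        return False
```

Swallowing delivery errors in send_fault reflects the design goal that a failure on the monitoring side leaves the service module unaffected.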
Step 306: and the agent tool module receives the service fault information sent by the service module through the service interface, searches the template file corresponding to the key according to the corresponding relation, writes the value of each fault description parameter into the template file, and generates monitoring information.
Step 307: and the agent tool module sends the monitoring information to the monitoring server.
Specifically, the agent tool module reads data in a JSON format in the fault description parameter set, and writes the value of each fault description parameter into a template file to generate a value, where the value is a character string corresponding to the value of each fault description parameter. And the agent tool module takes the key corresponding to the service failure type and the generated value as the input parameters of the command line tool, and calls the command line tool of the monitoring client to send monitoring information (namely the key and the generated value) to the monitoring server.
Specifically, the agent tool module may call a command line tool of the monitoring client in a synchronous or asynchronous manner, and report the monitoring information to the monitoring server; or, the agent module sends the monitoring information to the monitoring client, so that the monitoring client sends the monitoring information to the monitoring server.
The agent tool module and the monitoring server may communicate using a remote call protocol based on JSON remote procedure call (JSON-RPC).
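For orientation, a JSON-RPC 2.0 request is a JSON object with the members "jsonrpc", "method", "params" and "id". The Python sketch below builds such an envelope; the method name "monitor.report" and the parameter layout are hypothetical and do not come from any particular monitoring product.

```python
import itertools
import json

# Each request carries an id so that the response can be matched to it.
_request_ids = itertools.count(1)


def make_jsonrpc_request(method: str, params: dict) -> str:
    """Serialize one JSON-RPC 2.0 request object."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": next(_request_ids),
    })


# "monitor.report" is a made-up method name used only for illustration.
request_text = make_jsonrpc_request(
    "monitor.report",
    {"key": "smn-001-001", "value": "Error:Can not connect channel"},
)
```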
Examples of monitoring information are as follows:
key:smn-001-001
value:
Senior Alert:Channel Checking
Components SMN-NS 127.0.0.1
Error:Can not connect channel
Step 308: The monitoring server receives the monitoring information and, when it determines that the service has failed, triggers an alarm to notify a system administrator.
By adding the agent tool module to the monitored host and having the agent tool module provide a service interface to the service module, the embodiment of the invention defines the fault reporting flow followed by the service module after a service failure, so that the monitoring system can monitor service failures caused by faults outside the monitored host. First, the service module does not need to be coupled with the monitoring system; it only needs to define, for each of its abnormal scenarios, the key and the fault description parameters in JSON format that the scenario requires. Second, by combining and changing the template file and the JSON object, the content and the format of the monitoring information can be customized as needed and the customized monitoring information can be reported to the monitoring server, so that a system administrator can conveniently check the detailed abnormal conditions of the service. Third, because stateless HTTP communication is adopted between the service module and the agent tool module, a failure of a monitoring system process does not affect the service module, so the user's service is unaffected and the safety of the service is ensured.
Corresponding to the monitoring system and the monitoring method, an embodiment of the present invention provides a monitored host, including the monitoring client, the agent module, and the service module. Each module in the monitored host executes the functions in the monitoring system and the monitoring method, and the embodiment of the invention is not repeated again.
Corresponding to the foregoing monitoring system and monitoring method, an embodiment of the present invention provides another monitored host, including the foregoing monitoring client and agent module. Each module in the monitored host executes the functions in the monitoring system and the monitoring method, and the embodiment of the invention is not repeated again. The service module may be located on another host in a network connection relationship with the monitored host.
Based on the same inventive concept, referring to fig. 4, an embodiment of the present application further provides another monitored host 400 in a monitoring system, which includes a transceiver 401, a processor 402, and a memory 403, where both the transceiver 401 and the memory 403 are connected to the processor 402. It should be noted that the connection manner between the parts shown in fig. 4 is only one possible example: the transceiver 401 and the memory 403 may both be connected to the processor 402 with no connection between the transceiver 401 and the memory 403, or other connection manners may be used.
The memory 403 stores a set of programs, and the processor 402 is configured to call the programs stored in the memory 403 to perform the functions of the modules of the monitored host in the monitoring system and the monitoring method shown in fig. 2 and 3.
In FIG. 4, the processor 402 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor 402 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The aforementioned PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory 403 may include a volatile memory, such as a random-access memory (RAM); the memory 403 may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 403 may also comprise a combination of the above kinds of memories.
The physical server where the monitoring server is located may also adopt a hardware structure as shown in fig. 4. The embodiment of the invention is not described in detail.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program codes may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.