WO2018214009A1 - Server monitoring method and system - Google Patents

Server monitoring method and system

Info

Publication number
WO2018214009A1
Authority
WO
WIPO (PCT)
Prior art keywords
template
node server
maintenance
server
data
Prior art date
Application number
PCT/CN2017/085437
Other languages
French (fr)
Chinese (zh)
Inventor
王一庭
牛丽华
蒋民
Original Assignee
深圳中兴力维技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳中兴力维技术有限公司 filed Critical 深圳中兴力维技术有限公司
Priority to PCT/CN2017/085437 priority Critical patent/WO2018214009A1/en
Publication of WO2018214009A1 publication Critical patent/WO2018214009A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0681 Configuration of triggering conditions

Definitions

  • the present invention relates to the technical field of computer room operation and maintenance, and in particular, to a server monitoring method and system.
  • monitoring of servers is very important. It usually involves a large amount of server data, such as hardware resource usage, the number of software transactions, and the number of requests. As the system continues to expand, the types of servers also increase.
  • the parameters that need to be monitored also differ between server types. For example, a storage server focuses on the IOPS and storage space of the system, while an algorithm server focuses on CPU usage and is not sensitive to hard disk usage. Separate monitoring parameters therefore have to be set for different devices. If the number of devices is small, configuring the monitoring parameters separately is relatively easy. However, as the number and types of devices increase, configuring the monitoring parameters becomes very troublesome. Therefore, a monitoring method using templates is proposed here, which supports both manual configuration and automatic judgment based on historical data.
  • the main purpose of the present invention is to provide a server monitoring method and system, which are convenient for setting and updating corresponding monitoring parameters for different hosts.
  • the present invention provides a server monitoring method, where the method includes:
  • the master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, and sends the template parameter to the slave node server corresponding to the monitored host;
  • the slave node server compares the data generated by the monitored host with the template parameter, and when the data generated by the monitored host meets the template parameter, the slave node server reports the data to the primary node server;
  • the primary node server reports the data to the operation and maintenance platform.
  • the method further includes:
  • the master node server generates an alarm parameter value from the data reported by the slave node server multiple times in a predetermined interval, and determines whether the alarm parameter value is smaller than the abnormal boundary value; if yes, the master node server reports to the operation and maintenance platform to raise an alarm.
  • the method further includes:
  • the master node server automatically generates an alarm template, and sends the alarm template to the slave node server and reports to the operation and maintenance platform;
  • the operation and maintenance platform reports the alarm template to the user end.
  • the method further includes:
  • the operation and maintenance platform receives an operation and maintenance template configured by the user end;
  • the operation and maintenance platform saves the operation and maintenance template to the primary node server.
  • the template parameters include CPU usage, memory usage, network input/output, hard disk input/output, remaining space of the hard disk, number of network connections, monitoring of important service ports, and the usage parameters of the deployed dedicated software itself.
  • the present invention further provides a server monitoring system, where the system includes a primary node server, at least one monitored host, a slave node server corresponding to the monitored host, and an operation and maintenance platform, wherein:
  • the master node server is configured to select a template parameter corresponding to the monitored host in an operation and maintenance template, and send the template parameter to the slave node server corresponding to the monitored host;
  • the slave node server is configured to compare the data generated by the monitored host with the template parameter, and to report the data to the primary node server when the data generated by the monitored host meets the template parameter;
  • the primary node server is further configured to report the data to the operation and maintenance platform.
  • the primary node server is further configured to: generate an alarm parameter value from the data reported by the slave node server multiple times in a predetermined interval, and determine whether the alarm parameter value is less than an abnormal boundary value; if yes, report to the operation and maintenance platform to raise an alarm.
  • the primary node server is further configured to automatically generate an alarm template, and send the template to the slave node server and report to the operation and maintenance platform;
  • the operation and maintenance platform is configured to report the alarm template to the user end.
  • the operation and maintenance platform is further configured to receive an operation and maintenance template configured by the client, and save the operation and maintenance template to the primary node server.
  • the template parameters include central processor occupancy, memory usage, network input/output, hard disk input/output, remaining space of the hard disk, number of network connections, monitoring of important service ports, and the usage parameters of the deployed dedicated software itself.
  • in the server monitoring method and system, the primary node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, and sends the template parameter to the slave node server corresponding to the monitored host. The slave node server compares the data generated by the monitored host with the template parameters. When the data generated by the monitored host meets the template parameters, the slave node server reports the data to the primary node server, and the primary node server reports the data to the operation and maintenance platform. This reduces the complexity of acquiring operation and maintenance parameters for different types of servers in the operation and maintenance system. The master-slave deployment of node servers allows unified operation, maintenance, and management of the system; templates give servers of the same type consistent operation and maintenance handling; and template inheritance allows flexible handling of operation and maintenance parameters of similar types.
  • FIG. 1 is a schematic flowchart of a server monitoring method according to a first embodiment of the present invention
  • FIG. 2 is a schematic diagram of an example of a server monitoring method according to a preferred embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a server monitoring method according to a second embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a server monitoring method according to a third embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a server monitoring system according to a fourth embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a server monitoring method according to a preferred embodiment of the present invention. The method includes the following steps:
  • Step 110 The master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, and sends the template parameter to the slave node server corresponding to the monitored host.
  • one primary node server is deployed independently alongside multiple monitored hosts. Each monitored host acts as a monitored terminal, and a slave node server is independently deployed on each monitored host. The slave node server connects to the master node server in order to respond to the operation and maintenance parameter commands sent by the master node server and to report operation and maintenance data and alarms.
  • the operation and maintenance template is configured on the primary node server, and the operation and maintenance template has a template parameter corresponding to each monitored host, so that the primary node server selects the template parameter corresponding to the monitored host and sends it to the slave node server that monitors that host.
  • the template parameters include at least: central processing unit (CPU) occupancy rate, memory usage, network input/output (IO), hard disk input/output (IO), remaining space of the hard disk, number of network connections, important service port monitoring, and the usage parameters of the deployed dedicated software.
  • the streaming media forwarding server only needs to pay attention to CPU usage, memory usage, network IO, the number of network connections, and the service parameters of the service itself, whereas the storage server has a different focus: network IO, hard disk IO, the remaining space of the hard disk, and its own service parameters.
  • the monitored host of the trunk node and the monitored host of the edge node need different parameters to be monitored.
  • the slave node server obtains the data corresponding to the monitored host.
  • the slave node server and the master node server have a unified interface protocol to ensure communication consistency.
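Step 110 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class `MasterNode`, the host identifiers, the host-type names, and all threshold values except the 40% CPU figure are hypothetical, and a plain dictionary stands in for the real network link between master and slave node servers.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MasterNode:
    # O&M template (hypothetical layout): host type -> {parameter: alarm threshold}
    templates: Dict[str, Dict[str, float]]
    # monitored host id -> host type (hypothetical identifiers)
    host_types: Dict[str, str]
    # stand-in for the network: host id -> parameters "sent" to its slave node
    sent: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def dispatch(self, host_id: str) -> Dict[str, float]:
        """Select the template parameters for one monitored host and send them."""
        params = dict(self.templates[self.host_types[host_id]])
        self.sent[host_id] = params
        return params


master = MasterNode(
    templates={
        "storage": {"network_io": 0.8, "disk_io": 0.7, "disk_free": 0.1},
        "streaming": {"cpu": 0.4, "memory": 0.8, "network_io": 0.8, "connections": 1000},
    },
    host_types={"host-a": "storage", "host-b": "streaming"},
)
params_b = master.dispatch("host-b")
```

Each host type receives only the parameters its template lists, so the streaming host is never checked against hard disk thresholds.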
  • Step 120 The slave node server compares the data generated by the monitored host with the template parameter, and when the data generated by the monitored host meets the template parameter, the slave node server reports the data to the master node server.
  • the slave node server obtains the data of the monitored host and compares the acquired data with the template parameters.
  • when the template parameters are met, the data is abnormal, and the operation and maintenance data of the monitored host is reported.
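The comparison in step 120 might look like the following sketch. The function name `check_host`, the metric names, and the report layout are assumptions made for illustration; the "greater than threshold" rule follows the CPU example in this document and would need to be inverted for parameters such as remaining hard disk space.

```python
def check_host(sample, thresholds):
    """Return the metrics whose sampled value exceeds the template threshold.

    Metrics that the template does not list are ignored.
    """
    return {
        name: value
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }


# Hypothetical template parameters received from the master node server.
thresholds = {"cpu": 0.40, "memory": 0.80}
# Hypothetical data collected from the monitored host.
sample = {"cpu": 0.55, "memory": 0.30, "disk_io": 0.99}

exceeded = check_host(sample, thresholds)
# Only build a report for the master node server when a threshold is met.
report = {"host": "host-b", "data": sample, "alarms": exceeded} if exceeded else None
```

Here `disk_io` is high but not part of the template, so it does not trigger a report on its own.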
  • the template parameters of the equipment room include CPU usage, memory usage, network IO, hard disk IO, remaining space of the hard disk, number of network connections, status of important service ports, and so on.
  • the corresponding alarm thresholds are a1-a6.
  • the monitored hosts in the equipment room include an intelligent analysis server, a storage server, and a streaming media forwarding server.
  • the template parameters of the intelligent analysis server include the CPU usage and memory usage.
  • the corresponding alarm thresholds are b1-b2.
  • the template parameters of the storage server include the CPU usage, the memory usage, the network IO, the hard disk IO, the remaining space of the hard disk, and the network connection parameters.
  • the corresponding alarm thresholds are c1-c6.
  • the template parameters of the streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d1-d4.
  • the streaming media forwarding server also includes servers of edge nodes: a core network streaming media forwarding server, a backbone network streaming media forwarding server, and an edge streaming media forwarding server.
  • the template parameters of the core network streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d11-d14.
  • the template parameters of the backbone network streaming media forwarding server include the CPU occupancy rate, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d21-d24.
  • the template parameters of the edge streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d31-d34.
  • the streaming media forwarding server is the server of the primary node relative to the servers of the edge nodes.
  • for example, the template parameter d1 of the streaming media forwarding server of the backbone node is set to 40%. When the CPU usage of the streaming media forwarding server of the backbone node acquired by the slave node server is greater than 40%, an alarm is generated and the data is reported.
  • the template parameter d31 of the edge streaming media forwarding server can be set to 30%. When the CPU usage of the edge streaming media forwarding server obtained by the slave node server is greater than 30%, an alarm is generated and the data is reported to the primary node server.
  • the edge node server is configured on the basis of the backbone node server: the template parameters of the backbone node only need to be inherited by the template of the edge node, and the template parameters of other servers do not need to be changed, which reduces the workload to the greatest extent.
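The inheritance between the backbone template and the edge template can be sketched as a small tree of templates. The `Template` class and every value other than the 40% and 30% CPU thresholds are hypothetical; the sketch only shows that a sub-template needs to store the parameters it changes.

```python
class Template:
    """An O&M template node in an inheritance tree (illustrative only)."""

    def __init__(self, name, params=None, parent=None):
        self.name = name
        self.own = dict(params or {})  # parameters defined on this template
        self.parent = parent           # template this one inherits from, if any

    def resolve(self):
        """Merge parameters from the root down; a child's own values win."""
        merged = self.parent.resolve() if self.parent else {}
        merged.update(self.own)
        return merged


# Main template: CPU threshold 40% as in the example; other values are made up.
streaming = Template(
    "streaming",
    {"cpu": 0.40, "memory": 0.80, "network_io": 0.80, "connections": 1000},
)
# Edge sub-template: inherits everything and overrides only the CPU threshold.
edge = Template("edge_streaming", {"cpu": 0.30}, parent=streaming)
edge_params = edge.resolve()
```

Changing a threshold on the main template later is picked up by every sub-template that does not override it, which is the workload saving the text describes.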
  • Step 130 The primary node server reports the data to the operation and maintenance platform.
  • the master node server reports the received data to the operation and maintenance platform, and the operation and maintenance platform reports the data to the client, so that the operator processes the data through the client.
  • the primary node server selects the template parameter corresponding to the monitored host in the operation and maintenance template and sends the template parameter to the slave node server corresponding to the monitored host. The slave node server compares the data generated by the monitored host with the template parameters. When the data generated by the monitored host meets the template parameters, the slave node server reports the data to the primary node server, and the primary node server reports the data to the operation and maintenance platform.
  • a second embodiment of the present invention further provides a server monitoring method, where the method includes:
  • Step 310 The master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, and sends the template parameter to the slave node server corresponding to the monitored host.
  • Step 320 The slave node server compares data generated by the monitored host with the template parameter, and when the data generated by the monitored host meets the template parameter, the slave node server reports the data to the primary node server.
  • Step 330 The primary node server reports the data to the operation and maintenance platform.
  • Step 340 The master node server generates an alarm parameter value from the data reported by the slave node server multiple times in a predetermined interval, and determines whether the alarm parameter value is less than an abnormal boundary value; if yes, it proceeds to step 350.
  • the operation and maintenance parameter value $x$ obtained by the slave node server can be abstracted as approximately Gaussian distributed.
  • the operation and maintenance parameter $x_i$ generates $t$ data points $\{x(1), x(2), \ldots, x(t)\}$, assuming that a total of $j$ kinds of operation and maintenance parameters participate in the judgment calculation.
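The formulas for the alarm parameter value did not survive in this text. As a hedged illustration only, the sketch below uses the standard Gaussian anomaly-detection recipe: estimate a mean and variance for each parameter from its history, multiply the per-parameter densities, and compare the product with a boundary value. This is an assumption about what the step computes, not the patent's confirmed formula; the names `fit`, `density`, `epsilon`, and all sample values are hypothetical.

```python
import math
from statistics import fmean, pvariance


def fit(history):
    """Estimate (mean, variance) per parameter from its historical samples."""
    return {name: (fmean(xs), pvariance(xs)) for name, xs in history.items()}


def density(sample, model):
    """Product of independent Gaussian densities over all parameters."""
    p = 1.0
    for name, (mu, var) in model.items():
        x = sample[name]
        p *= math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p


# Hypothetical history: t = 5 samples for each of j = 2 parameters.
history = {
    "cpu": [0.30, 0.32, 0.28, 0.31, 0.29],
    "memory": [0.50, 0.52, 0.49, 0.51, 0.48],
}
model = fit(history)
epsilon = 1e-3  # hypothetical abnormal boundary value

normal = density({"cpu": 0.30, "memory": 0.50}, model)
abnormal = density({"cpu": 0.90, "memory": 0.50}, model)
```

A sample is treated as abnormal when its density falls below `epsilon`. The densities are not probabilities and can exceed 1, and a parameter whose history has zero variance would need special handling before this sketch could be used.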
  • Step 350 The master node server reports the operation and maintenance platform to perform an alarm.
  • the master node server may determine that the operation and maintenance machine is abnormal, and reports to the operation and maintenance platform to raise an alarm.
  • Step 360 The master node server automatically generates an alarm template, and sends the alarm template to the slave node server and reports to the operation and maintenance platform.
  • the alarm template is automatically generated according to the alarm data reported from the node server, and is sent to the slave node server and reported to the operation and maintenance platform.
  • the master node server can deliver the template directly to the slave node server over TCP, using XML as the data format. The template takes effect on the slave node server immediately, which ensures timely acquisition of operation and maintenance data and alarms.
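The text names XML over TCP but gives no schema. The sketch below shows one possible encoding using Python's standard `xml.etree.ElementTree`; the element and attribute names (`template`, `param`, `name`, `threshold`) are assumptions, and the TCP transport itself is left out so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET


def template_to_xml(name, params):
    """Serialize a template to XML bytes (hypothetical schema)."""
    root = ET.Element("template", {"name": name})
    for key, threshold in params.items():
        ET.SubElement(root, "param", {"name": key, "threshold": str(threshold)})
    return ET.tostring(root, encoding="utf-8")


def template_from_xml(payload):
    """Parse the XML bytes back into a template name and its thresholds."""
    root = ET.fromstring(payload)
    params = {p.get("name"): float(p.get("threshold")) for p in root.findall("param")}
    return root.get("name"), params


# Master side: build the payload that would be written to the TCP connection.
payload = template_to_xml("edge_streaming", {"cpu": 0.3, "connections": 1000})
# Slave side: decode the payload and apply the thresholds immediately.
name, params = template_from_xml(payload)
```

Because the payload is a complete, self-describing document, the slave node server can replace its active thresholds as soon as one message has been parsed.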
  • the generation of the automatic alarm template is periodic, and the calculation of the larger sampling data is performed every other period of time, thereby avoiding the pressure of multiple calculations.
  • the T is automatically cleared.
  • the primary node server generally does not actively connect to the secondary node server; only when the operation and maintenance parameters need to be changed does the primary node server actively send signaling to the secondary node server.
  • Step 370 The operation and maintenance platform reports the alarm template to the user end.
  • the primary node server generates the alarm parameter value from the data reported by the slave node server multiple times in the predetermined interval, and when the alarm parameter value is less than the abnormal boundary value, the primary node server reports to the operation and maintenance platform to raise an alarm.
  • the master node server automatically generates an alarm template and automatically updates the template parameters to simplify the operation.
  • a third embodiment of the present invention further provides a server monitoring method.
  • the server monitoring method is a further improvement based on the first embodiment and the second embodiment, except that, before step 110 or step 310, the following steps are further included:
  • Step 410 The operation and maintenance platform receives the operation and maintenance template configured by the client.
  • the operator configures the operation and maintenance template through the user end, and the user end uploads the operation and maintenance template to the operation and maintenance platform, so that the operation and maintenance platform receives the operation and maintenance template.
  • the template of the slave node server can be manually or automatically configured on the master node server, and the monitored host of the same type can use a monitoring template, and the template includes monitoring parameters required by the server of the type.
  • the template specifies the operation and maintenance parameters of the monitored host, including hardware performance parameters and deployed software service parameters.
  • the configuration of the template parameters determines whether the operation and maintenance data of the monitored host is reported, including the alarm threshold and whether the historical operation and maintenance data is stored.
  • the manual template configured on the primary node server can be inherited.
  • when different servers need to be added and they differ only slightly from an existing template, most of the parameters of the original template can be inherited, and only a small number of parameters need to be modified to generate a new sub-template.
  • templates are inherited in a tree structure.
  • new operation and maintenance templates can also be generated automatically from historical operation and maintenance data, to achieve an optimal configuration of operation and maintenance parameters.
  • Step 420 The operation and maintenance platform saves the operation and maintenance template to the primary node server.
  • after the operation and maintenance template is configured on the user end, it is first determined whether to use the sub-template corresponding to the edge node server. If so, the main template parameters are inherited to configure the sub-template parameters; if not, the parameters of the operation and maintenance template are saved directly to the primary node server.
  • the master node server can communicate directly with the slave node server, so that modified operation and maintenance template parameters take effect immediately.
  • the operation and maintenance platform configures the operation and maintenance template and saves it to the primary node server. The operation and maintenance template gives servers of the same type consistent operation and maintenance handling, and template inheritance allows flexible handling of operation and maintenance parameters of similar types.
  • a fourth embodiment of the present invention provides a server monitoring system, which includes: an operation and maintenance platform 510, a client 520 connected to the operation and maintenance platform 510, and a master node server 530. The master node server 530 is in communication with at least one slave node server 540. Each slave node server 540 corresponds to a monitored host (not shown).
  • the master node server 530 is configured to select a template parameter corresponding to the monitored host in the operation and maintenance template, and send the template parameter to the slave node server 540 corresponding to the monitored host.
  • one master node server 530 is independently deployed, and multiple monitored hosts are deployed.
  • each monitored host acts as a monitored terminal, and the slave node server 540 is independently deployed on each monitored host. It connects to the master node server 530 in order to respond to the operation and maintenance parameter commands sent by the master node server 530 and to report operation and maintenance data and alarms.
  • the master node server 530 is configured with an operation and maintenance template, and the operation and maintenance template has a template parameter corresponding to each monitored host, so that the master node server 530 selects the template parameter corresponding to the monitored host and sends it to the slave node server 540 that monitors that host.
  • the template parameters include at least: central processing unit (CPU) occupancy rate, memory usage, network input/output (IO), hard disk input/output (IO), remaining space of the hard disk, number of network connections, important service port monitoring, and the usage parameters of the deployed dedicated software.
  • the streaming media forwarding server only needs to pay attention to CPU usage, memory usage, network IO, the number of network connections, and the service parameters of the service itself, whereas the storage server has a different focus: network IO, hard disk IO, the remaining space of the hard disk, and its own service parameters.
  • the monitored host of the backbone node and the monitored host of the edge node need different parameters to be monitored.
  • the slave node server 540 acquires the data corresponding to the monitored host.
  • the slave node server 540 and the master node server 530 have a unified interface protocol to ensure communication consistency.
  • the slave node server 540 is configured to compare the data generated by the monitored host with the template parameter, and when the data generated by the monitored host meets the template parameter, the slave node server 540 reports the data to the primary node server 530.
  • the slave node server 540 obtains the data corresponding to the monitored host and compares the acquired data with the template parameters.
  • when the template parameters are met, the data is abnormal, and the operation and maintenance data of the monitored host is reported to the primary node server 530.
  • the template parameters of the equipment room include CPU usage, memory usage, network IO, hard disk IO, remaining space of the hard disk, number of network connections, status of important service ports, and so on.
  • the corresponding alarm thresholds are a1-a6.
  • the monitored hosts in the equipment room include an intelligent analysis server, a storage server, and a streaming media forwarding server.
  • the template parameters of the intelligent analysis server include the CPU usage and the memory usage.
  • the corresponding alarm thresholds are b1-b2.
  • the template parameters of the storage server include the CPU usage, the memory usage, the network IO, the hard disk IO, the remaining space of the hard disk, and the network connection parameters.
  • the corresponding alarm thresholds are c1-c6.
  • the template parameters of the streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d1-d4.
  • the streaming media forwarding server further includes servers of edge nodes: a core network streaming media forwarding server, a backbone network streaming media forwarding server, and an edge streaming media forwarding server.
  • the template parameters of the core network streaming media forwarding server include the CPU usage, the memory usage, the network IO, and the number of network connections.
  • the corresponding alarm thresholds are d11-d14.
  • the template parameters of the backbone network streaming media forwarding server include the CPU occupancy rate, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d21-d24.
  • the template parameters of the edge streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections.
  • the corresponding alarm thresholds are d31-d34.
  • the streaming media forwarding server is the server of the primary node relative to the servers of the edge nodes.
  • for example, the template parameter d1 of the streaming media forwarding server of the backbone node is set to 40%. When the CPU usage of the streaming media forwarding server of the backbone node acquired by the slave node server 540 is greater than 40%, an alarm is generated and the data is reported to the primary node server 530.
  • the template parameter d31 of the edge streaming media forwarding server may be set to 30%. When the CPU usage of the edge streaming media forwarding server acquired by the slave node server 540 is greater than 30%, an alarm is generated and the data is reported to the primary node server 530.
  • the edge node server is configured on the basis of the backbone node server: the template parameters of the backbone node only need to be inherited by the template of the edge node, and the template parameters of other servers do not need to be changed, which reduces the workload to the greatest extent.
  • the master node server 530 is further configured to report the data to the operation and maintenance platform 510.
  • the master node server 530 reports the received data to the operation and maintenance platform 510, and the operation and maintenance platform 510 reports the data to the client 520, so that the operator processes the data through the client 520.
  • the master node server 530 is further configured to generate the alarm parameter value from the data reported by the slave node server 540 multiple times in the predetermined interval, and to determine whether the alarm parameter value is smaller than the abnormal boundary value; if yes, it reports to the operation and maintenance platform 510 to raise an alarm.
  • the operation and maintenance parameter value $x$ obtained by the slave node server 540 can be abstracted as approximately Gaussian distributed (if the parameter curve is asymmetric, $\log x$ can be used instead of $x$ to bring the curve as close as possible to a Gaussian distribution). A template can then be generated automatically from the historical operation and maintenance data to determine whether an alarm needs to be reported.
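The effect of the log transform on an asymmetric parameter can be checked with a small sketch. The helper `skew_gap` and the sample values are hypothetical; the helper is a crude symmetry indicator (mean minus median, scaled by the range), not a measure taken from the patent.

```python
import math
from statistics import fmean, median


def skew_gap(xs):
    """Mean minus median, scaled by the range: near 0 for symmetric data."""
    return (fmean(xs) - median(xs)) / (max(xs) - min(xs))


# Hypothetical right-skewed samples, e.g. a request count that grows in bursts.
raw = [1, 2, 4, 8, 16, 32, 64]
# Log-transformed samples; all values must be strictly positive.
logged = [math.log(x) for x in raw]

raw_gap = skew_gap(raw)
logged_gap = skew_gap(logged)
```

For this data the raw samples have a mean well above the median, while the logged samples are evenly spaced and therefore symmetric, which makes a Gaussian fit more reasonable.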
  • the operation and maintenance parameter $x_i$ generates $t$ data points $\{x(1), x(2), \ldots, x(t)\}$, assuming that a total of $j$ kinds of operation and maintenance parameters participate in the judgment calculation.
  • the abnormal boundary value s is determined,
  • the master node server 530 can determine that the operation and maintenance machine is abnormal and report to the operation and maintenance platform 510 to raise an alarm.
  • the master node server 530 is further configured to automatically generate an alarm template, send it to the slave node server 540, and report it to the operation and maintenance platform 510.
  • the alarm template is automatically generated according to the alarm data reported from the node server 540, and is sent to the slave node server 540 and reported to the operation and maintenance platform 510.
  • the master node server 530 can deliver the template directly to the slave node server 540 using XML as the data format, which ensures timely acquisition of operation and maintenance data and alarms.
  • the generation of the automatic alarm template is periodic, and the calculation of the larger sampling data is performed every other period of time, thereby avoiding the pressure of multiple calculations.
  • the master node server 530 does not actively connect to the slave node server 540. Only when the operation and maintenance parameters need to be changed, the master node server 530 actively sends signaling to the slave node server 540.
  • the operation and maintenance platform 510 is configured to report the alarm template to the client 520.
  • the operation and maintenance platform 510 is further configured to receive an operation and maintenance template configured by the client 520, and save the operation and maintenance template to the primary node server 530.
  • the operator configures the operation and maintenance template through the client 520, and the client 520 uploads the operation and maintenance template to the operation and maintenance platform 510, so that the operation and maintenance platform 510 receives the operation and maintenance template.
  • the master node server 530 can manually or automatically configure the template of the slave node server 540.
  • the same type of monitored host can use a monitoring template, and the template includes monitoring parameters required by the type server.
  • the template specifies the operation and maintenance parameters of the monitored host, including hardware performance parameters and deployed software service parameters.
  • the setting of the template parameters specifies whether the operation and maintenance data of the monitored host is reported, including the alarm threshold and whether the historical operation and maintenance data is stored.
  • the manual template configured on the master node server 530 can be inherited. When a different kind of server needs to be added and it differs only slightly from an existing template, most parameters of the original template can be inherited, and only a few parameters need to be modified to generate a new child template. Templates inherit in a tree structure.
  • new operation and maintenance templates can also be generated automatically from historical operation and maintenance data, to optimize the configuration of operation and maintenance parameters.
  • when the operation and maintenance template is configured on the client 520, it is first determined whether to use a sub-template corresponding to the edge node server. If so, the sub-template parameters are configured by inheriting the main template parameters; if not, the parameters of the operation and maintenance template are saved directly to the master node server 530.
  • the operation and maintenance template is modified directly on the operation and maintenance platform 510; after the instruction is sent to the master node server 530, the master node server 530 can communicate directly with the slave node server 540, so that the modified operation and maintenance template parameters take effect.
  • in the server monitoring system of this embodiment, the master node server 530 selects a template parameter corresponding to the monitored host in the operation and maintenance template and sends the template parameter to the slave node server 540 corresponding to the monitored host. The slave node server 540 compares the data generated by the monitored host with the template parameters. When the data generated by the monitored host matches the template parameters, the slave node server 540 reports the data to the master node server 530, and the master node server 530 reports the data to the operation and maintenance platform 510. This reduces the complexity of obtaining operation and maintenance parameters for different types of servers in the operation and maintenance system: the master-slave node server deployment unifies management of the system's operation and maintenance, templates give servers of the same type consistent operation and maintenance handling, and template inheritance flexibly handles differences among operation and maintenance parameters of the same type.
  • the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present invention.
  • in the server monitoring method and system provided by the present invention, the master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template and sends the template parameter to the slave node server corresponding to the monitored host. The slave node server compares the data generated by the monitored host with the template parameters. When the data generated by the monitored host matches the template parameters, the slave node server reports the data to the master node server, and the master node server reports the data to the operation and maintenance platform.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention provides a server monitoring method and system belonging to the technical field of machine room operation and maintenance. The method comprises: a master node server selecting from an operation and maintenance template a template parameter corresponding to a monitored host, and sending the template parameter to a slave node server corresponding to the monitored host; the slave node server comparing data generated by the monitored host with the template parameter, and the slave node server reporting data to the master node server when the data generated by the monitored host matches the template parameter; and the master node server reporting the data to an operation and maintenance platform. The server monitoring method and system provided by the present invention reduce complexities of operation and maintenance parameter acquisition for different types of servers in an operation and maintenance system, perform centralized management on operation and maintenance of a system by means of deploying master and slave node servers, perform unified processing on operation and maintenance of the same type of servers through use of templates, and flexibly handle differentiation of the same type of operation and maintenance parameters by means of template inheritance.

Description

Specification
Title of Invention: Server Monitoring Method and System
Technical Field
[0001] The present invention relates to the technical field of equipment room operation and maintenance, and in particular to a server monitoring method and system.
Background Art
[0002] In equipment room operation and maintenance, server monitoring is a very important part. Many kinds of server data usually need to be monitored, such as hardware resource usage and the software's transaction and request counts. As systems keep growing, however, the variety of servers also increases, and different types of servers need different parameters monitored. For example, for a storage server the main concerns are the system's IOPS and remaining storage space, whereas for an algorithm server the main concern is CPU usage, and hard disk usage matters little. Operation and maintenance therefore has to set separate monitoring parameters for different devices. With few devices, configuring monitoring parameters individually is still fairly easy, but as the number and variety of devices grow, configuring them becomes cumbersome. A template-based monitoring method is therefore proposed here, supporting both manual configuration and automatic determination based on historical data.
Technical Problem
[0003] The main purpose of the present invention is to provide a server monitoring method and system that make it convenient to set and update the monitoring parameters corresponding to different hosts.
Solution to Problem
Technical Solution
[0004] To achieve the above purpose, the present invention provides a server monitoring method, the method comprising:
[0005] a master node server selecting, in an operation and maintenance template, a template parameter corresponding to a monitored host, and sending the template parameter to a slave node server corresponding to the monitored host;
[0006] the slave node server comparing data generated by the monitored host with the template parameter, and, when the data generated by the monitored host matches the template parameter, the slave node server reporting the data to the master node server;
[0007] the master node server reporting the data to an operation and maintenance platform.
[0008] Optionally, the method further comprises:
[0009] the master node server generating an alarm parameter value from the data reported by the slave node server and received multiple times within a predetermined time period, and determining whether the alarm parameter value is less than an abnormal boundary value; if so, the master node server reporting to the operation and maintenance platform for an alarm.
[0010] Optionally, the method further comprises:
[0011] the master node server automatically generating an alarm template, sending it to the slave node server, and reporting it to the operation and maintenance platform;
[0012] the operation and maintenance platform reporting the alarm template to a client.
[0013] Optionally, before the master node server selects, in the operation and maintenance template, the template parameter corresponding to the monitored host, the method further comprises:
[0014] the operation and maintenance platform receiving an operation and maintenance template configured by the client;
[0015] the operation and maintenance platform saving the operation and maintenance template to the master node server.
[0016] Optionally, the template parameters include central processing unit usage, memory usage, network input/output, hard disk input/output, remaining hard disk space, number of network connections, listening status of important service ports, and usage of the various parameters of the deployed dedicated software itself.
[0017] In addition, to achieve the above purpose, the present invention further provides a server monitoring system, the system comprising a master node server, at least one monitored host, a slave node server corresponding to the monitored host, and an operation and maintenance platform, wherein:
[0018] the master node server is configured to select, in an operation and maintenance template, a template parameter corresponding to the monitored host, and to send the template parameter to the slave node server corresponding to the monitored host;
[0019] the slave node server is configured to compare data generated by the monitored host with the template parameter and, when the data generated by the monitored host matches the template parameter, to report the data to the master node server;
[0020] the master node server is further configured to report the data to the operation and maintenance platform.
[0021] Optionally, the master node server is further configured to generate an alarm parameter value from the data reported by the slave node server and received multiple times within a predetermined time period, and to determine whether the alarm parameter value is less than an abnormal boundary value; if so, it reports to the operation and maintenance platform for an alarm.
[0022] Optionally, the master node server is further configured to automatically generate an alarm template, send it to the slave node server, and report it to the operation and maintenance platform;
[0023] the operation and maintenance platform is configured to report the alarm template to a client.
[0024] Optionally, the operation and maintenance platform is further configured to receive an operation and maintenance template configured by the client, and to save the operation and maintenance template to the master node server.
[0025] Optionally, the template parameters include central processing unit usage, memory usage, network input/output, hard disk input/output, remaining hard disk space, number of network connections, listening status of important service ports, and usage of the various parameters of the deployed dedicated software itself.
Advantageous Effects of Invention
Advantageous Effects
[0026] In the server monitoring method and system proposed by the present invention, the master node server selects, in the operation and maintenance template, the template parameter corresponding to the monitored host and sends the template parameter to the slave node server corresponding to the monitored host; the slave node server compares the data generated by the monitored host with the template parameter; when the data generated by the monitored host matches the template parameter, the slave node server reports the data to the master node server; and the master node server reports the data to the operation and maintenance platform. This reduces the complexity of obtaining operation and maintenance parameters for different types of servers in the operation and maintenance system: the master-slave node server deployment unifies management of the system's operation and maintenance, templates give servers of the same type consistent operation and maintenance handling, and template inheritance flexibly handles differences among operation and maintenance parameters of the same type.
Brief Description of Drawings
Description of Drawings
[0027] FIG. 1 is a schematic flowchart of a server monitoring method according to a first embodiment of the present invention;
[0028] FIG. 2 is a schematic diagram of an example of a server monitoring method according to a preferred embodiment of the present invention;
[0029] FIG. 3 is a schematic flowchart of a server monitoring method according to a second embodiment of the present invention;
[0030] FIG. 4 is a schematic flowchart of a server monitoring method according to a third embodiment of the present invention;
[0031] FIG. 5 is a schematic architecture diagram of a server monitoring system according to a fourth embodiment of the present invention.
[0032] The realization of the purpose, the functional features, and the advantages of the present invention are further described below in connection with the embodiments and with reference to the accompanying drawings.
Embodiments of the Invention
[0033] Embodiments of the present invention are described in detail below. Examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting it.
[0034] Referring to FIG. 1, which is a schematic flowchart of a server monitoring method according to a preferred embodiment of the present invention, the method includes the following steps:
[0035] Step 110: the master node server selects, in the operation and maintenance template, a template parameter corresponding to the monitored host, and sends the template parameter to the slave node server corresponding to the monitored host.
[0036] Specifically, in the operation and maintenance platform, one master node server is deployed independently along with multiple monitored hosts. Each monitored host acts as a monitored end, and a slave node server is deployed independently on each monitored host to connect to the master node server, so as to respond immediately to operation and maintenance parameter instructions issued by the master node server and to report operation and maintenance data and alarms immediately.
[0037] An operation and maintenance template is configured on the master node server. The template contains template parameters corresponding to each monitored host, so the master node server selects the template parameters corresponding to a monitored host and sends them to the slave node server that monitors that host.
[0038] Further, the template parameters include at least: central processing unit (CPU) usage, memory usage, network input/output (IO), hard disk input/output (IO), remaining hard disk space, number of network connections, listening status of important service ports, and usage of the various parameters of the deployed dedicated software itself.
[0039] Further, because of the diversity of current servers, different types of slave node servers need to monitor different parameters. For example, a streaming media forwarding server only needs to attend to CPU usage, memory usage, network IO, number of network connections, and the service's own business parameters, whereas a storage server may have different concerns, focusing on network IO, hard disk IO, remaining hard disk space, and its own business parameters.
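The selection in step 110 can be pictured as a lookup from host type to parameter list. The following is only a minimal sketch under stated assumptions: the names `MonitoredHost`, `OM_TEMPLATE`, and `select_template_parameters`, as well as the parameter identifiers, are hypothetical and do not come from the specification; only the two parameter sets mirror the streaming media forwarding and storage examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoredHost:
    """Hypothetical description of a monitored host."""
    name: str
    host_type: str


# Hypothetical operation and maintenance template: host type -> monitored parameters.
OM_TEMPLATE = {
    "streaming_forwarding": (
        "cpu_usage", "memory_usage", "network_io",
        "network_connections", "service_params",
    ),
    "storage": (
        "network_io", "disk_io", "disk_free_space", "service_params",
    ),
}


def select_template_parameters(host: MonitoredHost) -> tuple:
    """Return the parameter names the slave node server should monitor for this host."""
    try:
        return OM_TEMPLATE[host.host_type]
    except KeyError:
        raise ValueError(f"no template for host type {host.host_type!r}") from None
```

In such a sketch, adding a new server type only requires one new entry in the mapping; the per-host configuration stays untouched.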
[0040] Further, the monitored hosts of backbone nodes and the monitored hosts of edge nodes need different parameters monitored.
[0041] Further, the data of the corresponding monitored host is obtained through the slave node server.
[0042] Further, the slave node server and the master node server share a unified interface protocol to ensure consistent communication.
[0043] Step 120: the slave node server compares the data generated by the monitored host with the template parameter, and when the data generated by the monitored host matches the template parameter, the slave node server reports the data to the master node server.
[0044] Specifically, the slave node server obtains the data of the corresponding monitored host and compares it with the template parameters. A match with the template parameters indicates abnormal data, and the slave node server reports the monitored host's operation and maintenance data to the master node server.
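The comparison in step 120 might look like the sketch below. It assumes, following the CPU usage example given later in the description, that data "matches" a template parameter when the sampled value exceeds its alarm threshold; a metric such as remaining hard disk space would need the opposite comparison. The function names and the `send` callback are hypothetical.

```python
def find_violations(sample: dict, thresholds: dict) -> dict:
    """Return the metrics whose sampled value exceeds its alarm threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }


def report_if_abnormal(sample: dict, thresholds: dict, send) -> bool:
    """Call send(...) with the sample only when at least one threshold is exceeded.

    `send` stands in for whatever transport the slave node server uses to
    reach the master node server; it is an assumption of this sketch.
    """
    violations = find_violations(sample, thresholds)
    if violations:
        send({"sample": sample, "violations": violations})
        return True
    return False
```

Reporting only on a match keeps normal samples off the link between the slave and master node servers.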
[0045] By way of example, as shown in FIG. 2, the equipment room template parameters include CPU usage, memory usage, network IO, hard disk IO, remaining hard disk space, number of network connections, important service port status, and so on; the alarm thresholds set for the operation and maintenance parameters are a1-a6.
[0046] The monitored hosts in the equipment room include an intelligent analysis server, a storage server, and a streaming media forwarding server. The template parameters of the intelligent analysis server include CPU usage and memory usage, with corresponding alarm thresholds b1-b2. The template parameters of the storage server include CPU usage, memory usage, network IO, hard disk IO, remaining hard disk space, and network connection parameters, with corresponding alarm thresholds c1-c6. The template parameters of the streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections, with corresponding alarm thresholds d1-d4. The streaming media forwarding server also includes servers of edge nodes: a core network streaming media forwarding server, a backbone network streaming media forwarding server, and an edge streaming media forwarding server. The template parameters of the core network streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections, with corresponding alarm thresholds d11-d14. The template parameters of the backbone network streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections, with corresponding alarm thresholds d21-d24. The template parameters of the edge streaming media forwarding server include CPU usage, memory usage, network IO, and number of network connections, with corresponding alarm thresholds d31-d34. In this example, relative to the edge node servers, the streaming media forwarding server is the backbone node server.
[0047] If the template parameter d1 of the backbone node's streaming media forwarding server is set to 40%, then when the CPU usage of that server, as obtained by the slave node server, exceeds 40%, an alarm is generated and the data is reported to the master node server. The template parameter d31 of the edge streaming media forwarding server can be set to 30%, so that when the CPU usage of the edge streaming media forwarding server, as obtained by the slave node server, exceeds 30%, an alarm is generated and the data is reported to the master node server. In other words, when an edge node server is configured on the basis of a backbone node server, it suffices to have the edge node's template inherit most of the backbone node's template parameters, while the template parameters of other servers need no change, which minimizes the workload.
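The inheritance described here can be sketched as a small tree in which a child template stores only the parameters it overrides. This is an illustrative sketch, not the patented implementation: the `Template` class and all names are hypothetical, and only the 40% and 30% CPU thresholds come from the text; the other threshold values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Template:
    """Hypothetical template node; children keep only their overrides."""
    name: str
    parent: Optional["Template"] = None
    overrides: dict = field(default_factory=dict)

    def resolve(self) -> dict:
        """Merge thresholds from the root of the tree down to this template."""
        base = self.parent.resolve() if self.parent else {}
        return {**base, **self.overrides}


# Backbone template: cpu_usage of 40% is from the text, the rest is illustrative.
backbone = Template(
    "streaming_backbone",
    overrides={
        "cpu_usage": 40.0,
        "memory_usage": 80.0,
        "network_io": 70.0,
        "network_connections": 1000.0,
    },
)

# Edge template inherits everything and changes only the CPU threshold to 30%.
edge = Template("streaming_edge", parent=backbone, overrides={"cpu_usage": 30.0})
```

Resolving `edge` yields the backbone thresholds with only the CPU threshold replaced, which is the "inherit most, modify a few" behavior the paragraph describes.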
[0048] Step 130: the master node server reports the data to the operation and maintenance platform.
[0049] Specifically, the master node server reports the received data to the operation and maintenance platform, and the operation and maintenance platform in turn reports the data to the client, so that the operator can handle the data through the client.
[0050] In the server monitoring method of this embodiment, the master node server selects, in the operation and maintenance template, the template parameter corresponding to the monitored host and sends the template parameter to the slave node server corresponding to the monitored host; the slave node server compares the data generated by the monitored host with the template parameter; when the data generated by the monitored host matches the template parameter, the slave node server reports the data to the master node server; and the master node server reports the data to the operation and maintenance platform. This reduces the complexity of obtaining operation and maintenance parameters for different types of servers in the operation and maintenance system: the master-slave node server deployment unifies management of the system's operation and maintenance, templates give servers of the same type consistent operation and maintenance handling, and template inheritance flexibly handles differences among operation and maintenance parameters of the same type.
[0051] Referring to FIG. 3, a second embodiment of the present invention further provides a server monitoring method, the method including:
[0052] Step 310: the master node server selects, in the operation and maintenance template, a template parameter corresponding to the monitored host, and sends the template parameter to the slave node server corresponding to the monitored host.
[0053] Step 320: the slave node server compares the data generated by the monitored host with the template parameter, and when the data generated by the monitored host matches the template parameter, the slave node server reports the data to the master node server.
[0054] Step 330: the master node server reports the data to the operation and maintenance platform.
[0055] The content of steps 310-330 above is the same as that of steps 110-130 of the first embodiment and is not repeated in this embodiment.
[0056] Step 340: the master node server generates an alarm parameter value from the data reported by the slave node server and received multiple times within a predetermined time period, and determines whether the alarm parameter value is less than an abnormal boundary value; if so, the method proceeds to step 350.
[0057] Specifically, the operation and maintenance parameter value X obtained by the slave node server can be treated, as an abstraction, as approximately Gaussian-distributed (if the parameter's curve is asymmetric, log X can be used in place of X so that the curve approaches a Gaussian distribution as closely as possible). A template can then be generated automatically from the historical operation and maintenance data to decide whether an alarm needs to be reported.
[0058] Suppose that within a period T, an operation and maintenance parameter xi produces t data points {x(1), x(2), ..., x(t)}, and that j kinds of operation and maintenance parameters in total take part in the calculation.
[0059] The mean $\mu_j$ is taken over the running period:
$$\mu_j = \frac{1}{t}\sum_{i=1}^{t} x_j^{(i)}$$
[0060] The standard deviation $\sigma_j$ is taken:
$$\sigma_j = \sqrt{\frac{1}{t}\sum_{i=1}^{t}\left(x_j^{(i)} - \mu_j\right)^2}$$
[0061] A Gaussian function is established for the alarm parameter value, with the product taken over the operation and maintenance parameters $j$:
$$f(x) = \prod_{j} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left(-\frac{\left(x_j - \mu_j\right)^2}{2\sigma_j^2}\right)$$
[0062] The abnormal boundary value $s$ is determined from the $n$ normal values of the historical samples reported by the slave node server to the master node server (with all abnormal values in the sampling period removed):
[0063] $$s = \min_{1 \le i \le n} f\left(x^{(i)}\right)$$
[0064] It is determined whether the alarm parameter value f(x) is less than the abnormal boundary value s; if so, the method proceeds to step 350.
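The check in step 340 can be illustrated with a short sketch. It rests on assumptions that the extracted text does not confirm: that f(x) is a product of independent per-parameter Gaussian densities, that the standard deviation is the population form (divided by t), and that the boundary s is the smallest density observed among the normal historical samples. All function names are hypothetical.

```python
import math


def fit(samples):
    """Per-parameter mean and (population) standard deviation.

    `samples` is a list of rows, one value per parameter in each row.
    Assumes every parameter actually varies, so no standard deviation is zero.
    """
    t = len(samples)
    columns = list(zip(*samples))
    means = [sum(col) / t for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / t)
        for col, m in zip(columns, means)
    ]
    return means, stds


def density(x, means, stds):
    """Assumed form of f(x): product of independent Gaussian densities."""
    p = 1.0
    for v, m, s in zip(x, means, stds):
        p *= math.exp(-((v - m) ** 2) / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
    return p


def boundary(normal_samples, means, stds):
    """Assumed boundary s: lowest density among the normal samples."""
    return min(density(x, means, stds) for x in normal_samples)


def is_abnormal(x, means, stds, s):
    """True when f(x) < s, i.e. the condition for entering step 350."""
    return density(x, means, stds) < s
```

With samples clustered around 30% CPU and 50% memory, for instance, a reading far from both would have a density well below s and would be flagged, while a reading at the cluster center would not.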
[0065] Step 350: the master node server reports to the operation and maintenance platform for an alarm.
[0066] Specifically, when f(x) < s, the master node server can determine that the machine under operation and maintenance is abnormal, and reports to the operation and maintenance platform for an alarm.
[0067] Step 360: the master node server automatically generates an alarm template, sends it to the slave node server, and reports it to the operation and maintenance platform.
[0068] Specifically, the alarm template is generated automatically from the alarm data reported by the slave node server, sent to the slave node server, and reported to the operation and maintenance platform.
[0069] Further, the master node server can deliver the template directly to the slave node server over the TCP protocol, carried as XML, where it takes effect immediately, ensuring that operation and maintenance data and alarms are obtained promptly.
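The specification names XML as the carrier and TCP as the transport but gives no message layout. The sketch below is therefore purely illustrative: the element and attribute names (`template`, `param`, `hostType`, `threshold`) and the 4-byte length prefix used for framing on the TCP stream are assumptions of this example, not part of the described system.

```python
import struct
import xml.etree.ElementTree as ET


def encode_template(host_type: str, thresholds: dict) -> bytes:
    """Serialize a template to XML and prepend a 4-byte big-endian length."""
    root = ET.Element("template", {"hostType": host_type})
    for name, value in thresholds.items():
        ET.SubElement(root, "param", {"name": name, "threshold": repr(value)})
    payload = ET.tostring(root, encoding="utf-8")
    return struct.pack("!I", len(payload)) + payload


def decode_template(frame: bytes):
    """Parse one framed message back into (host_type, thresholds)."""
    (length,) = struct.unpack("!I", frame[:4])
    root = ET.fromstring(frame[4:4 + length])
    thresholds = {
        p.get("name"): float(p.get("threshold")) for p in root.iter("param")
    }
    return root.get("hostType"), thresholds
```

A frame produced by `encode_template` could be written to a connected TCP socket with `sendall`; the length prefix lets the receiver know where one template ends on the byte stream.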
[0070] Further, generation of the automatic alarm template is periodic: it is recalculated from a larger data sample only once every period T, which avoids the load of repeated calculation.
[0071] Further, after the master node server sends the alarm template to the slave node server, T is automatically reset to zero.
[0072] Further, the master node server generally does not actively connect to the slave node server; only when the operation and maintenance parameters need to be changed does the master node server actively send signaling to the slave node server.
[0073] Step 370: the operation and maintenance platform reports the alarm template to the client.
[0074] In the server monitoring method of this embodiment, the master node server generates an alarm parameter value from the data reported by the slave node server and received multiple times within a predetermined time period; when the alarm parameter value is less than the abnormal boundary value, the master node server reports to the operation and maintenance platform for an alarm. The master node server also automatically generates the alarm template and automatically updates the template parameters, which simplifies operation.
[0075] Referring to FIG. 4, a third embodiment of the present invention further provides a server monitoring method. In the third embodiment, the server monitoring method is a further improvement on the first and second embodiments, differing only in that the following steps are performed before step 110 or step 310:
[0076] Step 410: The operation and maintenance platform receives the operation and maintenance template configured at the client.
[0077] Specifically, the operator configures the operation and maintenance template through the client, and the client uploads the template to the operation and maintenance platform, which receives it.
[0078] Templates for the slave node servers can be configured manually or automatically on the master node server. Monitored hosts of the same type can share one monitoring template, which contains the monitoring parameters that this type of server requires. Different types of monitored hosts are given different operation and maintenance templates. A template specifies the operation and maintenance parameters of the monitored host, including hardware performance parameters and parameters of the deployed software services. The template parameter settings specify whether the monitored host's operation and maintenance data is reported, including the alarm thresholds and whether historical operation and maintenance data is stored.
[0079] Manually configured templates on the master node server can be inherited. When a different server needs to be added and it differs only slightly from an existing template, the new template can inherit most parameters of the original, and only a few parameters need to be modified to generate a new sub-template. Templates inherit in a tree structure.
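Tree-structured template inheritance can be sketched in Python as below. The class `Template`, its methods, and every parameter name and value used with it are hypothetical; the source describes the inheritance behavior but no data model.

```python
class Template:
    """A monitoring template that may inherit parameters from a parent."""

    def __init__(self, name, params=None, parent=None):
        self.name = name
        self.own = dict(params or {})   # parameters set on this template only
        self.parent = parent            # None for the root of the tree

    def resolve(self):
        """Return effective parameters: inherited values overridden by own ones."""
        merged = self.parent.resolve() if self.parent else {}
        merged.update(self.own)
        return merged

    def derive(self, name, **overrides):
        """Create a sub-template that changes only the given parameters."""
        return Template(name, overrides, parent=self)
```

A sub-template stores only the parameters it changes, so a later edit to the parent propagates to every descendant that has not overridden that parameter.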
[0080] Besides manual configuration, operation and maintenance template parameters can also be derived automatically from historical operation and maintenance data to produce a new template, so as to optimize the parameter configuration.
[0081] Step 420: The operation and maintenance platform saves the operation and maintenance template to the master node server.
[0082] Specifically, when the operation and maintenance template is configured at the client, it is first determined whether a sub-template corresponding to an edge node server is to be used. If so, the sub-template parameters are configured by inheriting the main template's parameters. If not, the parameters of the operation and maintenance template are saved directly to the master node server.
[0083] Further, when operation and maintenance policy parameters need to be modified or optimized, the operation and maintenance template is modified directly on the operation and maintenance platform. Once the instruction has been sent to the master node server, the master node server communicates directly with the slave node server, and the modified template parameters take effect immediately.
[0084] In the server monitoring method of this embodiment, the operation and maintenance platform configures the operation and maintenance template and saves it to the master node server. Using templates gives servers of the same kind consistent operation and maintenance handling, and template inheritance flexibly accommodates differences in operation and maintenance parameters among servers of the same type.
[0085] Referring to FIG. 5, a fourth embodiment of the present invention provides a server monitoring system. The system includes an operation and maintenance platform 510, and a client 520 and a master node server 530 that are each connected to the operation and maintenance platform 510. The master node server 530 is communicatively connected to at least one slave node server 540. Each slave node server 540 corresponds to one monitored host (not shown).
[0086] The master node server 530 is configured to select, from the operation and maintenance template, the template parameters corresponding to a monitored host, and to send those template parameters to the slave node server 540 corresponding to that monitored host.
[0087] Specifically, in the operation and maintenance platform 510, one master node server 530 is deployed independently, together with multiple monitored hosts. Each monitored host acts as a monitored end, and a slave node server 540 is deployed independently on each monitored host and connects to the master node server 530, so that it responds immediately to operation and maintenance parameter instructions delivered by the master node server 530 and reports operation and maintenance data and alarms immediately.
[0088] The master node server 530 is configured with an operation and maintenance template that holds template parameters for each monitored host. The master node server 530 selects the template parameters corresponding to a monitored host and sends them to the slave node server 540 that monitors that host.
[0089] Further, the template parameters include at least: central processing unit (CPU) usage, memory usage, network input/output (I/O), disk I/O, remaining disk space, number of network connections, listening status of important service ports, and the various parameters of the deployed dedicated software itself.
[0090] Further, because servers are diverse, different types of slave node servers 540 need to monitor different parameters. For example, a streaming media forwarding server only needs to track CPU usage, memory usage, network I/O, number of network connections, and the service's own business parameters, whereas a storage server has a different focus: network I/O, disk I/O, remaining disk space, and its own business parameters.
[0091] Further, the monitored hosts of backbone nodes and the monitored hosts of edge nodes need to monitor different parameters.
[0092] Further, the data of the corresponding monitored host is acquired through the slave node server 540.
[0093] Further, the slave node server 540 and the master node server 530 share a unified interface protocol to ensure consistent communication.
[0094] The slave node server 540 is configured to compare the data generated by the monitored host with the template parameters and, when the data generated by the monitored host meets the template parameters, to report the data to the master node server 530.
[0095] Specifically, the slave node server 540 acquires the data of the corresponding monitored host and compares the acquired data with the template parameters. When the data meets the template parameters, abnormal data has appeared, and the slave node server 540 reports the monitored host's operation and maintenance data to the master node server 530.
[0096] For example, as shown in FIG. 2, the equipment room template parameters include CPU usage, memory usage, network I/O, disk I/O, remaining disk space, number of network connections, status of important service ports, and so on. The alarm thresholds set for these operation and maintenance parameters are a1-a6.
[0097] The monitored hosts in the equipment room include an intelligent analysis server, a storage server, and a streaming media forwarding server. The template parameters of the intelligent analysis server include CPU usage and memory usage, with alarm thresholds b1-b2. The template parameters of the storage server include CPU usage, memory usage, network I/O, disk I/O, remaining disk space, and network connection parameters, with alarm thresholds c1-c6. The template parameters of the streaming media forwarding server include CPU usage, memory usage, network I/O, and number of network connections, with alarm thresholds d1-d4. The streaming media forwarding server further includes edge-node servers: a core network streaming media forwarding server, a backbone network streaming media forwarding server, and an edge streaming media forwarding server. The template parameters of the core network streaming media forwarding server include CPU usage, memory usage, network I/O, and number of network connections, with alarm thresholds d11-d14. The template parameters of the backbone network streaming media forwarding server include CPU usage, memory usage, network I/O, and number of network connections, with alarm thresholds d21-d24. The template parameters of the edge streaming media forwarding server include CPU usage, memory usage, network I/O, and number of network connections, with alarm thresholds d31-d34. In this example, relative to the edge-node servers, the streaming media forwarding server is the backbone-node server.
[0098] If the template parameter d1 of the backbone node's streaming media forwarding server is set to 40%, then when the CPU usage that the slave node server 540 acquires from that server exceeds 40%, an alarm is generated and the data is reported to the master node server 530. The template parameter d31 of the edge streaming media forwarding server can instead be set to 30%, so that when the CPU usage that the slave node server 540 acquires from the edge streaming media forwarding server exceeds 30%, an alarm is generated and the data is reported to the master node server 530. In other words, when an edge node server is configured on the basis of the backbone node server, it suffices to inherit most of the backbone node's template parameters into the edge node's template. The template parameters of other servers need no change, which minimizes the workload.
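The slave-side comparison in this example can be sketched as follows, using the 40% and 30% CPU thresholds from the text. The function names, the dictionary layout, and the host names are hypothetical, and the sketch treats every threshold as an upper bound, which would not suit a parameter such as remaining disk space.

```python
def exceeded(sample, thresholds):
    """Return the parameters whose sampled value is above its alarm threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }


def check_and_report(host, sample, thresholds, report):
    """Call report(host, sample) when any threshold is exceeded."""
    hits = exceeded(sample, thresholds)
    if hits:
        report(host, sample)
    return hits


# Thresholds from the example: d1 = 40% (backbone), d31 = 30% (edge).
BACKBONE = {"cpu_usage": 40.0}
EDGE = {"cpu_usage": 30.0}
```

With these thresholds, a CPU usage of 35% is within bounds on the backbone server but triggers a report on the edge server.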
[0099] The master node server 530 is further configured to report the data to the operation and maintenance platform 510.
[0100] Specifically, the master node server 530 reports the received data to the operation and maintenance platform 510, which in turn reports the data to the client 520 so that the operator can handle the data through the client 520.
[0101] The master node server 530 is further configured to generate an alarm parameter value from the data it receives from the slave node server 540 multiple times within a predetermined time period, to determine whether the alarm parameter value is less than the abnormal boundary value, and, if so, to report to the operation and maintenance platform 510 to raise an alarm.
[0102] Specifically, the operation and maintenance parameter value X acquired by the slave node server 540 can be treated, as an abstraction, as approximately Gaussian. (If the parameter's curve is asymmetric, log X can be used in place of X so that the curve is closer to a Gaussian distribution.) A template can then be generated automatically from the historical operation and maintenance data to decide whether an alarm needs to be reported.
[0103] Suppose that within a time period T an operation and maintenance parameter x_i produces t records {x(1), x(2), ..., x(t)}, and that j kinds of operation and maintenance parameters in total take part in the calculation.
[0104] The mean u_j is taken over the run:
Figure imgf000013_0001
[0105] The standard deviation σ_j is taken:
Figure imgf000014_0001
[0106] A Gaussian function is established for the alarm parameter value:
Figure imgf000014_0002
[0107] The abnormal boundary value s is determined from n historical normal samples reported by the slave node server 540 to the master node server 530 (with all abnormal values in the sampling period removed):
[0108] (equation for s; not legible in the source text)
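The equation images for the mean, the standard deviation, the Gaussian function, and the boundary value s are not reproduced in this text. Assuming the standard per-parameter Gaussian model, they could take the following form. Here x_j^{(i)} denotes the i-th record of parameter j, and J denotes the number of parameters (the text uses j for both the index and the count). The 1/t normalization, the product over parameters, and the choice of s as the minimum over the n normal samples are assumptions, not confirmed by the source.

```latex
u_j = \frac{1}{t}\sum_{i=1}^{t} x_j^{(i)}, \qquad
\sigma_j = \sqrt{\frac{1}{t}\sum_{i=1}^{t}\left(x_j^{(i)} - u_j\right)^2}

f(x) = \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi}\,\sigma_j}
       \exp\!\left(-\frac{\left(x_j - u_j\right)^2}{2\sigma_j^2}\right)

s = \min_{1 \le i \le n} f\!\left(x^{(i)}\right)
```

Under this reading, f(x) is large for records close to the historical means and small for outliers, which matches the alarm rule that a value of f(x) below s indicates an abnormality.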
[0109] It is determined whether the alarm parameter value f(x) is less than the abnormal boundary value s. If so, a report is made to the operation and maintenance platform 510 to raise an alarm.
[0110] More specifically, when f(x) < s, the master node server 530 can determine that the monitored machine is abnormal, and it reports to the operation and maintenance platform 510 to raise an alarm.
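The alarm decision f(x) < s can be sketched in Python as below. Because the source's equations survive only as image references, this sketch assumes the common formulation: a per-parameter mean and population standard deviation, f(x) as the product of independent Gaussian densities, and s as the smallest density among the normal samples. All function names are hypothetical, and each parameter is assumed to have a nonzero standard deviation.

```python
import math


def fit(samples):
    """Return per-parameter (mean, std) from a list of equal-length records."""
    t = len(samples)
    stats = []
    for column in zip(*samples):
        mean = sum(column) / t
        var = sum((v - mean) ** 2 for v in column) / t
        stats.append((mean, math.sqrt(var)))
    return stats


def density(record, stats):
    """Product of independent Gaussian densities, one per parameter."""
    p = 1.0
    for value, (mean, std) in zip(record, stats):
        exponent = -((value - mean) ** 2) / (2 * std ** 2)
        p *= math.exp(exponent) / (math.sqrt(2 * math.pi) * std)
    return p


def boundary(normal_samples, stats):
    """Assumed rule: s is the smallest density among the normal samples."""
    return min(density(r, stats) for r in normal_samples)


def is_abnormal(record, stats, s):
    """Alarm condition from the text: f(x) < s."""
    return density(record, stats) < s
```

For a skewed parameter, the text's suggestion of using log X could be applied by transforming each value with `math.log` before calling `fit` and `density`.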
[0111] The master node server 530 is further configured to automatically generate an alarm template, deliver it to the slave node server 540, and report it to the operation and maintenance platform 510.
[0112] Specifically, the alarm template is generated automatically from the alarm data reported by the slave node server 540, delivered to the slave node server 540, and reported to the operation and maintenance platform 510.
[0113] Further, the master node server 530 can deliver the template directly to the slave node server 540 over TCP, with XML as the data format, and the template takes effect immediately. This keeps the acquisition of operation and maintenance data and alarms timely.
[0114] Further, the automatic alarm template is generated periodically: it is recalculated from a larger sample only once every period T, which avoids the load of frequent recalculation.
[0115] Further, after the master node server 530 delivers the alarm template to the slave node server 540, T is automatically reset to zero.
[0116] Further, the master node server 530 generally does not initiate connections to the slave node server 540. Only when operation and maintenance parameters need to change does the master node server 530 actively send signaling to the slave node server 540.
[0117] The operation and maintenance platform 510 is configured to report the alarm template to the client 520.
[0118] The operation and maintenance platform 510 is further configured to receive the operation and maintenance template configured at the client 520 and to save the operation and maintenance template to the master node server 530.
[0119] Specifically, the operator configures the operation and maintenance template through the client 520, and the client 520 uploads the template to the operation and maintenance platform 510, which receives it.
[0120] Templates for the slave node servers 540 can be configured manually or automatically on the master node server 530. Monitored hosts of the same type can share one monitoring template, which contains the monitoring parameters that this type of server requires. Different types of monitored hosts are given different operation and maintenance templates. A template specifies the operation and maintenance parameters of the monitored host, including hardware performance parameters and parameters of the deployed software services. The template parameter settings specify whether the monitored host's operation and maintenance data is reported, including the alarm thresholds and whether historical operation and maintenance data is stored.
[0121] Manually configured templates on the master node server 530 can be inherited. When a different server needs to be added and it differs only slightly from an existing template, the new template can inherit most parameters of the original, and only a few parameters need to be modified to generate a new sub-template. Templates inherit in a tree structure.
[0122] Besides manual configuration, operation and maintenance template parameters can also be derived automatically from historical operation and maintenance data to produce a new template, so as to optimize the parameter configuration.
[0123] When the operation and maintenance template is configured at the client 520, it is first determined whether a sub-template corresponding to an edge node server is to be used. If so, the sub-template parameters are configured by inheriting the main template's parameters. If not, the parameters of the operation and maintenance template are saved directly to the master node server 530.
[0124] Further, when operation and maintenance policy parameters need to be modified or optimized, the operation and maintenance template is modified directly on the operation and maintenance platform 510. Once the instruction has been sent to the master node server 530, the master node server 530 communicates directly with the slave node server 540, and the modified template parameters take effect immediately.
[0125] In the server monitoring system of this embodiment, the master node server 530 selects, from the operation and maintenance template, the template parameters corresponding to a monitored host and sends them to the slave node server 540 corresponding to that host. The slave node server 540 compares the data generated by the monitored host with the template parameters and, when the data meets the template parameters, reports the data to the master node server 530, which reports it to the operation and maintenance platform 510. This reduces the complexity of acquiring operation and maintenance parameters from different types of servers. The master-slave deployment unifies operation and maintenance management of the system, templates give servers of the same kind consistent handling, and template inheritance flexibly accommodates differences in operation and maintenance parameters among servers of the same type.
[0126] It should be noted that, in this document, the term "comprise" or any variant thereof is intended to cover a non-exclusive inclusion, so that a process, method, article, or system that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
[0127] The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
[0128] From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present invention.
[0129] The above are only preferred embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Industrial Applicability
[0130] In the server monitoring method and system proposed by the present invention, the master node server selects, from the operation and maintenance template, the template parameters corresponding to a monitored host and sends them to the slave node server corresponding to that host. The slave node server compares the data generated by the monitored host with the template parameters and, when the data meets the template parameters, reports the data to the master node server, which reports it to the operation and maintenance platform. This reduces the complexity of acquiring operation and maintenance parameters from different types of servers. The master-slave deployment unifies operation and maintenance management of the system, templates give servers of the same kind consistent handling, and template inheritance flexibly accommodates differences in operation and maintenance parameters among servers of the same type.

Claims

一种服务器监控方法, 所述方法包括: A server monitoring method, the method comprising:
主节点服务器在运维模板中选择与被监控主机对应的模板参数, 并将 所述模板参数发送至与所述被监控主机对应的从节点服务器; 所述从节点服务器根据所述被监控主机产生的数据与所述模板参数进 行比对, 当所述被监控主机产生的数据符合所述模板参数吋, 则所述 从节点服务器上报所述数据至所述主节点服务器; The master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, and sends the template parameter to the slave node server corresponding to the monitored host; the slave node server generates according to the monitored host. The data is compared with the template parameter, and when the data generated by the monitored host meets the template parameter, the slave node server reports the data to the master node server;
所述主节点服务器将所述数据上报至运维平台。 The master node server reports the data to the operation and maintenance platform.
根据权利要求 1所述的服务器监控方法, 其中, 所述方法还包括: 所述主节点服务器将在预定吋间段内多次接收从节点服务器上报的数 据生成告警参数值, 并判断所述告警参数值是否小于异常边界值, 若 是, 则所述主节点服务器上报所述运维平台进行告警。 The server monitoring method according to claim 1, wherein the method further comprises: the primary node server receiving the data reported from the node server for generating the alarm parameter value in the predetermined interval, and determining the alarm. If the parameter value is smaller than the abnormal boundary value, the master node server reports the operation and maintenance platform to perform an alarm.
根据权利要求 2所述的服务器监控方法, 其中, 所述方法还包括: 所述主节点服务器自动生成告警模板, 并下发到所述从节点服务器、 以及上报至所述运维平台; The server monitoring method according to claim 2, wherein the method further comprises: the master node server automatically generating an alarm template, and transmitting the template to the slave node server and reporting to the operation and maintenance platform;
所述运维平台将所述告警模板上报至用户端。 The operation and maintenance platform reports the alarm template to the user end.
根据权利要求 1所述的服务器监控方法, 其中, 在所述主节点服务器 在运维模板中选择与被监控主机对应的模板参数之前, 所述方法还包 括: The server monitoring method according to claim 1, wherein before the master node server selects a template parameter corresponding to the monitored host in the operation and maintenance template, the method further includes:
所述运维平台接收用户端配置的运维模板; The operation and maintenance platform receives an operation and maintenance template configured by the user end;
所述运维平台将所述运维模板保存至所述主节点服务器。 The operation and maintenance platform saves the operation and maintenance template to the primary node server.
根据权利要求 1-4任一项所述的服务器监控方法, 其中, 所述模板参 数包括中央处理器占用率, 内存使用、 网络输入 /输出、 硬盘输入 /输 出、 硬盘剩余空间、 网络连接数、 重要服务端口监听情况、 以及部署 的专用软件的自身各类参数使用情况。 The server monitoring method according to any one of claims 1 to 4, wherein the template parameters include a central processor occupancy rate, memory usage, network input/output, hard disk input/output, remaining space of the hard disk, number of network connections, Important service port listening conditions, as well as the use of various parameters of the deployed dedicated software.
一种服务器监控系统, 所述系统包括主节点服务器、 至少一个被监控 主机、 与所述监控主机对应的从节点服务器、 以及运维平台, 其中, 所述主节点服务器, 设置为在运维模板中选择与所述被监控主机对应 的模板参数, 并将所述模板参数发送至与所述被监控主机对应的所述 从节点服务器; A server monitoring system, the system includes a master node server, at least one monitored host, a slave node server corresponding to the monitoring host, and an operation and maintenance platform, wherein the master node server is set to be in an operation and maintenance template Selecting to correspond to the monitored host a template parameter, and sending the template parameter to the slave node server corresponding to the monitored host;
所述从节点服务器, 设置为根据所述被监控主机产生的数据与所述模 板参数进行比对, 当所述被监控主机产生的数据符合所述模板参数吋 , 则上报所述数据至所述主节点服务器;  The slave node server is configured to compare the data generated by the monitored host with the template parameter, and when the data generated by the monitored host meets the template parameter, report the data to the Primary node server;
所述主节点服务器, 还设置为将所述数据上报至运维平台。  The primary node server is further configured to report the data to the operation and maintenance platform.
[Claim 7] The server monitoring system according to claim 6, wherein the master node server is further configured to generate an alarm parameter value from the data reported by the slave node server and received multiple times within a predetermined time period, and to determine whether the alarm parameter value is less than an abnormal boundary value; if so, the master node server reports to the operation and maintenance platform so that an alarm is raised.
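As a hedged illustration of the alarm check in claim 7, the sketch below keeps the reports received within a sliding time window, derives an alarm parameter value from them, and compares it with the abnormal boundary value. The claim does not say how the alarm parameter value is computed; taking the arithmetic mean, requiring at least two reports, and the `AlarmEvaluator` name are assumptions. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from statistics import mean
from typing import Deque, Optional, Tuple


class AlarmEvaluator:
    """Hypothetical master-side check over a predetermined time period."""

    def __init__(self, window_seconds: float, boundary: float) -> None:
        self.window_seconds = window_seconds
        self.boundary = boundary  # abnormal boundary value
        self._reports: Deque[Tuple[float, float]] = deque()

    def add_report(self, timestamp: float, value: float) -> None:
        self._reports.append((timestamp, value))
        # Discard reports that fall outside the predetermined time period.
        while timestamp - self._reports[0][0] > self.window_seconds:
            self._reports.popleft()

    def alarm_parameter_value(self) -> Optional[float]:
        # "Received multiple times" is read here as at least two reports.
        if len(self._reports) < 2:
            return None
        return mean(value for _, value in self._reports)

    def should_alarm(self) -> bool:
        # The claim raises the alarm when the value is LESS THAN the boundary.
        value = self.alarm_parameter_value()
        return value is not None and value < self.boundary


# Example metric: remaining hard disk space in GB, boundary of 10 GB.
evaluator = AlarmEvaluator(window_seconds=60, boundary=10.0)
evaluator.add_report(0, 12.0)
evaluator.add_report(20, 9.0)
evaluator.add_report(40, 6.0)
alarm_now = evaluator.should_alarm()
```

A "less than" comparison fits metrics where low values are abnormal, such as remaining disk space; a metric like CPU utilization would need the comparison inverted, which the claim as worded does not cover.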
[Claim 8] The server monitoring system according to claim 7, wherein the master node server is further configured to automatically generate an alarm template, send it to the slave node server, and report it to the operation and maintenance platform;
the operation and maintenance platform is configured to report the alarm template to the user end.
[Claim 9] The server monitoring system according to claim 6, wherein the operation and maintenance platform is further configured to receive an operation and maintenance template configured by the user end and to save the operation and maintenance template to the master node server.
[Claim 10] The server monitoring system according to any one of claims 6 to 9, wherein the template parameters include CPU utilization, memory usage, network input/output, hard disk input/output, remaining hard disk space, number of network connections, listening status of important service ports, and usage of the various parameters of the deployed dedicated software.
PCT/CN2017/085437 2017-05-23 2017-05-23 Server monitoring method and system WO2018214009A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/085437 WO2018214009A1 (en) 2017-05-23 2017-05-23 Server monitoring method and system

Publications (1)

Publication Number Publication Date
WO2018214009A1 (en) 2018-11-29

Family

ID=64396024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/085437 WO2018214009A1 (en) 2017-05-23 2017-05-23 Server monitoring method and system

Country Status (1)

Country Link
WO (1) WO2018214009A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100156623A1 (en) * 2008-12-24 2010-06-24 At&T Intellectual Property I, L.P. Method and Apparatus for Network Service Assurance
CN104010028A (en) * 2014-05-04 2014-08-27 华南理工大学 Dynamic virtual resource management strategy method for performance weighting under cloud platform
CN104935464A (en) * 2015-06-12 2015-09-23 北京奇虎科技有限公司 Fault predicting method of website system and device
CN105323111A (en) * 2015-11-17 2016-02-10 南京南瑞集团公司 Operation and maintenance automation system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17911323

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17911323

Country of ref document: EP

Kind code of ref document: A1